The following topics are available as Bachelor's or Master's theses. Please get in touch if you are interested.

1. Recall-aware Information Extraction

Textual information extraction is a core step in the construction of many knowledge bases. In most cases, extracted facts are accompanied by a precision estimate, i.e., a confidence that they are correct. The recall of the extraction is then usually controlled only indirectly, by adjusting the precision threshold at which facts are accepted.

In this project, the goal is to learn a recall value for given entities and properties directly during the extraction process. To this end, the focus would be on facts that appear in corpora as enumerations or are connected with “and”, and can thus be extracted jointly. The following two sentences illustrate the difference:


  • John has two children, Bob and Mary → A typical way to name ALL children.
  • John brought his children Bob and Mary to school → There could reasonably be other children who are, e.g., too old or too young to be brought to school.

The technical work would consist of extending an existing information extraction system such as ClausIE so that it attaches a recall value to sets of facts, and training it using distant supervision.
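As an illustration of how distant supervision could yield recall labels for training, the sketch below compares the objects named in a sentence against the objects a knowledge base lists for the same subject and property. The function name `recall_label` and the toy data are hypothetical, and the sketch assumes the knowledge base is complete for the entity in question, which is a strong assumption in practice.

```python
def recall_label(extracted, kb_objects):
    """Fraction of the KB's objects for one (subject, property) pair named in a sentence.

    Assumption: kb_objects is complete for this subject and property.
    """
    kb = set(kb_objects)
    if not kb:
        return None  # no reference facts, so no label can be derived
    return len(kb & set(extracted)) / len(kb)


# Toy reference facts (hypothetical): the children of John according to a KB.
kb_children = {"Bob", "Mary", "Alice"}

# A sentence naming all three children versus one naming only two of them.
full = recall_label(["Bob", "Mary", "Alice"], kb_children)
partial = recall_label(["Bob", "Mary"], kb_children)
```

Sentences paired with such labels could then serve as training examples for a model that predicts recall from the wording alone.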

2. Metrics for Relative Completeness

Knowing how complete an entity in a knowledge base is matters to both editors and consumers. The Recoin tool [1] attempts to quantify the completeness of an entity relative to other entities, but its computation model is based on a very simple counting scheme, which gives unexpected results for a large number of entities.
The goal of this project is the development and evaluation of metrics for quantifying the completeness of knowledge base entities. The technical work would include the use of property ranking techniques [2] and entity similarity metrics [3].
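To make the idea of relative completeness concrete, here is a minimal sketch of one possible counting-based score: the share of the most frequent properties among a set of peer entities that the entity itself has. This is an illustrative baseline only, not Recoin's actual computation; the function name, the `top_k` parameter and the toy data are assumptions, and the peer set is assumed to come from some entity similarity metric.

```python
from collections import Counter


def relative_completeness(entity_props, peer_props, top_k=5):
    """Share of the top_k most frequent peer properties that the entity has."""
    # Count in how many peers each property occurs (each peer counts once).
    counts = Counter(p for props in peer_props for p in set(props))
    expected = [p for p, _ in counts.most_common(top_k)]
    if not expected:
        return 1.0  # no peers, so nothing is expected to be present
    present = sum(1 for p in expected if p in entity_props)
    return present / len(expected)


# Toy peers (hypothetical): property sets of three similar entities.
peers = [
    {"date of birth", "occupation", "place of birth"},
    {"date of birth", "occupation"},
    {"date of birth", "occupation", "place of birth", "award received"},
]
entity = {"date of birth", "place of birth"}

score = relative_completeness(entity, peers, top_k=3)
```

A property ranking technique could replace the raw frequency count when choosing the expected properties, and the choice of peers is where an entity similarity metric would enter.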

3. Multilingual Coverage of Wikipedia

Wikipedia pages in different languages about the same entity often vary widely in size and content. The goal of this project is to quantify and qualify these differences, and to visualize them via a user script.

On the technical level, the idea is to pursue two approaches: (1) multilingual topic modelling, to discover topics that are covered more or less extensively in one article than in the other, and (2) interlinking, which can be further structured based on the information available in Wikidata. The results should be turned into a Wikipedia plugin, similar to Recoin.
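For the interlinking approach, a simple starting point could be to compare the sets of entities that two language versions of an article link to, once the links are mapped to language-independent identifiers. The sketch below assumes that this mapping has already been done; the function name `link_coverage` is hypothetical, and the identifiers are placeholders in Wikidata's Q-number style that are not meant to denote real items.

```python
def link_coverage(links_a, links_b):
    """Compare two sets of linked entity identifiers from two language versions."""
    shared = links_a & links_b
    union = links_a | links_b
    return {
        # Jaccard overlap: 1.0 means both versions link to the same entities.
        "jaccard": len(shared) / len(union) if union else 1.0,
        "only_in_a": links_a - links_b,
        "only_in_b": links_b - links_a,
    }


# Placeholder identifiers (hypothetical), one set per language version.
links_en = {"Q1", "Q2", "Q3"}
links_de = {"Q2", "Q3", "Q4", "Q5"}

result = link_coverage(links_en, links_de)
```

The two difference sets are what a user script could display, for instance grouped by the Wikidata type of the linked entities.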

4. Recommender System for Gliding

Gliding is both a recreational and competitive sport. On good days, glider pilots can stay in the air for 8 hours or more, during which they can cover significant distances (800 km and more). Most glider pilots upload their competitive flights to an online platform, where flights are listed and ranked daily using points based on the covered distance and the performance of the aircraft.

To score high on a given day, glider pilots have to carefully choose a task (flight route) that, given their plane, skills and the weather conditions, allows them to cover the maximal distance. Each of these three factors can make a huge difference: overestimating the weather conditions may lead to not completing the task, and it is not uncommon for experienced pilots to cover twice the distance of beginners, or more.
The goal of this project is to develop a prototype of a gliding task recommendation system that takes the factors mentioned above into account. The core component of the prototype will be a similarity function for tasks, which will then be used in a standard recommender systems framework (e.g., collaborative filtering or content-based filtering).
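A minimal sketch of what a task similarity function and a content-based ranking on top of it might look like is given below. The `Task` fields, the equal weighting of the two components and the toy tasks are all assumptions made for illustration; weather, pilot skill and aircraft performance are deliberately left out and would have to be added in the actual project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    distance_km: float      # assumed to be positive
    turnpoints: frozenset   # names of the turnpoints on the route


def task_similarity(a, b):
    """Mean of turnpoint Jaccard overlap and distance ratio, in [0, 1]."""
    union = a.turnpoints | b.turnpoints
    overlap = len(a.turnpoints & b.turnpoints) / len(union) if union else 1.0
    ratio = min(a.distance_km, b.distance_km) / max(a.distance_km, b.distance_km)
    return 0.5 * overlap + 0.5 * ratio


def recommend(flown, candidates, top_n=1):
    """Rank candidates by their best similarity to any task the pilot has flown."""
    if not flown:
        return []  # no flight history, so nothing to compare against
    scored = [(max(task_similarity(c, f) for f in flown), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]


# Toy data (hypothetical): one flown task and two candidate tasks.
flown = [Task("A", 300.0, frozenset({"T1", "T2"}))]
candidates = [
    Task("B", 320.0, frozenset({"T1", "T2", "T3"})),
    Task("C", 800.0, frozenset({"T7", "T8"})),
]

best = recommend(flown, candidates)
```

Ranking by similarity to previously flown tasks corresponds to content-based filtering; a collaborative variant would instead use the same similarity to find pilots with comparable flight histories.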

5. Exploiting existential information

Existential information, i.e., knowledge about the number of facts that hold in reality (e.g., MPII has 5 departments), is a recent addition to knowledge bases, which classically focus on facts that link entities (e.g., D5 is a department of MPII). The goal of this work is to exploit existential information as derived by [1] in some part of the KB lifecycle, i.e., either information extraction, KB consolidation, or question answering.
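One simple way existential information could be used during KB consolidation is to compare a stated count against the facts the KB actually contains. The sketch below is an assumption about how such a check might look, not the method of [1]; the function name and the department identifiers are hypothetical.

```python
def completeness_from_count(known_objects, stated_count):
    """Return (estimated recall, number of missing facts) for one subject and property."""
    if stated_count <= 0:
        return None  # a non-positive count gives no usable reference
    known = len(set(known_objects))
    # More known facts than the stated count hints at an error in either source;
    # here the values are simply capped.
    recall = min(known / stated_count, 1.0)
    missing = max(stated_count - known, 0)
    return recall, missing


# Toy example (hypothetical): the KB states a count of 5 departments
# but contains only two department facts.
known_departments = {"D1", "D5"}
recall, missing = completeness_from_count(known_departments, 5)
```

In question answering, the same comparison could be used to flag answers that are likely incomplete.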