Eliciting large sets from LLMs: What works and what doesn’t

Description:

Eliciting large sets from LLMs, such as the list of all Physics Nobel Prize winners (201), all US counties (3007), or all species of mammals (~5940), is important for knowledge base construction as well as for analytical queries.
Even though LLMs store considerable knowledge in their parameters [1], context-size constraints and alignment towards moderate-sized chat answers mean that they often refuse to enumerate the complete membership of large sets.

For example, by default, ChatGPT refuses to list all Nobel Prize winners in Physics.

This thesis shall investigate how simple divide-and-conquer techniques can enable the elicitation of large entity sets from LLMs. It should examine set-specific partitioning dimensions (e.g., decade, country), whether these dimensions can be elicited from the LLM itself, and whether they reliably enable obtaining the instances of large sets.
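A minimal sketch of the divide-and-conquer idea is shown below. The `ask_llm` function is a hypothetical placeholder (here stubbed with canned answers rather than a real API call); a real implementation would issue one prompt per facet value and parse the returned enumeration.

```python
def ask_llm(prompt: str) -> list[str]:
    # Placeholder for a real LLM call; stubbed with canned answers
    # (early Physics Nobel laureates) so the sketch is self-contained.
    canned = {
        "1901-1910": ["Wilhelm Conrad Röntgen", "Hendrik Lorentz", "Pieter Zeeman"],
        "1911-1920": ["Wilhelm Wien", "Max Planck"],
    }
    for facet, names in canned.items():
        if facet in prompt:
            return names
    return []

def elicit_set(set_name: str, facets: list[str]) -> set[str]:
    """Query the LLM once per facet value and union the deduplicated results."""
    members: set[str] = set()
    for facet in facets:
        prompt = f"List all {set_name} for {facet}."
        members.update(ask_llm(prompt))
    return members

# Partition the target set by decade and merge the per-decade answers.
decades = ["1901-1910", "1911-1920"]
winners = elicit_set("Nobel Prize winners in Physics", decades)
print(len(winners))  # → 5
```

The same loop works for any partitioning dimension (country, first letter, taxonomic order); the open question the thesis addresses is which dimensions yield complete, non-overlapping partitions.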

The contribution of this thesis would be the development of a benchmark dataset, as well as the design and evaluation of various prompting techniques.
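Evaluation against such a benchmark could use standard set-level metrics, comparing the elicited members against a gold membership list; a possible sketch:

```python
def evaluate(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Set-level precision, recall, and F1 of predicted members vs. a gold list."""
    tp = len(predicted & gold)  # true positives: correctly elicited members
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: 2 of 3 predictions are correct, covering 2 of 4 gold members.
gold = {"a", "b", "c", "d"}
pred = {"a", "b", "x"}
scores = evaluate(pred, gold)
```

In practice, entity-name normalization (aliases, spelling variants) would be needed before the set comparison.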

References:

[1] Singhania, Sneha, Simon Razniewski, and Gerhard Weikum. “Extracting multi-valued relations from language models.” arXiv preprint arXiv:2307.03122 (2023).