Estimating object set sizes with LLMs and species sampling techniques


The size of many real-world sets, such as “Physicists”, “Pasta types”, or “Hard rock bands”, is not easy to determine.

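A survey of approaches can be found in [1]. Most notable among them is species sampling, a technique from ecology in which one repeatedly samples the set and estimates its size from the amount of overlap between samples: the more often the same elements recur across samples, the smaller the share of the set that is likely still unseen.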
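To make the overlap idea concrete, the following is a minimal sketch of one standard species richness estimator, the bias-corrected Chao2 estimator for incidence data. It is chosen here purely for illustration and is not necessarily among the estimators evaluated in [1] or [2]. The function name `chao2_estimate` and the toy samples are assumptions made for this example.

```python
from collections import Counter


def chao2_estimate(samples):
    """Bias-corrected Chao2 estimate of the total set size.

    `samples` is a list of sets, one per sampling round.
    """
    m = len(samples)
    if m == 0:
        return 0.0
    # Count in how many samples each distinct item occurs.
    incidence = Counter(item for sample in samples for item in set(sample))
    observed = len(incidence)
    # q1: items seen in exactly one sample; q2: items seen in exactly two.
    q1 = sum(1 for count in incidence.values() if count == 1)
    q2 = sum(1 for count in incidence.values() if count == 2)
    # Many one-off items relative to repeated ones suggest many unseen items.
    unseen = ((m - 1) / m) * q1 * (q1 - 1) / (2 * (q2 + 1))
    return observed + unseen


# Toy data (invented for illustration): three samples of "Pasta types".
samples = [
    {"penne", "fusilli", "farfalle", "rigatoni"},
    {"penne", "fusilli", "orzo"},
    {"penne", "farfalle", "linguine"},
]
print(round(chao2_estimate(samples), 2))  # 6.67
```

Here six distinct items are observed, three of them only once, so the estimate lies slightly above the observed count. When every item recurs in several samples, the correction term vanishes and the estimate equals the number of observed items.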

A critical issue is how to obtain such samples. Luggen et al. [2] used edit logs from Wikidata for this purpose, with somewhat underwhelming results. Since then, LLMs have emerged as a prominent tool for knowledge extraction, and multiple samples are easy to obtain from an LLM by issuing the same prompt repeatedly at a higher sampling temperature, so that the outputs vary between runs.
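The repeated-prompting step could be sketched as follows. This is only an illustration under stated assumptions: `generate` is a hypothetical callable standing in for whatever LLM client is used (no real API is implied), the model is assumed to answer with one item per line, and `normalize`, `collect_samples`, and `fake_llm` are names invented for this example.

```python
import itertools
import re


def normalize(name):
    """Strip list markers and extra whitespace, and lowercase, so duplicates match."""
    name = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", name)
    return re.sub(r"\s+", " ", name).strip().lower()


def collect_samples(generate, prompt, n_samples, temperature=1.0):
    """Query the model n_samples times and return one set of items per call.

    `generate(prompt, temperature)` is a hypothetical function returning the
    model's raw text answer, assumed to list one item per line.
    """
    samples = []
    for _ in range(n_samples):
        text = generate(prompt, temperature)
        items = {normalize(line) for line in text.splitlines()}
        items.discard("")  # drop blank lines
        samples.append(items)
    return samples


# Stand-in for a real model: cycles through canned answers (invented data).
_canned = itertools.cycle(["1. Penne\n2. Fusilli", "- penne\n- Orzo\n"])


def fake_llm(prompt, temperature):
    return next(_canned)


llm_samples = collect_samples(fake_llm, "List pasta types, one per line.", 2)
print(llm_samples)
```

Normalizing item names matters because an overlap-based estimator treats differently spelled mentions of the same item as distinct, which would inflate the estimate. Real outputs will likely need more careful entity matching than this lowercase comparison.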

The goal of this thesis is to investigate whether species sampling on LLM outputs is a promising avenue towards set size estimation. The experimental setup can reuse major components from [2], with LLM outputs replacing Wikidata edit logs as input.


[1] Razniewski, Simon, et al. “Completeness, Recall, and Negation in Open-World Knowledge Bases: A Survey.” ACM Computing Surveys (2024).

[2] Luggen, Michael, et al. “Non-parametric class completeness estimators for collaborative knowledge graphs—the case of Wikidata.” The Semantic Web–ISWC 2019.