Datasets are at the core of performant AI systems. We use our industry experience to create and share datasets that directly enable users to solve their business problems.


GermanQuAD builds on lessons from existing datasets and on our experience labeling data for enterprise customers. It combines the strengths of SQuAD with self-sufficient questions, each of which contains all the information needed for open-domain QA. The dataset is human-labeled and contains 13,722 questions and answers.



GermanDPR is a document retrieval dataset following the format used in the Dense Passage Retrieval paper. It comprises 9,275 question/answer pairs in the training set and 1,025 pairs in the test set.
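
To make the retrieval format more concrete, here is a minimal sketch of what a single record might look like and how it could be turned into training triples. The field names (`question`, `answers`, `positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`, `title`, `text`, `passage_id`) are assumed from the JSON layout distributed with the DPR codebase and should be checked against the actual GermanDPR files. The record content and the `build_triples` helper are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record. The field names mirror the JSON layout distributed with
# the DPR codebase (an assumption here); the text is invented for illustration.
record: Dict = {
    "question": "Wie hoch ist die Zugspitze?",
    "answers": ["2962 Meter"],
    "positive_ctxs": [
        {
            "title": "Zugspitze",
            "text": "Die Zugspitze ist mit 2962 Metern der höchste Berg Deutschlands.",
            "passage_id": "",
        }
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [
        {
            "title": "Watzmann",
            "text": "Der Watzmann ist ein Bergmassiv in den Berchtesgadener Alpen.",
            "passage_id": "",
        }
    ],
}


def build_triples(rec: Dict) -> List[Tuple[str, str, str]]:
    """Pair the question with every (positive, hard negative) passage combination."""
    question = rec["question"]
    return [
        (question, pos["text"], neg["text"])
        for pos in rec["positive_ctxs"]
        for neg in rec["hard_negative_ctxs"]
    ]


triples = build_triples(record)
for question, positive, hard_negative in triples:
    print(question, "|", positive, "|", hard_negative)
```

Each triple gives a retriever one passage to rank high and one deliberately confusable passage to rank low for the same question, which is the usual way such records are consumed during training.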



COVID-QA is a question answering dataset consisting of 2,019 question/answer pairs annotated by volunteer biomedical experts on scientific articles related to COVID-19.

Explore on Hugging Face's Model Hub