Datasets are at the core of modern natural language processing. We use our industry experience to create and share NLP datasets that directly enable users to solve their business problems.

Semantic Answer Similarity

To evaluate semantic answer similarity metrics against purely lexical metrics, we compiled three datasets of answer pairs. Both answers in a pair respond to the same question, and we labeled each pair according to how semantically similar the two answers are. The collection comprises 3,659 annotated answer pairs.
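To illustrate why lexical metrics alone fall short, here is a minimal sketch of two common lexical measures, exact match and token-overlap F1, written in plain Python. The function names and the example answers are made up for illustration and are not taken from the datasets.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Lexical metric: True only if both answers are the same string
    after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction: str, reference: str) -> float:
    """Lexical metric: F1 over whitespace-separated, lowercased tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer pair: same meaning, no shared tokens.
print(token_f1("thirty percent", "30%"))   # 0.0
# Hypothetical answer pair: partial token overlap.
print(token_f1("in Berlin", "Berlin"))     # about 0.67
```

Both measures give the first pair a score of zero even though a human would judge the answers equivalent. Answer pairs with human similarity labels make it possible to check how well a semantic metric handles such cases.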

Explore the dataset


GermanQuAD builds on insights from existing datasets and on our experience labeling data with enterprise customers. It combines the strengths of SQuAD with self-sufficient questions, which contain all the information needed for open-domain question answering. The dataset is human-labeled and contains 13,722 questions and answers.

Explore the dataset


GermanDPR is a document retrieval dataset following the format used in the Dense Passage Retrieval paper. It comprises 9,275 question/answer pairs in the training set and 1,025 pairs in the test set.
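As a rough sketch of what a record in this format could look like, the snippet below builds one hypothetical example and checks it for the expected fields. The field names follow the JSON layout commonly used with the Dense Passage Retrieval code and are an assumption here; verify them against the actual GermanDPR files. The record content and the helper `validate_record` are invented for illustration.

```python
from typing import List

# Assumed top-level keys of a DPR-style training record.
REQUIRED_KEYS = {
    "question",
    "answers",
    "positive_ctxs",
    "negative_ctxs",
    "hard_negative_ctxs",
}
# Assumed keys of each passage inside the *_ctxs lists.
PASSAGE_KEYS = {"title", "text"}


def validate_record(record: dict) -> List[str]:
    """Return a list of human-readable problems; empty means the record looks fine."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - record.keys())]
    for field in ("positive_ctxs", "negative_ctxs", "hard_negative_ctxs"):
        for index, passage in enumerate(record.get(field, [])):
            for key in sorted(PASSAGE_KEYS - passage.keys()):
                problems.append(f"{field}[{index}] missing key: {key}")
    return problems


# Hypothetical record, not taken from the dataset.
example = {
    "question": "Was ist die Hauptstadt von Deutschland?",
    "answers": ["Berlin"],
    "positive_ctxs": [{"title": "Berlin", "text": "Berlin ist die Hauptstadt von Deutschland."}],
    "negative_ctxs": [],
    "hard_negative_ctxs": [{"title": "Bonn", "text": "Bonn liegt am Rhein."}],
}

print(validate_record(example))  # []

# Split sizes as stated in the dataset description.
train_size, test_size = 9275, 1025
print(train_size + test_size)    # 10300
```

A check like this can be useful before training a retriever, since a missing positive or hard negative passage would otherwise only surface as an error later in the pipeline.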

Explore the dataset


COVID-QA is a question answering dataset consisting of 2,019 question/answer pairs annotated by volunteer biomedical experts on scientific articles related to COVID-19.

Explore on Hugging Face's Model Hub