Datasets are at the core of modern natural language processing. We use our industry experience to create and share NLP datasets that directly enable users to solve their business problems.
Semantic Answer Similarity
To evaluate semantic answer similarity metrics against lexical-based metrics, we compiled three datasets of answer pairs, where both answers in a pair respond to the same question and each pair is labeled for semantic similarity. The resulting data comprises 3,659 annotated answer pairs.
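To illustrate why lexical-based metrics can fall short, here is a minimal sketch of two common lexical scores, exact match and token-level F1, applied to a single answer pair. The function names and the example pair are hypothetical and are not taken from the datasets; the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Return True only if both answers are identical after lowercasing and trimming."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answers, using simple whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer pair: same meaning, no shared tokens.
prediction, reference = "thirty percent", "30 %"
print(exact_match(prediction, reference))
print(token_f1(prediction, reference))
```

Both scores treat this pair as a complete mismatch even though a human annotator would likely judge the answers equivalent, which is the kind of gap that semantically labeled answer pairs are meant to expose.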
GermanQuAD
GermanQuAD draws on insights from existing datasets and on our experience labeling data with enterprise customers. We combine the strengths of SQuAD with self-sufficient questions that contain all the information needed for open-domain question answering. It is a human-labeled dataset of 13,722 questions and answers.