Enabling German Neural Search: Announcing GermanQuAD and GermanDPR
Two new datasets, three models and a paper to push forward German NLP.
A model is only as good as the data it is fed. Anyone working in machine learning knows this. Great datasets like SQuAD and Natural Questions are the direct catalysts of the breakthroughs that have made neural search as powerful and flexible as it is today. Inspired by their success, we set to work building our own human-annotated Question Answering and Passage Retrieval datasets in German, and we are very happy to announce the release of GermanQuAD and GermanDPR! We can confidently say that the models we trained on this data are the first to perform well on these tasks in German, and we encourage you to try them out!
We learned a lot in the process of creating these resources and are committed to sharing our learnings so that other teams can more easily build semantic search technologies for their language and domain. Everything we used to create them is open source: from the German pre-trained language models to our frameworks FARM and Haystack, where we trained and evaluated the models, and even the annotation tool our labellers used. By sharing our experience, we hope other teams will find it less intimidating to create their own datasets and train models!
The GermanQuAD dataset adopts the format and style of SQuAD: it tackles extractive QA, all passages come from Wikipedia, and all answers are single text spans. Here is a simplified question-passage-answer triple from the GermanQuAD dataset:
However, in various areas, we made some crucial design decisions that distinguish GermanQuAD from SQuAD. For example, we ensured that all questions are self-sufficient: they contain enough information to be unambiguously answered even in an open-domain setting. We also encouraged annotators to write more questions whose answers are longer than a single phrase, so that models learn to make predictions of different lengths. The table below shows the number of passages, questions and answers in the train and test sets of GermanQuAD:
While creating the dataset, we also made sure to remove all duplicate question-answer pairs and to avoid cross-contamination between the train and test sets, a hidden problem that has been identified in other popular QA datasets. In addition, we asked our annotators to reformulate questions that share a lot of lexical overlap with the answer, and we tested the impact of this style of annotation.
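To make the cross-contamination problem concrete, here is a minimal sketch of how leaked questions could be flagged. The normalization and the example questions are illustrative, not the exact preprocessing we used for GermanQuAD:

```python
def normalize(question: str) -> str:
    """Lowercase and strip punctuation so near-identical questions
    collide on the same key."""
    kept = "".join(ch for ch in question.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def find_leaks(train_questions, test_questions):
    """Return test questions that also appear (after normalization)
    in the training split."""
    train_keys = {normalize(q) for q in train_questions}
    return [q for q in test_questions if normalize(q) in train_keys]

train = ["Wann wurde Goethe geboren?", "Wo liegt Berlin?"]
test = ["Wann wurde Goethe geboren ?", "Wer schrieb Faust?"]
print(find_leaks(train, test))  # flags the first test question as a leak
```

An exact string match would miss the first test question because of the stray space before the question mark; that is why deduplication needs to happen on normalized keys.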
These improvements were made possible by the fact that we worked with a team of in-house annotators with whom we were in regular, close contact, rather than relying on crowd-sourced workers. In regular review sessions, the annotators discussed and resolved discrepancies in their annotations.
Using the dataset, we trained and evaluated two QA models, which are already available for you to download and use through Hugging Face's Model Hub. Simply by providing the model names deepset/gelectra-base-germanquad and deepset/gelectra-large-germanquad, you can start using them in Haystack to perform document search or to empower your German chatbot with QA abilities. Below is a comparison of our new German QA models against XLM-R and the human baseline, evaluated on the GermanQuAD dataset. Top-1 accuracy counts a sample as correct if there is any overlap between the prediction and the gold answer.
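The any-overlap criterion could be computed as in the sketch below; the exact tokenization and normalization of the official evaluation may differ:

```python
def top1_overlap(prediction: str, answer: str) -> int:
    """Return 1 if the predicted span shares at least one token with
    the gold answer, else 0 -- the lenient 'any overlap' criterion."""
    pred_tokens = set(prediction.lower().split())
    gold_tokens = set(answer.lower().split())
    return int(bool(pred_tokens & gold_tokens))

print(top1_overlap("im Jahr 1749", "28. August 1749"))   # shares "1749" -> 1
print(top1_overlap("Weimar", "Frankfurt am Main"))       # no shared token -> 0
```

Note how lenient this is: a single shared token is enough for a hit, which is why Top-1 accuracy is typically reported alongside stricter span metrics like exact match and F1.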
Both of these models use our pre-trained GELECTRA models as their starting point. They are first trained on SQuADtranslate, a machine-translated version of SQuAD, and then on GermanQuAD. This second training stage brings a 5-9% absolute increase in Top-1 accuracy, highlighting the quality of GermanQuAD's human-annotated data. You can find more details on the training setup in the model cards of the base and large models.
Strong retrieval is crucial to an open-domain search system, and the Dense Passage Retrieval (DPR) method designed by the team at Facebook Research once again highlighted the potential of Transformers to revolutionize all parts of the search pipeline.
Inspired by its success, we saw the chance to create a German DPR dataset based on GermanQuAD. Since the questions in the QA dataset are self-sufficient, we could treat them as queries in a retrieval setting and use the passages that contain the answer as gold-label documents to be retrieved. After a few preprocessing steps, we were left with a dataset of about 10k samples spread over the train and test splits. The image below shows the same sample as the GermanQuAD example above, converted to the DPR format of query, positive passage (blue) and negative passages (grey):
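In code, the conversion from a QA sample to a DPR training sample boils down to relabeling fields. The sketch below follows the field names of the original DPR JSON layout; the example texts are invented and the real preprocessing involves more steps:

```python
def qa_to_dpr(question, gold_passage, negatives):
    """Turn an extractive QA sample into a DPR retrieval sample: the
    question becomes the query, the passage containing the answer the
    positive context, and unrelated passages the negative contexts."""
    return {
        "question": question,
        "positive_ctxs": [{"text": gold_passage}],
        "hard_negative_ctxs": [{"text": p} for p in negatives],
    }

sample = qa_to_dpr(
    "Wann wurde Johann Wolfgang von Goethe geboren?",
    "Goethe wurde am 28. August 1749 in Frankfurt am Main geboren.",
    ["Schiller wurde 1759 in Marbach am Neckar geboren."],
)
print(sample["question"])
```

Note that this only works because the questions are self-sufficient; a question like "Wann wurde er geboren?" would be useless as a standalone retrieval query.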
Following the original DPR paper, we also provided hard negative passages for each query by using BM25 to find the passages most relevant to that query that do not contain the answer string and come from a Wikipedia page not used in GermanQuAD.
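The selection logic could look like the following sketch. It assumes BM25 scores have already been computed for a set of candidate passages; the scores and texts here are made up, and in practice they would come from a BM25 index over the full passage collection (e.g. Elasticsearch):

```python
def pick_hard_negatives(scored_passages, answer, source_page, n=2):
    """scored_passages: (bm25_score, passage_text, page_title) tuples,
    sorted by descending BM25 relevance to the query. Keep the top-n
    passages that neither contain the answer string nor come from the
    Wikipedia page the positive passage was drawn from."""
    negatives = []
    for _score, text, page in scored_passages:
        if answer in text or page == source_page:
            continue
        negatives.append(text)
        if len(negatives) == n:
            break
    return negatives

candidates = [
    (12.3, "Goethe wurde am 28. August 1749 geboren.", "Johann Wolfgang von Goethe"),
    (10.1, "Schiller war ein enger Freund Goethes.", "Friedrich Schiller"),
    (9.7, "Weimar war ein Zentrum der deutschen Klassik.", "Weimar"),
]
print(pick_hard_negatives(candidates, answer="28. August 1749",
                          source_page="Johann Wolfgang von Goethe"))
```

The first candidate is discarded because it contains the answer string, so the two remaining lexically similar but non-answering passages become the hard negatives; these are exactly the passages a retriever must learn to rank below the positive one.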
Using this new retrieval dataset, we trained one of the first non-English DPR models, which is also already available for you to try on the Hugging Face Model Hub under the names deepset/gbert-base-germandpr-question_encoder and deepset/gbert-base-germandpr-ctx_encoder. It is based on the pre-trained GBERT-base language model and is trained with two hard negatives per query as well as in-batch negatives. Our experiments show that this model performs significantly better than BM25. For more details on its parameters, its training regime, and the correlation between the number of samples and model performance, have a look at the model cards here and here. Below are the Recall and Mean Average Precision metrics (in percent) of our GermanDPR model versus BM25, evaluated on the test set of our new GermanDPR dataset:
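The in-batch negatives mentioned above deserve a quick illustration: every other query's positive passage in a batch doubles as an extra negative, so a batch of N pairs yields N-1 free negatives per query. Below is a pure-Python sketch of the resulting loss on toy two-dimensional embeddings; real DPR training computes the same objective with BERT encoders in PyTorch:

```python
import math

def dpr_loss(query_embs, passage_embs):
    """Average negative log-likelihood of the matching passage, where
    passage i is the positive for query i and every other passage in
    the batch acts as an in-batch negative."""
    total = 0.0
    for i, q in enumerate(query_embs):
        # dot-product similarity of this query against all batch passages
        scores = [sum(qd * pd for qd, pd in zip(q, p)) for p in passage_embs]
        log_denom = math.log(sum(math.exp(s) for s in scores))
        total += -(scores[i] - log_denom)
    return total / len(query_embs)

queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[1.0, 0.1], [0.1, 1.0]]  # passage i is the positive for query i
print(dpr_loss(queries, passages))
```

Minimizing this loss pulls each query embedding towards its positive passage and pushes it away from all other passages in the batch, which is what makes dense retrieval with a dot-product index work at query time.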
deepset is committed to showing that natural language processing does not have to imply English NLP. Our German question answering and dense passage retrieval resources are available for you to use already and we hope they can form the basis of your neural search systems. We learned a lot on this journey and we hope that by sharing our experience, data and models, we might be able to empower other teams to build datasets and train models for their language and domain!