Modern Question Answering Systems Explained

Let’s walk through the main stages of a modern question answering system and demystify it a bit so that you can start to build your own.


Branden Chan

Question answering (QA) has come along in leaps and bounds over the last couple of years. Just one year ago, the SQuAD 2.0 benchmark was smashed overnight by BERT when it outperformed NLNet by 6% F1. Since then, steady gains have been made month to month and human-level performance has already been exceeded by models such as XLNet, RoBERTa and ALBERT. This level of precision would have been hard to imagine in a pre-deep learning era and it shows just how far models have come in understanding both syntax and semantics. The figure below displays an answerable (left) and an unanswerable (right) example from the SQuAD 2.0 dataset:

The complexity of working with the data and modeling pipeline, however, is a high barrier to entry for those who want to tweak and improve these models. Almost all question answering datasets include documents that are too long to fit in a standard Transformer model. As a result, there is a lot of engineering involved in decomposing a document into smaller passages, making individual predictions for each passage and then aggregating them into a single final prediction. This is a challenge that we tackled when reimplementing the question answering pipeline in our frameworks (Haystack and FARM) and here we will share what we learned while building a more developer-friendly and performant way of doing QA. Here's a 10,000-foot overview of the different stages of a question answering pipeline:

Before we get into it, it’s worth stating that by question answering we are referring specifically to SQuAD 2.0 style extractive question answering, where a model must either identify the span in a document which answers the given question (positive-answer) or else state that the question cannot be answered by the document (no-answer).


In SQuAD, each document is a single paragraph from a Wikipedia article and each can have multiple questions. When using FARM, you will find that each document-question pair is stored in a data structure called a Basket, which will be given a single final prediction by the model. In practice, however, documents often contain more tokens than can be passed through the model. As a result, a document will be split into multiple passages and each passage-question pair is stored in a Sample object. Both Baskets and Samples are initialized at this point in FARM.

The input sequence that is fed into the model will contain tokens from the question, tokens from the passage, model-specific special tokens and padding tokens. The combined length of these four must not exceed the model’s max sequence length and this must be accounted for when we divide up our document into passages. Here is the function that splits up documents.
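To make the token budget concrete, here is a minimal sketch of the splitting idea. It is not the FARM function itself: the function name, the default of three special tokens (a BERT-style `[CLS] question [SEP] passage [SEP]` layout) and the overlapping windows controlled by `stride` are all illustrative assumptions.

```python
def split_into_passages(doc_tokens, question_len, max_seq_len=384,
                        n_special_tokens=3, stride=128):
    """Split a tokenized document into overlapping passages that fit the model.

    Hypothetical helper for illustration, not FARM's API.
    `stride` is the number of tokens between the starts of two
    consecutive passages.
    """
    # Tokens left for the passage once the question and the special
    # tokens have claimed their share of the sequence.
    passage_len = max_seq_len - question_len - n_special_tokens
    if passage_len <= 0:
        raise ValueError("question leaves no room for passage tokens")
    if not 0 < stride <= passage_len:
        raise ValueError("stride must be in (0, passage_len] to cover the document")

    passages = []
    start = 0
    while True:
        end = min(start + passage_len, len(doc_tokens))
        # Remember where the passage starts so that predictions can be
        # mapped back to document positions later.
        passages.append({"start": start, "tokens": doc_tokens[start:end]})
        if end == len(doc_tokens):
            break
        start += stride
    return passages
```

Passages shorter than the budget would then be filled up with padding tokens. Overlapping the windows is one common way to reduce the chance that an answer is cut in half at a passage boundary.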

The labels for positive-answers are represented in the model by start and end token indices (e.g. [20, 54]). A no-answer is represented by a start and end at index 0 of the passage (i.e. [0, 0]). This usually means that start and end will land on the first special token (e.g. [CLS] in BERT).
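A small sketch may help show how a document-level answer turns into a label for one passage. It makes several assumptions that go beyond the text above: label indices count positions in the full input sequence (which is consistent with [0, 0] landing on `[CLS]`), the sequence is laid out as `[CLS] question [SEP] passage ...`, and a passage that does not fully contain the answer is labeled no-answer. The function and parameter names are hypothetical.

```python
def passage_label(answer_span, passage_start, passage_len, question_len,
                  n_leading_special=2):
    """Map a document-level answer span to (start, end) for one passage.

    answer_span: (start, end) token indices in the document, end inclusive,
                 or None if the question is unanswerable.
    passage_start: document index of the passage's first token.
    n_leading_special: special tokens before the passage, assumed to be
                       [CLS] and the [SEP] that follows the question.
    """
    if answer_span is None:
        return (0, 0)

    doc_start, doc_end = answer_span
    passage_end = passage_start + passage_len - 1
    # Assumption: an answer that is not fully inside this passage makes
    # the passage a no-answer sample.
    if doc_start < passage_start or doc_end > passage_end:
        return (0, 0)

    # Shift from document coordinates to input sequence coordinates.
    offset = n_leading_special + question_len - passage_start
    return (doc_start + offset, doc_end + offset)
```

For example, with a 10-token question and a passage that starts at document token 100, an answer at document tokens 120 to 125 maps to positions 32 to 37 of the input sequence.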


An input sequence can be passed directly into the language model, as is standard in the transfer learning paradigm. For every token that enters the model, a contextualized word vector is returned. If you’re interested in understanding the finer details of this process, we highly recommend this illustrated blog post. If you’d simply like to play around with these models, FARM already has support for BERT, RoBERTa and XLNet (ALBERT coming soon).

When concatenated together, the word vectors form a matrix of shape S x D, where S is the max sequence length and D is the number of dimensions in each word vector. This is passed through a feed-forward network enclosed in the prediction head, which will generate two logit vectors of length S, one for start and one for end. Each position in the vectors corresponds to a token in the input sequence. High values in the start vector signal that the model is confident that the corresponding token is the start of the answer span. The positions with high logit values (dark blue) will likely be chosen as the start and end of the answer span:

A no-answer prediction is represented by high logit values (dark blue) on the start and end vectors at index 0:

The score for a given span is calculated by adding together the start and end logits, so long as they form a valid start-end pair. A start-end pair may be considered invalid if the end comes before the start, or if either start or end falls on an invalid token such as the padding.
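The scoring rule can be sketched in a few lines of plain Python. This is an illustrative brute-force version, not FARM's implementation: `score_spans` and `valid_mask` are hypothetical names, and which positions count as valid (for example, whether question tokens are excluded along with padding) is left to the caller.

```python
def score_spans(start_logits, end_logits, valid_mask):
    """Return the no-answer score and the best valid positive span.

    valid_mask[i] is True if position i may be part of an answer
    (for example a passage token) and False otherwise (for example padding).
    The best span is a (score, start, end) tuple, or None if no
    valid span exists.
    """
    # No-answer is scored like any other span, using index 0.
    no_answer_score = start_logits[0] + end_logits[0]

    best = None
    for start, start_ok in enumerate(valid_mask):
        if not start_ok:
            continue
        # Starting at `start` rules out spans whose end precedes the start.
        for end in range(start, len(valid_mask)):
            if not valid_mask[end]:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return no_answer_score, best
```

The double loop is quadratic in the sequence length. A practical version would typically restrict the search to the top few start and end candidates, but the exhaustive form keeps the validity rules easy to see.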


In our implementation, the aggregation layer first looks at each passage’s no-answer prediction. If no-answer is the top prediction in every passage, the document level prediction will also be no-answer. Otherwise, the final prediction is made by choosing the highest scoring positive-answer span across all passages. Note that this is made possible by the fact that the logits are not put through a softmax function and thus values even from different passages are directly comparable. Our implementation differs somewhat from the original BERT SQuAD implementation where the top positive-answer is compared against the lowest no-answer score. We found our implementation more intuitive and tests showed that the performance was no worse than the original.
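The aggregation rule described above can be sketched as follows. The data layout and names are assumptions made for illustration, as is the tie-breaking choice that no-answer wins when scores are equal. Following the wording above, the winning positive span is taken from all passages, including those whose own top prediction was no-answer.

```python
def aggregate_passages(passage_preds):
    """Combine per-passage predictions into one document-level prediction.

    Each entry is a dict with:
      'no_answer_score': start + end logit at index 0
      'best_span': (score, start, end) of the best positive span, or None
    Returns None for no-answer, else a dict describing the winning span.
    """
    def prefers_no_answer(pred):
        span = pred["best_span"]
        # Assumption: ties go to no-answer.
        return span is None or pred["no_answer_score"] >= span[0]

    # No-answer at document level only if every passage agrees.
    if all(prefers_no_answer(p) for p in passage_preds):
        return None

    # Raw logits are compared directly across passages, without softmax.
    candidates = [
        (i, p["best_span"])
        for i, p in enumerate(passage_preds)
        if p["best_span"] is not None
    ]
    passage_idx, (score, start, end) = max(candidates, key=lambda c: c[1][0])
    return {"passage": passage_idx, "start": start, "end": end, "score": score}
```

The returned start and end are still relative to the winning passage's input sequence, so they need to be shifted back to document positions before the answer string is extracted.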

Formatting Predictions

Once the model has returned a token answer span, our system extracts from the original document the string prediction as well as the character index at which this string starts. The latter is important in any kind of printout or visualisation setting where you would want to disambiguate between multiple matching string spans and also define a context around the answer. You can have a look at our interactive demo to start playing around with a trained QA model. Or if you’d rather get your hands dirty training and evaluating a model, here is the FARM QA Colab tutorial.
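As a rough illustration of the formatting step, the sketch below maps a token span back to a string, its character offset and some surrounding context. It assumes the span is already expressed in document-level token indices and that each token's character offsets were recorded at tokenization time. A whitespace tokenizer stands in for the real one, and all names are hypothetical.

```python
import re


def whitespace_offsets(document):
    """Return (char_start, char_end) per whitespace-separated token.

    Stand-in for the offsets a real tokenizer would record.
    char_end is exclusive.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", document)]


def format_prediction(document, token_offsets, start_token, end_token,
                      context_window=20):
    """Turn a token span (end inclusive) into an answer string with context."""
    char_start = token_offsets[start_token][0]
    char_end = token_offsets[end_token][1]
    # Clamp the context window to the document boundaries.
    ctx_start = max(0, char_start - context_window)
    ctx_end = min(len(document), char_end + context_window)
    return {
        "answer": document[char_start:char_end],
        "offset_start": char_start,
        "context": document[ctx_start:ctx_end],
    }


document = "The Eiffel Tower is in Paris, France."
offsets = whitespace_offsets(document)
prediction = format_prediction(document, offsets, start_token=1, end_token=2)
print(prediction["answer"], prediction["offset_start"])  # Eiffel Tower 4
```

Slicing the original document by character offsets, rather than joining tokens back together, keeps the answer string identical to the source text and makes the start offset unambiguous when the same string occurs more than once.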


In this redesign of our QA code, we had both speed and SQuAD performance in mind. By incorporating multiprocessing deeply in our data processing pipeline, we were able to reduce the time it takes to prepare the SQuAD train and dev datasets from 20 min 50 s using HuggingFace’s Transformers to just 42 s on FARM (both on an AWS p3.8xlarge 32-core machine). It’s worth mentioning that these speed improvements have a significant impact not just on training but also at inference time when your model is in production. Multiprocessing is only one way of speeding up computation and there will be many more code optimizations to come in later FARM releases.

Using the training hyper-parameters shown below, we ran models for 2 epochs of the SQuAD dataset, amounting to around 10k batches of 32 (8.8k train + 1.2k dev). With 4 x NVIDIA V100 GPUs, the full run took just under 2 hours, which we later found could be reduced by almost half with a batch size of 60.

Our models also achieved highly competitive performance on the SQuAD 2.0 dataset, as can be seen in the table below (cf. the official leaderboard). These numbers will almost certainly increase if you use the large models instead of the base ones or if you train an ALBERT model (coming very soon to FARM).

Interestingly, while using FARM’s question answering code for a client dataset, we found that the performance of our model was often robust, even in cases where documents were hundreds of passages long. We see this as a strong sign that there is a bright future for question answering models in large scale industry applications.

One other significant learning to come out of our work was that dealing with large numbers of large documents could really stress the FARM pipeline. As a result, we are very happy to announce that we have open-sourced and will continue developing Haystack, an open source question answering framework that is designed to scale QA models for many long documents.


Currently we are seeing a significant shift in the field of applied natural language processing and neural search. Interest is moving from the more discretized text processing tasks like classification or Named Entity Recognition to tasks which test a system’s ability to truly understand the meaning of text. We at deepset are excited to share our learnings with the broader machine learning community through our open source frameworks. We also hope that this short introduction might just give you the inspiration to develop your own question answering system!