How to Evaluate a Question Answering System

Use quantifiable metrics, coupled with a labeled evaluation dataset, to reliably evaluate your Haystack question answering system


If you want to draw conclusions about a system’s quality, subjective impressions are not enough. Rather, you’d want to use quantifiable metrics — coupled with a labeled evaluation dataset — to reliably evaluate your question answering (QA) system.

Having an evaluation pipeline in place allows you to:

  • conduct informed assessments of your system’s quality,
  • compare the performance of different models, and
  • identify underperforming components of your pipeline.

In this tutorial, we’ll explain the concepts behind setting up an evaluation protocol for your QA system. We’ll then go through an example QA system evaluation.

Evaluation of QA Systems: An Overview

If you’ve already used the Haystack framework, you might appreciate its modular approach to building a working pipeline for extractive QA. Nonetheless, the result is a complex system for natural language processing (NLP) that poses some unique challenges in terms of evaluation.

An extractive QA system consists of one retriever and one reader model that are chained together in a pipeline object. The retriever chooses a subset of documents from a (usually large) database in response to a query. The reader then closely scans those documents to extract the correct answer.

During evaluation, you’ll want to judge how the pipeline is performing as a whole, but it’s equally important to examine both components individually to understand whether one is underperforming. If the reader shows low performance, you may need to fine-tune it to the specifics of your domain. But if the retriever is causing a bottleneck, you might increase the number of documents returned, or opt for a more powerful retrieval technique.

Datasets for evaluation

Your evaluation system should be based on manually annotated data that your system can be checked against. In a question answering context, annotators mark text spans in documents that answer a given query. (If you want to learn more about annotation in QA, check out our Haystack annotation tool guide!) Some datasets provide one answer per question, while others mark multiple options.

When a document does not contain the answer to a query, the annotators mark “None” as the correct answer to be returned by the evaluated system.

Open vs. closed domain

There are two evaluation modes known as “open domain” and “closed domain.”

Closed domain means single document QA. In this setting, you want to make sure the correct instance of a string is highlighted as the answer. So you compare the indices of predicted against labeled answers. Even if the two strings have identical content, if they occur in different documents, or in different positions in the same document, they count as wrong. This mode offers a stricter and more accurate evaluation if your labels include start and end indices.

Alternatively, you should go for open domain evaluation if your labels are only strings and have no start or end indices. In this mode, you look for a match or overlap between the two answer strings. Even if the predicted answer is extracted from a different position (in the same document or in a different document) than the correct answer, that’s fine as long as the strings match. Therefore, open domain evaluation is generally better if you know that the same answer can be found in different places in your corpus.
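To make the difference between the two modes concrete, here is a minimal sketch in plain Python. The Answer class and both helper functions are hypothetical and are not part of Haystack; we also assume, for simplicity, that a closed domain match requires the same document, the same start offset, and the same text.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """Hypothetical answer record for illustration (not a Haystack class)."""
    text: str
    document_id: str
    start: int  # character offset of the answer within its document


def matches_closed_domain(pred: Answer, label: Answer) -> bool:
    # Closed domain: the prediction must point at the labeled span itself.
    return (
        pred.document_id == label.document_id
        and pred.start == label.start
        and pred.text == label.text
    )


def matches_open_domain(pred: Answer, label: Answer) -> bool:
    # Open domain: only the answer strings are compared (assumed
    # normalization: strip whitespace and lowercase).
    return pred.text.strip().lower() == label.text.strip().lower()


label = Answer("Berlin", "doc-1", 10)
same_span = Answer("Berlin", "doc-1", 10)
other_doc = Answer("Berlin", "doc-2", 87)

print(matches_closed_domain(same_span, label))  # same document and position
print(matches_closed_domain(other_doc, label))  # same string, different place
print(matches_open_domain(other_doc, label))    # strings match, position ignored
```

The second prediction is the interesting case: it counts as wrong in closed domain mode but as correct in open domain mode.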

Retriever metrics

To evaluate your system’s quality, you’ll need easily interpretable metrics that mimic human judgment. Because the reader and retriever have different functions, we use different metrics to evaluate them.

When running our QA pipeline, we set the top_k parameter in the retriever to determine the number of candidate documents that the retriever returns. To evaluate the retriever, we want to know whether the document containing the right answer span is among those candidates.

Recall measures the proportion of queries for which the correct document was among the retrieved documents. For a single query, the output is binary: either a correct document is contained in the selection, or it is not. Over the entire dataset, the recall score amounts to a number between zero (no query retrieved the right document) and one (all queries retrieved the right documents).
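As a rough illustration, recall as described here could be computed like this. The function name and the document IDs are made up for the example, and this is not how Haystack implements the metric internally.

```python
def recall_at_k(retrieved_per_query, relevant_per_query, k):
    """Fraction of queries with at least one relevant document in the top k."""
    if not retrieved_per_query:
        return 0.0
    hits = 0
    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        # Binary outcome per query: 1 if any relevant document was retrieved.
        if any(doc_id in relevant for doc_id in retrieved[:k]):
            hits += 1
    return hits / len(retrieved_per_query)


# Hypothetical retrieval results for two queries (best match first).
retrieved = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1"}, {"d5"}]

print(recall_at_k(retrieved, relevant, k=3))  # one of two queries is a hit
```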

In contrast to the recall metric, mean reciprocal rank (MRR) takes the position of the first correct document (its “rank”) into account. For each query, it takes the reciprocal of that rank, and then averages the values over all queries. It does this to account for the fact that a query elicits multiple responses of varying relevance. Like recall, MRR can be a value between zero (no matches) and one (the system retrieved the correct document as the top result for all queries).
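A small sketch of the MRR computation, again with invented document IDs and a hypothetical function name rather than Haystack's own code:

```python
def mean_reciprocal_rank(retrieved_per_query, relevant_per_query):
    """Average of 1/rank of the first relevant document (0 if none is found)."""
    if not retrieved_per_query:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(retrieved_per_query)


# Query 1: hit at rank 2, query 2: hit at rank 1, query 3: no hit.
retrieved = [["d3", "d1"], ["d5", "d2"], ["d8", "d9"]]
relevant = [{"d1"}, {"d5"}, {"d4"}]

print(mean_reciprocal_rank(retrieved, relevant))  # (1/2 + 1 + 0) / 3
```

Moving the correct document for the first query from rank 2 to rank 1 would raise the score, while recall would stay the same.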

Reader metrics

When evaluating the reader, we want to look at whether, or to what extent, the selected answer passages match the correct answer or answers. The following metrics can evaluate either the reader in isolation or the QA system as a whole. To evaluate only the reader node, we skip the retrieval process by directly passing the document that contains the answer span to the reader.

Exact match (EM) is as strict as its name suggests: it measures the proportion of questions for which the predicted answer is identical to the correct answer. For example, for the question “What is Haystack?” with the labeled answer “A question answering library in Python,” even a predicted answer like “A Python question answering library” would yield a zero score because it does not match the expected answer 100 percent.
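In code, a simplified version of exact match might look as follows. The function names are hypothetical, and the normalization (stripping whitespace and lowercasing) is an assumption of this sketch; real evaluation scripts often normalize more aggressively.

```python
def exact_match(prediction: str, label: str) -> int:
    """1 if the normalized strings are identical, otherwise 0."""
    return int(prediction.strip().lower() == label.strip().lower())


def exact_match_score(predictions, labels) -> float:
    """Proportion of questions whose prediction exactly matches the label."""
    if not predictions:
        return 0.0
    matches = sum(exact_match(p, l) for p, l in zip(predictions, labels))
    return matches / len(predictions)


predictions = ["A Python question answering library", "berlin"]
labels = ["A question answering library in Python", "Berlin"]

print(exact_match(predictions[0], labels[0]))  # reordered words do not match
print(exact_match_score(predictions, labels))  # one of two questions matches
```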

The F1 score is more forgiving than the EM score, and more closely resembles human judgment of the similarity of two answer strings. It measures the word overlap between the labeled and the predicted answer. Thus, the two answers in the example above would receive a high F1 score of roughly 0.9 rather than zero: every word of the prediction appears in the label, and only the label’s “in” is missing from the prediction.
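Here is a minimal sketch of a token-level F1 score. It assumes plain lowercase whitespace tokenization, which is simpler than what evaluation scripts typically use, and the function name is ours. With this particular tokenization, the Haystack answer pair from the EM example comes out at about 0.91 rather than a full 1.0, because the label contains one word (“in”) that the prediction lacks.

```python
from collections import Counter


def token_f1(prediction: str, label: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    label_tokens = label.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(label_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    "A Python question answering library",
    "A question answering library in Python",
)
print(round(score, 3))  # precision 5/5, recall 5/6
```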

The accuracy metric is used in closed domain evaluation: a reader scores 1 if the predicted answer has any word overlap with the labeled answer. Consider the pair of answers “San Francisco” and “San Francisco, California”. EM would score this pair as zero and F1 would penalize it for the extra word, whereas accuracy would give it a perfect score. This metric is more reflective of the experience of the end user since in many use cases, the context around the predicted answer will also be provided to the user.
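The “any overlap” idea can be sketched in a few lines. This follows the word-overlap description above and is only an approximation for illustration; the function name and the regex-based tokenization are assumptions, not Haystack's implementation.

```python
import re


def overlap_accuracy(prediction: str, label: str) -> int:
    """1 if prediction and label share at least one word, otherwise 0."""
    pred_words = set(re.findall(r"\w+", prediction.lower()))
    label_words = set(re.findall(r"\w+", label.lower()))
    return int(bool(pred_words & label_words))


print(overlap_accuracy("San Francisco", "San Francisco, California"))
print(overlap_accuracy("Paris", "San Francisco, California"))
```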

Semantic answer similarity

While F1 is more flexible than EM, it still does not address the fact that two answers can be equivalent even if they don’t share the same tokens. For example, both scores would rate the answers “one hundred percent” and “100 %” as sharing zero similarity. But as humans, we know that the two express exactly the same thing.

To make up for this shortcoming, a few members of the deepset team recently introduced the semantic answer similarity (SAS) metric. (The paper got accepted into the EMNLP conference 2021, which we’re very excited about!) It uses a Transformer-based cross-encoder architecture to evaluate the semantic similarity of two answers rather than their lexical overlap. SAS has become available in Haystack with the latest release and an article on how to use it is coming soon.

Example: Evaluating a QA System

For our real-life evaluation example, we’ll give a high-level overview of this QA evaluation system tutorial. You can copy the notebook and follow along in Colab.

In short, the code in the tutorial notebook starts by:

  • initializing a document store
  • downloading a small slice of Google’s Natural Questions dataset as evaluation data
  • splitting the documents into chunks with a PreProcessor
  • writing these documents into the document store
  • initializing the retriever and reader models

At this point, we can inspect the documents and labels as follows:

# Documents live in the document index, labels in the label index
documents_ = document_store.get_all_documents(index=doc_index)
labels_ = document_store.get_all_labels(index=label_index)

How does our first document look?


>>> 'Book of Life - wikipedia Book of Life Jump to: navigation, search This article is about the book mentioned in Christian and Jewish religious teachings. For other uses, see The Book of Life...'

Next, let’s have a look at the corresponding label. The label contains quite a bit of information; for evaluation, the fields that matter most are the question, the answer, and the answer’s start offset in the document:


>>> {'answer': 'every person who is destined for Heaven or the World to Come',
'created_at': '2021-09-16 15:20:17',
'document_id': '1b090aec7dbd1af6739c4c80f8995877-0',
'is_correct_answer': True,
'is_correct_document': True,
'meta': {},
'model_id': None,
'no_answer': False,
'offset_start_in_doc': 374,
'origin': 'gold_label',
'question': 'who is written in the book of life',
'updated_at': '2021-09-16 15:20:17'}

We can see that the label contains one answerable question for our first document, with one possible answer.

Different styles of evaluation

Haystack is flexible enough that you can evaluate either the Retriever or Reader in isolation, or both chained together in a joint pipeline. The joint option is important because it is the clearest indication of how well your full system will run, but also because it shows the interaction of the two components. In this joint system, you can more accurately see how many of your errors originate in the Reader and how many come from the Retriever.

There are two different ways to write code for evaluation. Both the Retriever and the Reader have an eval() method that allows them to be evaluated individually. Note, however, that FARMReader.eval() only allows for closed domain evaluation. Alternatively, you can place an EvalDocuments node after a Retriever or an EvalAnswers node after a Reader, or do both at the same time. Each time the pipeline is run, these nodes evaluate the predictions made by the Retriever and the Reader. Once you have run all your queries, you can call their print() methods to print out the final metrics. Note that EvalAnswers currently only supports open domain evaluation.

Evaluating just the Retriever

Evaluating the retriever is as easy as calling the eval() method:

retriever_eval_results = retriever.eval(top_k=10, label_index=label_index, doc_index=doc_index)

With these settings you can expect:

>>> 0.43243243243243246
>>> 0.18421040087706764

If run in open domain mode as follows:

retriever_eval_results = retriever.eval(top_k=10, label_index=label_index, doc_index=doc_index, open_domain=True)

You can expect:

>>> 0.98
>>> 0.7978063492063492

Note that in both cases, the evaluation uses top_k=10, meaning that the system scores 1 for recall on a query when a correct document appears among the top 10 retrieval results.

Evaluating the Reader

Next, we’ll look at the reader in isolation by calling eval() again:

reader_eval_results = reader.eval(document_store=document_store, device=device, label_index=label_index, doc_index=doc_index)

Let’s look at the scores for our reader:

>>> 74.23312883435584
>>> 75.08344498507394
>>> 95.39877300613497

Evaluating an entire pipeline

For the open domain evaluation of the whole question answering system at once, we need to add two special evaluation nodes to the pipeline: EvalDocuments for evaluating the retriever and EvalAnswers for the reader.

from haystack.eval import EvalAnswers, EvalDocuments

eval_retriever = EvalDocuments()
eval_reader = EvalAnswers()

We can now set up the evaluation pipeline:

from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
pipeline.add_node(component=eval_retriever, name="EvalRetriever", inputs=["ESRetriever"])
pipeline.add_node(component=reader, name="QAReader", inputs=["EvalRetriever"])
pipeline.add_node(component=eval_reader, name="EvalReader", inputs=["QAReader"])

We can now run the pipeline by looping over the labels. We’ll also store the evaluation results in a list, so that we can investigate them later.

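labels = document_store.get_all_labels_aggregated(index=label_index)
results = []
for l in labels:
    res = pipeline.run(
        query=l.question,
        labels=l,
        # Node-specific keys must match the names given in add_node() above
        params={"index": doc_index, "ESRetriever": {"top_k": 10}, "QAReader": {"top_k": 5}},
    )
    results.append(res)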

We can now print different metrics from the pipeline’s nodes.


>>> Pipeline
queries: 50
top 1 EM: 0.4800
top k EM: 0.6800
top 1 F1: 0.5226
top k F1: 0.7394
(top k results are likely inflated since the Reader always returns a no_answer prediction in its top k)

In addition to the performance metrics, we can also look at how well the model did in terms of time:


>>> Retriever (Speed)
No indexing performed via
Queries Performed: 50
Query time: 0.6469987739988028s
0.012939975479976057 seconds per query

>>> Reader (Speed)
Queries Performed: 50
Query time: 121.42763020699931s
2.4285526041399863 seconds per query
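The per-query figures above are simply the total query time divided by the number of queries. If you want to take a comparable measurement around your own code, a sketch along these lines could work. The time_queries helper and the stand-in query function are hypothetical and do not reproduce Haystack's internal timing.

```python
import time


def time_queries(run_query, queries):
    """Return (total seconds, seconds per query) for calling run_query on each query."""
    start = time.perf_counter()
    for query in queries:
        run_query(query)
    total = time.perf_counter() - start
    return total, total / len(queries)


# Stand-in for a real retriever or reader call.
def fake_query(query: str) -> str:
    return query.upper()


total, per_query = time_queries(fake_query, ["who wrote hamlet", "what is haystack"])
print(f"Query time: {total}s")
print(f"{per_query} seconds per query")
```

In the output above, the reader accounts for almost all of the time per query, which is typical since it runs a Transformer model over every retrieved document.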

Optimizing Question Answering Pipeline Performance with Haystack

Now that you know how to evaluate QA pipeline performance in Haystack, it’s time to build high-quality question answering systems that are tailored to your use case!

Start by heading over to our GitHub repository. If you like what you see, give us a star :)