More than Metrics: How to Test an NLP System

Our guide to evaluating applications that extract or generate natural language.

One of the trickiest tasks in natural language processing (NLP) systems is evaluating how well a particular language model functions. Developers need to test their NLP models’ quality, both in order to understand whether they solve the problem well and to compare different models to each other. However, it is often difficult to determine when a text correctly responds to a given input, because language can represent one idea in many completely different ways.

Consider the following three sentences:

  1. Today’s weather is sunny.
  2. The sun is shining brightly.
  3. It is a rainy day. 

If we were to evaluate the similarity of these sentences purely on the basis of the words that they use, we would have to come to the conclusion that they’re all equally different from each other. However, the meanings of sentences 1 and 2 are much closer to each other than to that of the third sentence. 

Because it’s usually difficult to define whether two statements in natural language mean the same thing, evaluating systems that extract or generate natural language is much more challenging than assessing the quality of NLP classification tasks like sentiment classifiers. For the latter, all an evaluation procedure needs to do is compare the correct label to the one predicted by the model. But to successfully evaluate an extractive or generative NLP system, it’s necessary to include qualitative judgments in addition to the quantitative measurements.

No single tool or methodology can solve the complex problem of NLP testing. In this article, we review some of the strategies that can help you understand the quality of a model, all of which you can incorporate into existing or upcoming projects.

Quantitative Evaluation

The quantitative evaluation of any machine learning system requires some annotated data: data points with associated labels, which represent the output that a well-trained system should return. We can then use metrics like accuracy or F1 to compute how close the system’s predictions got to the real labels. The choice of metric depends on the use case.  

It's possible to find labeled datasets in the real world, without hiring any extra workers for annotation. For example, the Yelp reviews dataset contains written reviews, each with an associated rating. The ratings were applied by the reviewers, and so no additional annotation is required. The dataset can be used to train sentiment classifiers, whose goal is to predict the correct rating by reading the review. In other cases, workers have to be hired and tasked with annotation, as for the creation of the SQuAD dataset for question answering.

The annotated dataset should accurately mirror the real-world use cases your system will encounter in production. If you use data that is very differently distributed from the real dataset, or that only captures certain problems and leaves out others, then the result of your evaluation won’t be representative of your system’s actual performance. 

For example, if your use case consists of building a textual recommendation system whose task is to return legal documents on the basis of semantic similarity, then it won’t be sufficient to train your model on general language data. Rather, you’ll want your model to encounter, during training, the same kind of legal texts that it will see once it is deployed to production.


Annotation is a deceptively complex task, due to the difficulty of putting real-world data into neatly separated categories, and the challenge of agreeing on annotation practices as a team. To make the task a bit easier, here are some tips:

  • Write annotation guidelines. They’re a crucial reference point for your team of annotators, and can later serve to document your data set and annotation process.
  • Make sure to test the annotation guidelines on your data, to understand if you’re really capturing all the use cases.
  • If you’re working with a team, train them and use measurements like an inter-annotator agreement to ensure that they’re annotating the data in a consistent manner. It also helps to set up a communication channel between annotators, where they can discuss certain edge cases, to further ensure the quality of the annotations.
  • If you can, use an annotation platform. It will make it easier to organize the data points and coordinate the work. Plus, a well-designed user interface will make it easier for your team to accomplish the task faster.

Once you have your evaluation data set in place, you can use it to compute a quantifiable metric of the quality of your system or systems.

Evaluating an NLP pipeline

Most modern NLP systems are assembled in pipelines — modular recipes that result in a sequence of models and data processing tools. Developing your NLP system using pipelines makes it easy to prototype and compare various different NLP components. Unfortunately (as if measuring the success of your model wasn’t already complicated enough), building NLP systems in a composite pipeline increases the difficulty of measuring outcomes

When you evaluate an NLP pipeline, you can check the prediction quality either of the entire system or of individual nodes. The metrics you use then depend on the task that the nodes accomplish. For example, to evaluate a retrieval node, you’ll be using a metric like recall, which measures the percentage of correctly retrieved documents.

When it comes to the evaluation of actual text-based nodes, you’ll be looking at metrics like the F1 score, which measures the lexical overlap between the expected answer and the system’s answer. However, the shortcomings of such metrics are apparent when we talk about testing semantic NLP systems, whose entire purpose is to abstract away from individual words and focus on the meaning of a text.

Two people could write vastly different summaries of the same text, and both could be equally appropriate. To capture that property, different frameworks have proposed Transformer-based metrics like our own semantic answer similarity (SAS), which measure the semantic rather than the lexical congruence of two texts. However, as such methods are only slowly gaining wider traction, complementing your quantitative analysis with a qualitative one remains absolutely necessary.

Qualitative Evaluation

Even if the data you use for your evaluation data set is relatively recent, nothing beats evaluating your system using real user feedback. One of the many advantages of the prototyping enabled by NLP pipelines is tooling for rapid training, deployment, and even user feedback cycles. Quick turn around on prototypes also makes A/B testing a breeze, and allows you to flush out bugs rapidly. 

Share a simple browser-based prototype with a representative group of users and have them submit queries as they would to your final system. You can then ask them to evaluate the answers they received from the system. Aggregated, these judgments are valuable numbers that will help you decide which system to use in production, or which aspects of your system need improvement.

Testing your system with real-world users has a second benefit: it will alert you to discrepancies between your training or evaluation data set and the language used in the real world. Perhaps the demographic of your users has shifted, and they talk about different topics than they did when you collected your data. Perhaps they use words or abbreviations that your system has never seen before. You’ll only know by investigating the data produced by your actual users. 

If it turns out that the language used by your users differs significantly from what your system has learned, then it’s time to look into another data-driven technique: fine-tuning your language models.

Another advantage of having a deployed and integrated prototype, as many deepset Cloud users have told us,  is that it can be extremely valuable when you want to showcase your system to different stakeholders. You can describe your system and its capabilities all you want, but by letting managers or clients try it out for themselves, you’ll allow them to truly get a feel for your NLP system and what it can do.

Towards a Data-centric AI

Hopefully you now have a clearer idea of the challenges and tools for applying a good NLP annotation and evaluation strategy to your system. But beyond evaluating which models and system parameters are most successful for your use case, is there additional value in this emphasis on accuracy metrics and user feedback?

In fact, diligent attention to testing and evaluation is related to a large trend in machine learning dubbed “data-centric AI.” This is the concept that deep learning models are generally best improved by focusing on the quality of training data rather than the model architecture itself. Researchers in data-centric AI have come up with best practices regarding the annotation process, dataset documentation, and “data in deployment” — that is, the ongoing need for monitoring and updating the training data for a system in production.

In most existing NLP systems that we consult on, the clearest path to major system improvements is data refinement. This might mean updating annotation guidelines to have a more precise labeling scheme, or getting granular user data to find which use cases are falling through the cracks during training. These data-centric successes are just one more reason that we recommend everyone begins with solid evaluation and annotation procedures for their NLP systems.

Testing NLP Projects Together

This post is an adaptation from a section of our ebook “NLP for Developers.” The ebook takes a deep dive into the practical considerations of modern NLP, and aims to provide insight to fledgling data scientists and seasoned language modelers alike. If you find yourself wondering about questions like:

  • Do I need to understand all the intricacies of the Transformer model architecture? 
  • What is the state-of-the-art process for scaling prototypes to production-ready systems?
  • How can I efficiently compare different pipeline architectures and hyperparameter settings to find those that work best for me? 
  • Which existing language models would best adapt to my use case? 
  • How can I best collaborate with back-end engineers on my team towards the final deployment of the NLP system? 

… then “NLP for Developers” is for you.

If these questions seem beyond the technical depth you work at, you might be interested in our first ebook “NLP for Product Managers,” a process-focused guide to applied NLP, which focuses less on technical details and aims to help you administrate a high-function NLP project.

Beyond our ebooks, deepset is proud to support and participate in the vibrant community of NLP developers. Our blog hosts reflections and explainers gathered by our team. Meanwhile, our Haystack framework is an open source toolkit that can help you design, build, and — yes — test an NLP system end-to-end.