Evaluating RAG Part II: How to Evaluate a Large Language Model (LLM)

Let’s discuss the unique challenge of evaluating the output of generative AI models – and the most promising solutions

Semantic similarity is one of the most intriguing and challenging problems in language modeling. How do we even begin to compare the creative output of generative AI to a predefined ground truth? Yet the proliferation of large language models (LLMs) across industries means that we must find ways to evaluate their quality.

This article is the second in a mini-series on evaluating RAG pipelines. In our previous installment, we discussed the importance of evaluating retrievers. Now it’s time to turn our attention to the heart of any RAG system: the LLM itself.

The LLM’s role in a RAG pipeline

RAG stands for retrieval augmented generation, an LLM pipeline paradigm. The first part of a RAG pipeline is a retrieval engine that identifies relevant documents in response to a user query. Embedded in the prompt, these documents are then presented to the LLM as a kind of fact-checked knowledge base for its answer.

In a basic RAG setup, there are at least two sources of error: the retriever and the LLM. In our previous article, we discussed how the performance of the retriever can affect the overall results of the pipeline. We also saw how to quantify a retriever's performance using annotated datasets and metrics such as recall and mean reciprocal rank (MRR).

However, if your retriever is performing well enough, and your RAG pipeline still isn't delivering, you may want to take a closer look at another component: the large language model itself.

Why evaluating generative models is hard

To evaluate an LLM in an automated way, we need to compare its outputs to an annotated dataset. So we feed a prompt to the model and compare the output with an answer defined by our annotators. This is easier said than done.

One reason is that language is creative, and utterances that use different words and sentence structure can still express very similar things, such as the following two sentences:

Léa is doing a great job.

She's killing it.

Second, language is contextual. So if I told you that "she" in the second sentence referred to someone who works as a butcher, you might reassess its similarity to the first sentence.

These complexities are compounded when we consider longer and more intricate texts. For example, we may want to evaluate how well our LLM summarizes text. To do this, we write our own summaries and compare them with the generated ones. However, even if the LLM-generated output is structured and phrased differently from ours, it could still be an excellent summary of the underlying documents.

Now that we understand the caveats involved in evaluating LLMs, let's look at existing techniques.

Evaluation methods for LLMs

When it comes to evaluating large language models, several techniques have been proposed. Here we look at the most promising ones, in ascending order of complexity, before discussing what we think would be an ideal evaluation method for generative AI. 

We distinguish between two main strands of evaluation metrics: word-based, lexical metrics, and model-based, semantic metrics. Lexical metrics are usually based on the concepts of precision and recall.

Recall penalizes false negatives: data points that are misclassified as incorrect. It helps evaluate LLMs by answering the question: "To what extent is the target (ground truth) sequence captured by the predicted sequence?" Precision penalizes false positives: data points misclassified as correct. It can answer the question: "To what extent is the predicted sequence grounded in the target sequence?"
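
To make these two questions concrete, here is a minimal Python sketch that computes precision and recall over the words of a prediction and a target. The whitespace tokenizer and the function names are assumptions made for illustration; real evaluation code usually normalizes text more carefully.

```python
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def precision_recall(prediction: str, target: str) -> Tuple[float, float]:
    """Word-level precision and recall of a prediction against a target."""
    pred_counts = Counter(tokenize(prediction))
    target_counts = Counter(tokenize(target))
    # Multiset intersection: a word counts as matched at most as often
    # as it occurs in both sequences.
    overlap = sum((pred_counts & target_counts).values())
    pred_total = sum(pred_counts.values())
    target_total = sum(target_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / target_total if target_total else 0.0
    return precision, recall


precision, recall = precision_recall(
    prediction="the capital of france is paris",
    target="paris",
)
# The target is fully captured (recall 1.0), but only one of the six
# predicted words is grounded in the target (precision 1/6).
print(precision, recall)
```

The verbose prediction is not penalized by recall at all, which is why the two numbers are usually read together.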

Lexical metrics

Traditional metrics measure the extent to which the words in the generated response overlap with the words in the ground truth defined by the annotated evaluation dataset, working mainly with precision, recall, and combinations of the two. The most widely used lexical metrics in NLP are BLEU, ROUGE, and F1.

BLEU was originally designed to evaluate translations. It's precision-oriented and measures the degree to which words and word sequences in the prediction occur in the target. Although it is purely based on the words used, the inclusion of word sequences (n-grams) introduces the possibility of measuring some syntactic aspects, such as correct word order.
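
The core building block of BLEU, clipped n-gram precision, can be sketched in a few lines of Python. This is only that building block, not the full metric: BLEU additionally combines several n-gram orders and applies a brevity penalty. The function names and the whitespace tokenization are assumptions.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous word sequences of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(prediction: str, target: str, n: int) -> float:
    """Share of predicted n-grams that also occur in the target.

    Each predicted n-gram is credited at most as often as it appears
    in the target ("clipping").
    """
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(target.lower().split(), n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in pred.items())
    return matched / total


target = "the cat sat on the mat"
prediction = "the mat sat on the cat"  # same words, different order

print(clipped_ngram_precision(prediction, target, n=1))  # 1.0
print(clipped_ngram_precision(prediction, target, n=2))  # 0.8
```

The unigram score cannot tell the two sentences apart, while the bigram score drops because "mat sat" never occurs in the target. This is how n-grams bring word order into an otherwise purely word-based metric.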

ROUGE, on the other hand, is designed to evaluate summaries. It's really a collection of metrics that measure precision and recall in different ways. Like BLEU, it can be adjusted to account for word sequences of different lengths: ROUGE-1 takes only single words (1-grams or unigrams) into account, while ROUGE-2 looks at 2-grams (also known as bigrams), and so on. 
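
As an illustration, the recall side of ROUGE-N can be sketched as follows. This is a simplified version under stated assumptions (whitespace tokenization, a single reference summary, hypothetical function names); ROUGE implementations typically report precision and an F-score alongside recall.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous word sequences of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(prediction: str, target: str, n: int) -> float:
    """Share of the target's n-grams that are recovered by the prediction."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(target.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, pred[gram]) for gram, count in ref.items())
    return matched / total


target = "the report covers sales and costs"
prediction = "the report covers sales"

rouge_1 = rouge_n_recall(prediction, target, n=1)  # 4 of 6 target unigrams
rouge_2 = rouge_n_recall(prediction, target, n=2)  # 3 of 5 target bigrams
print(rouge_1, rouge_2)
```

Because the score is computed from the target's point of view, a summary that leaves out part of the reference loses points even if everything it does say is correct.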

F1 is a metric that is widely used in machine learning, not only for evaluating language models. Its success is due to the fact that it captures both precision and recall in a symmetric way: as the harmonic mean of the two, it weights them equally and is high only when both are high.
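
To make the symmetry concrete, here is a minimal Python sketch of F1 computed from a precision and a recall value. The function name is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Swapping precision and recall leaves the score unchanged.
print(f1_score(0.25, 1.0))
print(f1_score(1.0, 0.25))

# A perfect value on one side cannot compensate for a zero on the other.
print(f1_score(1.0, 0.0))  # 0.0
```

With a precision of 0.25 and a recall of 1.0, the arithmetic mean would be 0.625, whereas F1 stays close to the weaker of the two values.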

The obvious flaw of these lexical metrics is that they don't recognize purely semantic similarity, such as that conveyed by synonyms. They're also easily thrown off by changes in word order and by inserted or deleted words.

Transformer-based metrics

Despite their shortcomings, purely lexical-based metrics are still widely used today to evaluate the performance of LLMs: they're robust and easily scalable. What's more, AI researchers have yet to come up with an all-purpose, semantics-based metric. In part, this may be because what we perceive as accurate or correct depends very much on the task at hand. Therefore, different tasks could require different semantic similarity metrics.

Existing metrics that measure semantic rather than lexical similarity are often based on the transformer architecture – the same technology that powers LLMs themselves. One transformer-based metric that has been around for more than two years is semantic answer similarity (SAS). Developed by a team of NLP experts here at deepset, SAS uses a cross-encoder model to quantify the similarity of the LLM's prediction to the ground truth, regardless of vocabulary.
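
The idea behind a cross-encoder metric can be sketched as follows. This is not deepset's implementation. It assumes the third-party sentence-transformers library and the publicly available model "cross-encoder/stsb-roberta-large"; both choices, like the function names, are assumptions made for illustration. The scorer is passed in as a plain callable so that the model can be swapped out.

```python
from typing import Callable, List, Sequence, Tuple

# A pair scorer maps (prediction, target) pairs to similarity scores.
PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def load_cross_encoder_scorer(
    model_name: str = "cross-encoder/stsb-roberta-large",  # assumed model choice
) -> PairScorer:
    """Wrap a sentence-transformers CrossEncoder as a pair scorer."""
    # Imported lazily so the rest of the sketch works without the library.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)

    def score(pairs: List[Tuple[str, str]]) -> List[float]:
        # The cross-encoder reads both texts together and returns one
        # similarity score per pair.
        return [float(s) for s in model.predict(pairs)]

    return score


def semantic_answer_similarity(
    predictions: Sequence[str],
    targets: Sequence[str],
    scorer: PairScorer,
) -> List[float]:
    """Score each prediction against its ground-truth answer."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    pairs = list(zip(predictions, targets))
    return [float(s) for s in scorer(pairs)]


# Usage (downloads the model on first run):
# scorer = load_cross_encoder_scorer()
# print(semantic_answer_similarity(
#     ["She's killing it."], ["Léa is doing a great job."], scorer
# ))
```

Unlike the lexical metrics above, the score here does not depend on shared words: the model judges each pair as a whole.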

Recently, SAS has received renewed interest as offerings such as LlamaIndex have included it in their evaluation frameworks. And of course, you can use it in our own OSS framework, Haystack, and in our hosted LLM platform, deepset Cloud. However, rather than using it as the only evaluation metric, our recommendation is to combine SAS with other, lexical-based metrics.

Since transformer-based metrics must be trained on textual data, their performance depends on how close the evaluated language is to that training data. Thus, to evaluate predictions using a different style or even language, these metrics would need to be fine-tuned or even retrained from scratch.

The pitfalls of LLM evaluators

Since LLMs are so great at reasoning, why don’t we use one LLM to evaluate the output of another LLM? While many AI practitioners have followed this line of thought, they have quickly found some serious drawbacks to this approach:

  • It's expensive. LLM inference, whether through a provider like OpenAI or by hosting an OSS model somewhere yourself, costs a lot of money. But we want our evaluation methods to be highly scalable, so we can run them over and over again to benchmark many different setups.
  • It's unreliable. Studies have shown that LLM evaluators can rank responses according to the order in which they are presented rather than their quality, and LLMs have been reported to have a strong bias toward their own predictions.
  • It's hard to normalize. We evaluate a model not for its own sake, but to get a score – a numerical value between zero and one that we can then use to compare with other models. While we can give the LLM evaluator guiding examples in our prompt, it's still extremely hard to get it to reliably output a normalized value.
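
Two of these drawbacks can be made tangible with a short Python sketch. The first function shows why normalization is fragile: a free-text reply has to be parsed into a value between zero and one, and many replies don't fit. The second guards against order bias by asking the evaluator twice with the candidates swapped. The judge is a hypothetical callable standing in for an LLM call, and all names here are assumptions.

```python
import re
from typing import Callable, Optional


def parse_normalized_score(reply: str) -> Optional[float]:
    """Extract a single score in [0, 1] from a judge's reply, else None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reply)
    if len(numbers) != 1:
        return None  # no number at all, or an ambiguous reply
    value = float(numbers[0])
    return value if 0.0 <= value <= 1.0 else None


def order_consistent_preference(
    judge: Callable[[str, str], str],  # hypothetical: returns "first" or "second"
    answer_a: str,
    answer_b: str,
) -> Optional[str]:
    """Return the preferred answer only if the verdict survives a swap."""
    winner_1 = answer_a if judge(answer_a, answer_b) == "first" else answer_b
    winner_2 = answer_b if judge(answer_b, answer_a) == "first" else answer_a
    return winner_1 if winner_1 == winner_2 else None


print(parse_normalized_score("Score: 0.8"))               # 0.8
print(parse_normalized_score("I'd give it 7 out of 10"))  # None
print(parse_normalized_score("Great answer!"))            # None


def always_first(first: str, second: str) -> str:
    """A stub judge that always prefers whatever it sees first."""
    return "first"


print(order_consistent_preference(always_first, "answer A", "answer B"))  # None
```

Discarding inconsistent verdicts doubles the number of evaluator calls, which adds to the cost problem described in the first point.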

An ideal LLM metric

Semantic evaluation metrics are still in their infancy, unlike the language models they are designed to evaluate. This opens up a fruitful area for research. In our view, for an LLM metric to be successful, it needs to be adaptable to different use cases. For example, a chatbot can afford to be a bit more verbose if that makes it more helpful to the customer. A summarization tool, on the other hand, should be more concise and, most importantly, strictly grounded in the underlying documents.

Helpfulness, brevity, and groundedness are just some of the dimensions to measure the quality of an LLM. Others might include the creativity or originality of the output, coherence, appropriateness to context and audience, and whether or not the model correctly identified the presence of the correct answer in the text. An ideal LLM metric would allow users to weight these factors differently depending on the task.
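
One simple way to picture such task-dependent weighting is a weighted average over per-dimension scores. The sketch below assumes that each dimension has already been scored between zero and one by some other means; the dimension names, scores, and weights are made up for illustration.

```python
from typing import Dict


def weighted_quality_score(
    scores: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-dimension scores, normalized by total weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for: {sorted(missing)}")
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Hypothetical scores for one model output.
scores = {"helpfulness": 0.9, "brevity": 0.4, "groundedness": 0.7}

# The same output, judged for two different use cases.
chatbot_weights = {"helpfulness": 3, "brevity": 1, "groundedness": 2}
summarizer_weights = {"helpfulness": 1, "brevity": 2, "groundedness": 3}

print(weighted_quality_score(scores, chatbot_weights))     # about 0.75
print(weighted_quality_score(scores, summarizer_weights))  # about 0.63
```

The same verbose but helpful output scores higher as a chatbot reply than as a summary, purely because of how the dimensions are weighted.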

At deepset, we're currently implementing a groundedness metric for RAG (retrieval augmented generation) tasks. This metric quantifies the degree to which an LLM's output is grounded in the documents it received from the retrieval component. When completed, it will allow our deepset Cloud users to measure and compare the fidelity of different models to an underlying knowledge base. This is an important step towards evaluating LLMs in a scalable manner, as it makes it increasingly easy to interpret the workings of large language models, and to trust their predictions.
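
The article doesn't describe how this metric works internally, so the following Python sketch is not deepset's method. It is a deliberately crude, purely lexical stand-in that only illustrates what a groundedness score expresses: the share of answer sentences that are supported by the retrieved documents. The sentence splitter, the 0.8 threshold, and all names are assumptions.

```python
import re
from typing import List, Set


def words(text: str) -> Set[str]:
    """Lowercased set of word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def naive_groundedness(
    answer: str,
    documents: List[str],
    threshold: float = 0.8,  # assumed cutoff for calling a sentence grounded
) -> float:
    """Share of answer sentences whose words mostly occur in the documents."""
    doc_vocab: Set[str] = set()
    for doc in documents:
        doc_vocab |= words(doc)

    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0

    grounded = 0
    for sentence in sentences:
        tokens = words(sentence)
        if tokens and len(tokens & doc_vocab) / len(tokens) >= threshold:
            grounded += 1
    return grounded / len(sentences)


documents = ["The invoice is due on the first of March."]
answer = "The invoice is due on the first of March. Late payments incur a fee."

# The first sentence is covered by the document, the second is not.
print(naive_groundedness(answer, documents))  # 0.5
```

A word-overlap proxy like this inherits all the weaknesses of lexical metrics discussed earlier, so a semantic, model-based approach is the more promising direction.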

Why user feedback is still king

All evaluation metrics in machine learning are designed to emulate a human using the model. That's because what we're ultimately trying to measure is the usefulness of our models in the real world. In the absence of a universal metric for LLMs, collecting feedback from real users remains essential. In deepset Cloud, we provide a simple feedback interface where users can give predictions a thumbs up or thumbs down. This in itself is very useful for comparing the performance of different models.
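
As a minimal sketch of how thumbs-up and thumbs-down feedback can be turned into a comparison between models, the following Python snippet computes an approval rate per model. The data layout, the model names, and the feedback records are made up for illustration and don't reflect how deepset Cloud stores feedback.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def approval_rates(feedback: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of thumbs-up votes per model.

    Each feedback item is a (model_name, thumbs_up) pair.
    """
    ups: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for model, thumbs_up in feedback:
        totals[model] += 1
        ups[model] += int(thumbs_up)
    return {model: ups[model] / totals[model] for model in totals}


# Toy feedback records, invented for this example.
feedback = [
    ("model-a", True), ("model-a", True), ("model-a", False),
    ("model-b", True), ("model-b", False), ("model-b", False), ("model-b", False),
]

rates = approval_rates(feedback)
print(rates)  # model-a: 2 of 3 approved, model-b: 1 of 4 approved
```

With only a handful of votes per model, differences like these are not yet meaningful; the rates become more trustworthy as more feedback comes in.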

But user feedback can give you so much more: it can uncover pain points and usage patterns that you would never have thought of yourself, and it can provide useful information about the terminology that your users actually use in the real world and that your language model should be able to capture.

Let’s start evaluating

In this article, we've discussed various approaches to evaluating the output of LLMs. Combined with our article on retriever evaluation, this gives you a good foundation for evaluating retrieval augmented generative pipelines. To learn more, check out Rob's webinar on RAG evaluation, which walks viewers through best practices for planning and evaluating user feedback.