Detecting Hallucinations in deepset Cloud

Our new node verifies whether an LLM’s answers are grounded in facts

The face of a cat emerges from a treetop (Alice in Wonderland illustration by John Tenniel, via Wikimedia Commons)

Large language models are useful but often unreliable. By now, we've probably all seen examples of LLM-generated responses that contained fabricated facts. In particular, when an LLM is asked about something it has no information about – such as recent events that happened after its training cutoff date – it may simply make up facts.

This phenomenon is commonly referred to as “hallucination.” Because hallucinations are difficult to detect, even for humans, they are one of the biggest obstacles to using LLMs in industry applications today. If you’re thinking about building an LLM-based, customer-facing system, giving your users false information could have disastrous consequences.

To address this issue, a small team here at deepset has spent the past few months researching and implementing methods to combat hallucinations head-on. We’re excited to announce our new hallucination detector for retrieval-augmented generation, which tells you whether or not a model’s output is an accurate reflection of the documents in your database.

If you’ve been thinking about implementing an LLM-based product and don’t want to undermine your users’ trust, if you’ve been worried about hallucinating LLMs in general, or if you just want to see our state-of-the-art hallucination detector in action, then this article is a must-read.

Why do hallucinations happen?

A defining characteristic of natural language is its creativity. As fluent speakers of a language, we can come up with novel phrases on the fly and still expect to be understood, given the right context. For example, just a year ago, the phrase “hallucinations in large language models” probably wouldn’t have meant anything to most people. Yet these days, it seems like a perfectly normal thing to discuss.

Large language models have been trained to perform their own version of linguistic creativity. Through prompting, we can guide them to use that creativity to write product descriptions, draft emails, chat with customers, and so on. But these models are so tuned to do our bidding that they continue to generate content even when they don’t know the answer. At that point, they often start to hallucinate.

As LLMs have become more prevalent in many people's lives, there have been countless reports of these models making up academic references, misquoting people, and contradicting themselves, often to comical effect. So if you have no way to fact-check an LLM’s claims on a granular level, its output becomes pretty useless for most real-world applications.

How to combat hallucinations

One of the most effective techniques for avoiding hallucinations in the first place is retrieval-augmented generation (RAG), which we wrote about in our last blog post. RAG retrieves documents from a curated database in response to a query. It then asks the LLM to base its response on these documents. This allows us to combine an LLM's conversational skills and general knowledge of the world with fact-checked data from a source we control.

Another surprisingly simple technique is to tune your LLM prompt to guide the model toward the right answer. You can, for example, tell it explicitly that it should admit when it doesn’t know something. You can also push the LLM to ground its responses in facts by asking it to add citations to its claims, like the ones we find in scientific documents. Here's an example prompt:

Answer the question truthfully based solely on the given documents. Cite the documents using Document[number] notation. If multiple documents contain the answer, cite those documents like 'as stated in Document[number], Document[number], etc.'. If the documents do not contain the answer to the question, say that answering is not possible given the available information.
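
To make these two techniques concrete, here is a minimal, framework-agnostic sketch that wraps the grounding prompt above around a handful of retrieved documents. The toy word-overlap retriever and the call_llm placeholder are illustrative assumptions, not part of deepset Cloud; in practice, you would plug in your own document store and LLM client.

```python
# Minimal sketch of retrieval-augmented generation with a grounding prompt.
# The toy word-overlap retriever and the call_llm placeholder are illustrative
# assumptions; swap in your own retriever and LLM API client.

GROUNDING_INSTRUCTIONS = (
    "Answer the question truthfully based solely on the given documents. "
    "Cite the documents using Document[number] notation. "
    "If multiple documents contain the answer, cite those documents like "
    "'as stated in Document[number], Document[number], etc.'. "
    "If the documents do not contain the answer to the question, "
    "say that answering is not possible given the available information."
)

DOCUMENTS = [
    "Operating income was $18.2 billion, down 17% versus last year.",
    "Net income was $13.6 billion.",
    "We delivered free cash flow of $16 billion in the fourth quarter.",
]


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Toy retrieval: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        DOCUMENTS,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def call_llm(prompt: str) -> str:
    """Placeholder for a call to your LLM provider's API."""
    raise NotImplementedError


def answer_with_rag(query: str) -> str:
    documents = retrieve(query)
    context = "\n".join(f"Document[{i + 1}]: {doc}" for i, doc in enumerate(documents))
    prompt = f"{GROUNDING_INSTRUCTIONS}\n\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)
```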

But even if you combine all the tricks mentioned here – and we strongly recommend you do – a retrieval-augmented model with a cleverly engineered prompt can still hallucinate.

In recent months, an increasing number of researchers have been looking for ways to determine when a generative model is hallucinating. Most of the proposed techniques require a ground truth against which to compare the model's responses, which makes them particularly useful in a RAG scenario.

Hallucination detection for RAG in deepset Cloud

In deepset Cloud, we work with modular pipelines from our OSS framework Haystack. These allow us to plug together different building blocks and connect them to one another.

In our example, we use a pipeline with:

  1. a retrieval node that finds the most suitable documents from the database to answer a query, 
  2. a prompt node that connects to the LLM, and 
  3. the hallucination detector node.

Sketch of a RAG pipeline with a hallucination detection node.
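
Below is a rough sketch of how such a pipeline could be wired up with the open-source Haystack (v1.x) Python API. The HallucinationDetector class is a hypothetical stub standing in for deepset Cloud's actual component, whose name and interface may differ, and the retriever and LLM settings are only examples.

```python
# Rough sketch of a RAG pipeline with a hallucination detection step, using the
# open-source Haystack (v1.x) API. The HallucinationDetector below is a
# hypothetical stub: deepset Cloud's actual node name and interface may differ.
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, PromptNode
from haystack.nodes.base import BaseComponent


class HallucinationDetector(BaseComponent):
    """Hypothetical stand-in for deepset Cloud's hallucination detection node."""

    outgoing_edges = 1

    def run(self, query=None, documents=None, results=None):
        # Here you would score each sentence of the generated answer (`results`)
        # against the retrieved `documents`, e.g. with an NLI-style model as
        # sketched further below, and attach a support category to the output.
        return {"query": query, "documents": documents, "results": results}, "output_1"

    def run_batch(self, queries=None, documents=None, results=None):
        raise NotImplementedError


document_store = InMemoryDocumentStore(use_bm25=True)  # swap in your own store
retriever = BM25Retriever(document_store=document_store, top_k=5)
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",  # assumption: any supported LLM works
    api_key="YOUR_API_KEY",
    # In practice, you would also configure a grounding prompt template
    # like the one shown earlier.
)

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
pipeline.add_node(
    component=HallucinationDetector(), name="HallucinationDetector", inputs=["PromptNode"]
)

result = pipeline.run(query="What was the operating income in the fourth quarter of 2022?")
```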

The hallucination detector produces a score that expresses how semantically similar a sentence is to a given source document. Researchers at Ohio State University were among the first to work on this problem, which is known as “evaluating attribution by large language models.” We used data from their project to train our own model: a fine-tuned DeBERTa that beat their best model by ten points.  
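
To illustrate the underlying idea, here is a minimal sketch of how an NLI-style cross-encoder can score an answer sentence against a source document. It uses a generic, off-the-shelf DeBERTa checkpoint from the Hugging Face Hub rather than deepset's fine-tuned model, so the numbers are only indicative.

```python
# Sketch: score how well an answer sentence is supported by a source document,
# using a generic DeBERTa NLI cross-encoder. This is NOT deepset's fine-tuned
# model; the checkpoint below is only an illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "microsoft/deberta-large-mnli"  # assumption: any NLI checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)


def support_score(document: str, answer_sentence: str) -> dict[str, float]:
    """Return entailment/neutral/contradiction probabilities for one pair."""
    inputs = tokenizer(document, answer_sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze()
    return {model.config.id2label[i].upper(): probs[i].item() for i in range(len(probs))}
```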

The scores that the hallucination detector returns are converted into one of four categories. “Full support” means that a sentence is highly similar to at least one of the documents. “Partial support” is equivalent to a medium similarity score, while “no support” signifies a low similarity score. The fourth category is “contradiction”, which means that the claim in the answer sentence is incompatible with what the source documents say. 
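
Building on the support_score sketch above, the mapping could look roughly like this. The thresholds and the "at least one document" aggregation are illustrative assumptions, not deepset Cloud's actual logic.

```python
# Sketch: map per-sentence scores to the four support categories.
# Thresholds are made up for illustration purposes.

def categorize(scores: dict[str, float], high: float = 0.75, low: float = 0.4) -> str:
    if scores["CONTRADICTION"] >= high:
        return "contradiction"
    if scores["ENTAILMENT"] >= high:
        return "full support"
    if scores["ENTAILMENT"] >= low:
        return "partial support"
    return "no support"


def best_support(documents: list[str], answer_sentence: str) -> str:
    # A sentence counts as fully supported if at least one document supports it,
    # so we keep the most favorable score across all retrieved documents.
    all_scores = [support_score(doc, answer_sentence) for doc in documents]
    best = max(all_scores, key=lambda s: s["ENTAILMENT"])
    return categorize(best)
```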

The hallucination detector in action

To showcase how the hallucination detector works, we set up a prototype on top of a small database. It contains transcripts of earnings calls in 2022 from Alphabet, Apple, and Microsoft. Using deepset Cloud’s interactive demo feature, we can now ask questions like the following: 

Screenshot of a fact-checked answer in deepset Cloud, no hallucination detected.

The model answers our query confidently, and the hallucination detector concurs that the answer is correct, classifying it as being fully supported by the documents in our database. By looking at the document listed under “Sources,” we can check for ourselves that this is true:

Operating income was $18.2 billion, down 17% versus last year, and our operating margin was 24%. Net income was $13.6 billion. We delivered free cash flow of $16 billion in the fourth quarter and $60 billion in 2022.

To verify that the hallucination detector is really working, we change our query slightly to ask about an event that the database contains no information about:

Screenshot of a fact-checked answer in deepset Cloud, hallucination detected.

Rather than indicating that the information we requested is not available yet, the LLM produced a hallucination. Luckily, our model is able to classify it as such. The same happens when we ask about a tech company whose earnings call transcript is not in our dataset:

Screenshot of a fact-checked answer in deepset Cloud, hallucination detected.

To be clear, this kind of obviously hallucinated response shouldn't happen often if the model is powerful enough and the prompt is well designed. For demonstration purposes, we used a weaker prompt than we would normally use, resulting in the examples we see here. But even if hallucinations don't happen that often, knowing when they do occur is invaluable.

Beyond hallucination detection

Being able to compare an LLM’s output to a ground truth is only the starting point. How you use that information is up to you and depends on the degree of precision that your application requires.

For example, you could decide to show your users only fully and partially supported sentences to be on the safe side. Or you could show them the hallucinations as well, and include a disclaimer that they should take the claim in question with a grain of salt.
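
As a sketch of what that decision could look like in code, assuming the detector returns one category per answer sentence (a hypothetical, simplified output format):

```python
# Sketch: decide what to show the user based on per-sentence support categories.
# The (sentence, category) input format is a simplifying assumption.

SAFE_CATEGORIES = {"full support", "partial support"}


def render_answer(sentences: list[tuple[str, str]], strict: bool = True) -> str:
    shown = []
    for sentence, category in sentences:
        if category in SAFE_CATEGORIES:
            shown.append(sentence)
        elif not strict:
            # Lenient mode: keep the sentence, but warn the user about it.
            shown.append(f"{sentence} [unverified claim - treat with caution]")
    return " ".join(shown)
```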

Knowing where a hallucination could occur in your LLM’s output gives you the power to decide what to do with it. This is a great step towards making LLMs fully ready for applications even in sensitive industries like law, insurance, and finance.

Making LLMs production-ready

Hallucinations in LLMs are a complex topic, and much more remains to be said about them. If you want to hear in more detail about how we implemented the hallucination detector for deepset Cloud and see how it compares to other approaches, check out Thomas Stadelmann’s webinar on detecting hallucinations in deepset Cloud. In the webinar, Thomas takes us through an interactive demo of hallucination detection in action.

Our current implementation for hallucination detection is just the beginning – we’re on a mission to improve the reliability of systems fueled by powerful LLMs. To stay updated about what we do, make sure to follow us on Twitter and find us on LinkedIn.