Supercharging Elasticsearch with Haystack Neural Question Answering

Start using the latest in NLP-driven search technology today!


Branden Chan

By now we’re all aware of the power of data. Yet having access to data is only half the battle — what’s equally important is our ability to make sense of it. That’s why we created Haystack, a framework for building open-domain question answering systems that will allow your customers to ask questions in natural language, and receive answers right away.

Our technology lets you take the power of the latest Transformer-based language models, and combine it with the speed of Elasticsearch distributed storage. Read on to learn more about using Haystack with Elasticsearch.

Question Answering with Haystack

Natural language is unstructured data, meaning that it’s tricky to handle computationally. Luckily, the recent advances in natural language processing (NLP) have brought about a number of ingenious solutions for representing language data.

Deep neural language models like BERT and RoBERTa are trained to represent words as complex vectors known as word embeddings. These vectors are used to perform various NLP tasks, often by stacking additional layers on top of the word embedding network. That, in short, is how language models may be used in the context of question answering (QA).
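The way such vectors are compared can be illustrated with a toy cosine-similarity sketch in plain Python. The three-dimensional "embeddings" below are made up for the example; real models like BERT produce vectors with hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings", purely for illustration.
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.1]
apple = [0.1, 0.1, 0.9]

print(cosine_similarity(king, queen))  # close to 1.0: related words
print(cosine_similarity(king, apple))  # much lower: unrelated words
```

Semantically similar words end up close together in the vector space, which is what lets the models downstream reason about meaning rather than spelling.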

Haystack’s Retriever-Reader pipelines perform what’s known as “open-domain extractive question answering.” For any query, they identify a number of answer candidates from a text collection, which are ranked according to their matching probability. The pipeline’s output, then, is a list of passages most likely to answer the given question — which might include no answer at all, if none passes the matching threshold. Check out our blog post for more detail on how QA systems work under the hood.

Creating your own end-to-end question answering pipeline with Haystack is easy, with many different configurations at your disposal. Our QA systems consist of three main customizable modules that are invoked in the following order:

  1. DocumentStore: This is where you keep the documents to be searched by the model. Haystack offers different storage solutions depending on the use case, and your choice will also depend on whether you opt for sparse or dense document representations. In this tutorial, we’ll be using Elasticsearch as our document store. During initialization, you may create a new ES cluster, or connect to an existing one by specifying the host and port in the object’s parameters. If you’re using standard HTTP authentication, you may pass your username and password to the store using the respective parameters.
  2. Retriever: It’s not feasible to apply the computationally expensive QA model to all documents, each time we ask a question. The Retriever module acts as a quick filter for finding the best candidate documents by calculating the similarity between query and document vectors. Those vectors can be sparse (working with simple word count functions like tf-idf or BM25) or dense (using Transformer-based embedding models). While sparse representations work out-of-the-box, dense ones rely on pre-trained language models that may be adapted to your data. If you connected to an existing ES cluster in the previous step, and choose to work with a dense retriever, you’ll have to update your database with the dense vector representations.
  3. Reader: After applying the retriever sieve, we’re ready to run the selected documents through our fine-meshed QA model. In this module, you may choose between FARMReader and TransformersReader (learn about their differences here). You’ll also need to decide on a language model, which will depend on your computational resources, your use case — and of course, the language that you work with. You’ll also need to consider any time constraints. For example, the ALBERT XXL model is best in class in terms of performance, but it also requires much more computational power than RoBERTa. On the other hand, the MiniLM model manages to be twice as fast as the base model by sacrificing a bit of accuracy.
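To make the Retriever’s sparse scoring more concrete, here is a toy tf-idf sketch in plain Python. It illustrates the idea of term-frequency weighting, not Haystack’s actual implementation (the documents and the smoothed idf variant are made up for the example):

```python
import math
from collections import Counter

# Three toy "documents", invented for this example.
documents = [
    "Jon Snow kills Daenerys in the throne room",
    "Daenerys flies to Meereen with her dragons",
    "Bran Stark is proclaimed king of the Six Kingdoms",
]

def tf_idf_scores(query, docs):
    """Score each document by the summed tf-idf weight of the query terms it contains."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # document frequency
            if df > 0:
                idf = math.log(n_docs / df) + 1.0  # smoothed idf (illustrative variant)
                score += counts[term] * idf
        scores.append(score)
    return scores

scores = tf_idf_scores("who kills daenerys", documents)
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best])  # the first document, the only one containing "kills"
```

Rare terms like “kills” carry more weight than common ones, so the first document wins; a real retriever such as BM25 refines this weighting, but the principle is the same.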

In the next sections, we’ll show you how to adapt the components of a Haystack question answering pipeline to your use case.

Extending Elasticsearch Capabilities with Haystack

Elasticsearch (ES) is a NoSQL database and search engine that stores its documents in a decentralized manner, distributing them over several nodes. In addition to its distributed and schema-less nature, Elasticsearch offers solutions for querying natural language documents. Unstructured text documents are stored in an inverted index, which allows them to be matched quickly against search queries. Organizations of all sizes and purposes use ES as a search and analytics engine and to manage their documents.

However, despite using some preprocessing techniques that make it possible to, say, match a singular search term to its plural form in a document, Elasticsearch queries are still keyword-based. In other words, they do not understand syntactic or semantic subtleties. For instance, as any Game of Thrones fan would attest, there’s a big difference between the two questions, “Who does Daenerys kill?” and “Who kills Daenerys?”

Spoiler alert: the examples in this article contain answers that may spoil your Game of Thrones experience.
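A toy example makes the problem concrete: once a keyword engine lowercases, stems, and unorders the words, the two questions collapse into the same query. The tiny stop-word list and one-character stemmer below are made up for illustration, standing in for Elasticsearch’s real analyzers:

```python
# A tiny, made-up stop-word list and stemmer -- purely for illustration.
STOP_WORDS = {"who", "does", "did"}

def stem(token):
    """Crude suffix stripping (e.g. 'kills' -> 'kill')."""
    return token[:-1] if token.endswith("s") else token

def bag_of_words(query):
    """Reduce a query to an unordered set of stemmed content words."""
    tokens = set(query.lower().replace("?", "").split()) - STOP_WORDS
    return {stem(t) for t in tokens}

q1 = bag_of_words("Who does Daenerys kill?")
q2 = bag_of_words("Who kills Daenerys?")
print(q1 == q2)  # True: keyword search cannot tell the two questions apart
```

Both questions reduce to the same bag of words, so a purely keyword-based engine must treat them as identical queries.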

This is where BERT and similar models come in. Combining complex Transformer-based architectures with huge amounts of training data, these models understand language at a much deeper level than simple keyword combinations. By pairing these models with Elasticsearch’s quick lookup and storage capabilities, Haystack makes it possible to search an ES database with natural language queries.

Running Haystack with Elasticsearch: a demonstration

To follow along with this example, make sure that you have access to a GPU. While some Haystack models can run on a CPU, it will slow them down considerably. Google’s Colaboratory (or Colab) provides free GPU access, and you may want to use it to try out the models. If you want to see how we handle the setup and installation in Colab, check out this tutorial. For our present example, we’re using a GPU-enabled cloud compute instance and will run all commands through SSH.

Setting up Haystack and Elasticsearch

We’ll be using Docker to run an Elasticsearch instance. To install Docker, follow the instructions for your operating system.

Once Docker is set up, we use it to pull the Elasticsearch image and start a single-node cluster:

docker pull elasticsearch:7.9.2
docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.2

Next, we create a virtual environment with Conda and activate it:

conda create -n haystack python=3.7
conda activate haystack

Within our virtual env, we install the Haystack framework from PyPI:

pip install farm-haystack

Initializing the question answering pipeline

Let’s fire up Python and import all relevant functions and classes to build our QA pipeline:

from haystack.document_store import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore()
document_store.delete_all_documents()

We included the last line to get rid of all leftover documents that might still occupy our storage from the last run. This step is important to avoid receiving duplicate answers.

Note that the delete_all_documents() command is scheduled to be replaced by the more versatile delete_documents().

It’s now time to fill our document store with documents! ES expects a list of dictionaries as input. For this example, we’ll use our trusted Wikipedia subcorpus consisting of 183 articles about the Game of Thrones universe. The following functions are needed to shape our data into a format that ES likes:

from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.preprocessor.cleaning import clean_wiki_text

If you’d like to view the articles to get a feel for the documents we’re working with, simply download and unzip the file following the link below:

doc_dir = "got_texts"
got_wiki_url = "..."

fetch_archive_from_http(url=got_wiki_url, output_dir=doc_dir)
got_dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
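Each of the resulting dictionaries follows a simple shape: in Haystack 0.x, a "text" field holding the passage plus an optional "meta" dict for extra information. The values below are invented for illustration:

```python
# One document in the format Haystack's document store expects (v0.x field
# names); the text and filename here are made up for illustration.
doc = {
    "text": "Jon Snow is a fictional character in the A Song of Ice and Fire series...",
    "meta": {"name": "Jon_Snow.txt"},
}
```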

How many documents have we got in our list of dictionaries?

len(got_dicts)

>>> 2497

We get this result because we split the documents into paragraphs. Note that a “document” in ES is not necessarily equal to a document in the real world. Breaking longer documents up into shorter passages eases search and reduces computation times.


There’s one more step before we can search: writing the dictionaries to the document store.

document_store.write_documents(got_dicts)

Now that our text collection is in the document store, we’re ready to build a pipeline. It’s at this point that we actually have to make some decisions about the nature of the models that make up our system.

For our use case, we’ll stick to the sparse ElasticsearchRetriever:

from haystack.retriever.sparse import ElasticsearchRetriever

retriever = ElasticsearchRetriever(document_store=document_store)

In the next step, we’ll have to decide on a Reader class and a language model. We’ll be using our very own FARMReader, loaded with a RoBERTa model that has been fine-tuned on the SQuAD 2.0 question answering dataset. RoBERTa is a solid choice when working on a GPU, but if you only have access to a CPU, you might want to switch to the lightweight MiniLM model instead. Let’s load the model:

from haystack.reader import FARMReader

my_model = "deepset/roberta-base-squad2"
reader = FARMReader(model_name_or_path=my_model, use_gpu=True)

For every query, we want to run the Retriever first, followed by the Reader. With the help of Haystack’s pipeline module, we can simply chain the two steps together. Let’s also import the handy print_answers() function for pretty-printing the results of our query:

from haystack.pipeline import ExtractiveQAPipeline
from haystack.utils import print_answers

pipe = ExtractiveQAPipeline(reader, retriever)

Running queries

Will we finally be able to get an answer to our question? Let’s ask:

question = "Who kills Daenerys?"
answer = pipe.run(query=question, top_k_retriever=10, top_k_reader=2)
print_answers(answer, details="minimal")

This returns:

{   'answer': 'Jon',
    'context': 'rion denounces Daenerys and is imprisoned for treason to await execution. Jon, unable to stop her, kills Daenerys. Bran Stark is proclaimed king, allo'},
{   'answer': 'Daario',
    'context': 'aenerys that she must send a champion to fight the Champion of Meereen. Daario is selected and he kills the Champion. Daenerys tells slaves from Meere'}

What if we wanted to know more about these answers, such as the model’s degree of certainty in its predictions? Setting the details parameter of print_answers() to "all" gives us all the information we need. Because this setting produces quite a long output, let’s just look at the first result:

{   'answer': 'Jon',
    'context': 'rion denounces Daenerys and is imprisoned for treason to await execution. Jon, unable to stop her, kills Daenerys. Bran Stark is proclaimed king, allo',
    'document_id': 'eb7f0b2f-688a-4134-beb0-ee6353309158',
    'meta': {'name': '360_List_of_Game_of_Thrones_episodes.txt'},
    'offset_end': 77,
    'offset_end_in_doc': 751,
    'offset_start': 74,
    'offset_start_in_doc': 748,
    'probability': 0.9234894514083862,
    'score': 9.971123695373535}

The model is 92% confident that Jon is Daenerys’ killer. That probability is the value of greatest interest to us at this point. Let’s write a quick function that returns only the answers and their probabilities:

def short_answer(answers):
    for i, answer in enumerate(answers['answers']):
        print('Answer no. {}: {}, with {} percent probability'.format(i, answer['answer'], round(answer['probability'] * 100)))
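We can sanity-check the helper on a hand-made result dictionary before running it on real pipeline output (the function definition is repeated so the snippet is self-contained; the second probability value is mock data):

```python
# short_answer() repeated here so the snippet is self-contained.
def short_answer(answers):
    for i, answer in enumerate(answers['answers']):
        print('Answer no. {}: {}, with {} percent probability'.format(
            i, answer['answer'], round(answer['probability'] * 100)))

# A hand-made result in the shape the pipeline returns (mock values).
mock_result = {
    'answers': [
        {'answer': 'Jon', 'probability': 0.9234894514083862},
        {'answer': 'Daario', 'probability': 0.61},
    ]
}

short_answer(mock_result)
# Answer no. 0: Jon, with 92 percent probability
# Answer no. 1: Daario, with 61 percent probability
```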

In the previous example, our model seemed pretty certain. Let’s do a quick sanity check and see if our model indeed understands that the following question is different from the previous one:

question = "Who did Daenerys kill?"
answer = pipe.run(query=question, top_k_retriever=10, top_k_reader=2)
short_answer(answer)

>>> Answer no. 0: Kraznys, with 43 percent probability
>>> Answer no. 1: Thirteen, with 22 percent probability

It definitely understood the difference between Daenerys the victim and Daenerys the agent. But the model doesn’t have much confidence in these answers. To increase the quality of the model’s output, we may extend the search space: by hiking up the top_k_retriever parameter, we increase the number of documents that our Reader sees.

Be aware that this decision will have an impact on our search time. Because the Reader is much more computationally expensive than the Retriever, passing more documents to the Reader means that our program will take significantly longer to run. To learn more about how your choice of parameters might impact the system’s speed, check out our article on optimization.

Let’s see how a higher number of potential answer passages impacts our output:

question = "Who did Daenerys kill?"
answer = pipe.run(query=question, top_k_retriever=100, top_k_reader=5)
short_answer(answer)

>>> Answer no. 0: slave masters, with 98 percent probability
>>> Answer no. 1: thousands of innocents, with 73 percent probability
>>> Answer no. 2: Samwell Tarly’s father and brother, with 90 percent probability
>>> Answer no. 3: all slavers, with 40 percent probability
>>> Answer no. 4: Khal Moro’s bloodriders, with 50 percent probability

This looks great! Not only do these answers better address our question, but the system also has greater confidence in them. Sometimes it makes sense to sacrifice speed for accuracy.

As we’ve established, being able to return None in the absence of any solid answer candidates is part of a trustworthy question answering system. This makes sense — after all, we’d rather have someone admit that they don’t know the answer to a question, rather than giving us incorrect information. Let’s see what happens when we ask about entities that are foreign to the GoT universe:

question = "Where is Warsaw?"
answer = pipe.run(query=question, top_k_retriever=10, top_k_reader=1)
short_answer(answer)

>>> Answer no. 0: Iron Islands, with 1 percent probability

It’s pretty unlikely that our GoT corpus contains any information about the Polish capital. However, our system still returns low-probability answers. To change that, we need to enable the return_no_answer parameter when we initialize the reader:

reader = FARMReader(model_name_or_path=my_model, use_gpu=True, return_no_answer=True)

When we run the same query through the reader, this is the output:

>>> Answer no. 0: None, with 67 percent probability

In addition to enabling return_no_answer, we may set the no_ans_boost parameter to some positive value to nudge our model towards returning more None answers. This is useful if you want your system to only return an answer when it’s reasonably certain.

Haystack: A Modern Solution for Neural Search and Question Answering

Looking to integrate question answering into your applications? At Haystack, we’re happy to guide you through the process!

To learn more about how to customize your question answering systems with the latest neural language models, check out the Haystack documentation.

Find Haystack on GitHub and give us a star if you find it useful! :)