Accelerate Your Haystack Question Answering System with GPUs

We explain what GPUs are, and demonstrate how you can leverage them to accelerate a Haystack question answering system.

16.07.21

This article is the second in our series on optimizing your Haystack question answering system. We’ll link to the other articles here as they go online.

Deep learning’s success is often attributed to advancements in algorithms and data collection, but less is said about another equally important component: the hardware.

Neural networks have greatly improved the performance of natural language processing over the years, and the field saw a breakthrough in 2018 with the introduction of the deep language model BERT. Since then, transformer-based language models (BERT included) have been steadily advancing the state of the art. However, this progress would hardly be possible without a powerful piece of hardware: the graphics processing unit (GPU).

In this post, we’ll explain what GPUs are, and demonstrate how you can harness their power to accelerate your Haystack question answering system.

What Is a GPU?

A graphics processing unit is a computer chip designed specifically to accelerate graphics rendering. NVIDIA built the first ever GPU in 1999. With the release of its flagship GeForce 256, the company sought to meet a growing demand for computer graphics that central processing units (CPUs) had been unable to fulfill.

CPUs have only a few cores and perform computations largely sequentially, which makes them slow at highly parallel workloads. GPUs, on the other hand, consist of thousands of cores that excel at performing computations in parallel. Displaying video graphics requires recomputing every pixel of a frame with each refresh, and GPUs meet that need precisely because of their capacity for parallel processing.

At its core, deep learning consists of immense amounts of computations involving vectors and matrices. Just like graphics rendering, these are easily parallelizable operations. GPUs therefore prove to be the ideal solution for the training and inference of artificial neural networks.
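
To get a feel for why this matters, here is a small PyTorch sketch that times the same large matrix multiplication on the CPU and on a GPU. It assumes a CUDA-capable GPU is available, and the exact speedup will vary with your hardware:

import time

import torch

x = torch.randn(4000, 4000)

# CPU: the multiplication is spread over a handful of cores
start = time.time()
x @ x
print(f"CPU matmul: {time.time() - start:.2f}s")

if torch.cuda.is_available():
    x_gpu = x.to("cuda")
    torch.cuda.synchronize()
    # GPU: the same operation is split across thousands of cores
    start = time.time()
    x_gpu @ x_gpu
    torch.cuda.synchronize()  # wait for the kernel to finish before stopping the timer
    print(f"GPU matmul: {time.time() - start:.2f}s")

On typical hardware, the GPU version finishes in a small fraction of the CPU time, and operations like this one dominate the workload of a neural network.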

How Do GPUs Speed Up Transformers?

The Transformer is a pioneering neural architecture that goes beyond its precursors with its ability to be trained in a fully parallel manner. Like any neural network, a Transformer consists of a large number of interconnected nodes whose weights are updated iteratively during training. Unlike recurrent networks, however, which process a sequence one token at a time, the Transformer processes all tokens of a sequence simultaneously, so its computations map naturally onto a GPU’s parallel cores. As a result, not only can a Transformer be trained much faster on a GPU, but the increase in speed also allows it to draw on much larger datasets than before.

The benefits of using GPUs do not stop at training. Because Transformer models are enormous, often comprising hundreds of millions or even billions of parameters, inference likewise requires a lot of computational power. Although Transformers can be used with a CPU, the current trend of building bigger models on more data makes the GPU a practical necessity for inference as well.

Generally, GPUs will speed up all Transformer-based components. In Haystack, these include:

  • All Readers
  • DensePassageRetriever
  • EmbeddingRetriever
  • Summarizer
  • Generator
  • Translator
  • Ranker

How Can I Use GPUs with My Haystack Question Answering Pipeline?

We’ll now cover two options for GPU providers that you can use to accelerate your workflow.

Google Colaboratory

Google Colaboratory is a Jupyter Notebook environment created by the Google Brain research team to facilitate machine learning development. Google Colab’s most noteworthy feature is that it provides a Tesla K80 GPU for your use, completely free of charge!

There’s no setup involved. You’re assigned a decent amount of storage (the amount fluctuates depending on usage and availability) and a notebook that comes preinstalled with Python’s essential machine learning libraries. To use a GPU, head over to “Runtime” → “Change runtime type” in the menu and select “GPU” as your hardware accelerator.

There’s one small catch: to ensure that Colab’s resources are used responsibly and remain free to use, each session is limited to about 12 hours of runtime. Once your time is up, all of your files and local variables are deleted.
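
If you need files to survive beyond a session, one common workaround is to mount your Google Drive from inside the notebook and save your work there:

from google.colab import drive

# Files written under /content/drive are stored in your Google Drive
# and persist after the Colab session ends
drive.mount("/content/drive")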

To get a better feel for what development looks like in Google Colab, check out our tutorial where we use the environment to implement a question answering (QA) system.

Amazon Web Services (AWS)

If you’re looking for a longer-term solution, we suggest a paid option with AWS.

With AWS, it takes less than five minutes to commission a virtual server with a GPU. If you opt for this route, make sure you choose an instance with a configuration intended for deep learning. As of this writing, we recommend using a g4dn instance with an Ubuntu Deep Learning AMI. This way, all major deep learning frameworks and GPU drivers will come preinstalled so you won’t have to manage any of the dependencies yourself.

Once you’ve commissioned an instance, you can connect to it via SSH and treat it as if it were your local machine.

How to Check If Your GPU Is Running

Regardless of your choice of GPU provider, you’ll want to make sure that you indeed have access to a GPU before you start coding.

The best way to check is to run NVIDIA’s GPU monitoring tool by typing nvidia-smi into your machine’s terminal.

The tool prints out various statistics about your GPU. Most importantly, in our case it reports a Tesla T4 GPU with 15,109 MiB of memory, none of which is currently in use.

Memory size is the number one consideration when choosing a GPU. It should be large enough to hold the model and the data passing through it, otherwise you’ll run into out-of-memory errors. Memory consumption is dictated by the size of the model and the number of samples in a batch, and it’s much higher for training than for inference. Unless you’re fine-tuning a model in Haystack, you’ll usually only be doing inference. For that, a GPU with between 8 and 16 GB of memory is ideal.

Additionally, you can check for the availability of a GPU with PyTorch, which Haystack builds on top of. Running the code below in your Python interpreter should return True if PyTorch can see a CUDA-capable GPU with compatible drivers.

import torch
torch.cuda.is_available()
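
If the check returns True, you can also ask PyTorch for the device’s name and memory, which is a quick way to confirm that you’re getting the GPU you expect:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # e.g. "Tesla T4" with roughly 15,000 MiB of total memory
    print(props.name, f"{props.total_memory / 1024**2:.0f} MiB total")
    print(f"{torch.cuda.memory_allocated(0) / 1024**2:.0f} MiB currently allocated by PyTorch")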

How to Connect Your Haystack Components to a GPU

After ensuring that you have a GPU and that the correct drivers are installed, you can instruct your Haystack components to use the GPU by way of the use_gpu parameter:

dpr_retriever = DensePassageRetriever(document_store=document_store, use_gpu=True)
farm_reader = FARMReader(model_name_or_path=model, use_gpu=True)
transformers_reader = TransformersReader(model_name_or_path=model, use_gpu=2)

Notice that for both dpr_retriever and farm_reader, we enable use_gpu by setting it to True, which makes each component use the first available GPU.

In the case of transformers_reader, however, we assigned an integer value to use_gpu. This is useful if you have multiple GPUs available and want to instruct the reader to use a particular one by providing that GPU’s ordinal number.
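
If you’re not sure which ordinal refers to which card, you can list the GPUs that PyTorch can see:

import torch

# Print each GPU's ordinal alongside its name, so you know which value to pass to use_gpu
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))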

As another sanity check, Haystack logs a message when you instantiate your components, confirming which device (CPU or GPU) they will run on.

Now that we’ve explained how to incorporate GPUs into a Haystack pipeline, we’ll go on to demonstrate their potential impact on your system’s speed.

How Do GPUs Impact My System's Performance?

For our demonstration, we’ll be building on top of the system implemented in this article, where we built a question answering pipeline using a dataset of texts about the Harry Potter franchise. For the sake of brevity, we won’t be reimplementing it here.

We initially came up with several questions to test the system’s knowledge:

questions = ["What's Fleur's last name?",
             "What is Fleur Delacour's nationality?",
             "What is the name of Fleur Delacour's sister?",
             "How old is Gabrielle Delacour?",
             "What house does Fleur Delacour belong to?"]

How long it takes to run inference on these questions will depend on your choice of hardware. As we’ve outlined earlier, CPU runtimes tend to be significantly slower than those for GPUs — but let’s measure how much slower exactly. We’ll get our pipeline to run on a CPU by passing use_gpu=False to the reader’s and the retriever’s constructors.

cpu_reader = FARMReader(model_name_or_path=my_model, return_no_answer=True, use_gpu=False)
cpu_retriever = DensePassageRetriever(document_store=document_store, use_gpu=False)
cpu_pipe = ExtractiveQAPipeline(cpu_reader, cpu_retriever)

Next, we’ll pass our questions into the new pipeline:

answers = []
for q in questions:
    answer = cpu_pipe.run(q, top_k_retriever=100, top_k_reader=1)
    answers.append(answer)
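
To take a quick look at what came back, you can iterate over the results. The exact structure of the output depends on your Haystack version; the sketch below assumes the 0.x format used throughout this post, where run() returns a dictionary containing an "answers" list:

# Print the top answer for each question
for question, result in zip(questions, answers):
    top_answer = result["answers"][0]
    print(question, "->", top_answer["answer"])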

In the next sections, we’ll be looking at the time statistics for all queries performed so far by calling the print_time() method on the different components in our pipeline.

Readers

We’ll first print the time for the GPU reader that we implemented in the original post:

reader.print_time()
>>> Queries Performed: 5
>>> Query time: 28.636669066999957s
>>> 5.727333813399992 seconds per query

And now for the CPU reader implemented above:

cpu_reader.print_time()
>>> Queries Performed: 5
>>> Query time: 134.99792738300016s
>>> 27.000585476600032 seconds per query

Running the reader on a GPU takes about six seconds per query, whereas on a CPU it takes 27 seconds, almost five times as long! You’ll experience a significant speedup when using GPUs with Haystack readers, since all readers employ transformer-based language models.

Retrievers

Let’s now print the runtimes for the retrievers. First, for the GPU retriever:

retriever.print_time()
>>> Queries Performed: 5
>>> Query time: 0.850802140000269s
>>> 0.1701604280000538 seconds per query

And then for the CPU retriever:

cpu_retriever.print_time()
>>> Queries Performed: 5
>>> Query time: 1.0323837620001086s
>>> 0.20647675240002172 seconds per query

The GPU again outperforms the CPU, but this time only marginally. While this speed differential may not seem like much, it adds up when we start scaling up the number of queries. Therefore, employing a GPU is always a good idea when your pipeline includes a transformer-based retriever.

Aside from querying, GPUs also speed up indexing. Indexing is the process by which retrievers make a document collection searchable. Dense retrievers like the DensePassageRetriever run each document through a pre-trained neural network during indexing, a parallelizable operation that benefits from a GPU. Sparse retrievers like TF-IDF or BM25, on the other hand, don’t rely on neural networks, so they gain nothing from GPU parallelization.

Let’s compare how long dense indexing takes on a GPU compared to a CPU. First, the GPU:

import time

# Time how long it takes to compute and store embeddings for the whole document collection
start = time.time()
document_store.update_embeddings(retriever)
end = time.time()
print((end - start) / 60)
>>> 20

And now the CPU:

start = time.time()
document_store.update_embeddings(cpu_retriever)
end = time.time()
print((end - start) / 60)
>>> 210

Indexing on a GPU took 20 minutes, significantly quicker than the 210 minutes it took on a CPU. (These timings are for a database of 50,000 documents.) In practice, your knowledge base might be much larger, in which case indexing on a CPU may not be a feasible option at all.

As we’ve shown, using GPUs can greatly accelerate your workflow. While working with GPUs is practically unavoidable during training, they’re also often required to help mitigate bottlenecks caused by slow inference times.

Accelerate Your Haystack Pipeline

In this post, we explained the benefits of using a GPU and demonstrated how you can easily incorporate one into your Haystack question answering pipeline. In fact, we advise all Haystack users to work on a GPU to get the most out of our framework.

Time to head over to our GitHub repository for more information on how to get started with Haystack! :)