Indexing Documents for Large Scale Question Answering Systems

If you're building your own Haystack question answering system, you'll need to know how to index your documents into the document database properly.

07.10.21

In our previous tutorials, we covered how to build a Haystack question answering (QA) system and how to deploy a QA system as a REST API.

The examples in those tutorials rely on knowledge bases that are ready to query or built from homogeneous data. But if you're building your own Haystack question answering system and putting it into production, you'll likely need a diverse, domain-specific knowledge base. Luckily, Haystack provides cleaning, splitting, and indexing tools that make working with text data a breeze.

Keep reading for the details on how to index your document database and incorporate it into your Haystack QA system.

Working with Databases and DocumentStores

Let’s first recap where databases fit within a Haystack question answering system.

Say you’ve used Haystack to deploy a QA system as a REST API. Whenever someone queries your QA system, the system’s HTTP server receives the HTTP request and forwards it to Haystack. Haystack then parses the request, extracts the question, and passes it to the question answering pipeline. The pipeline, upon receiving the query, consults a database to find an answer. In Haystack, we interact with this database through a DocumentStore.

A DocumentStore is a repository of all the resources that your QA system needs to answer a question. These include text documents and their corresponding metadata. Every Haystack QA system requires a DocumentStore and a database. However, Haystack’s DocumentStore only serves as an interface to the database; the database itself is separate from the Haystack framework.

For example, you might connect your DocumentStore to an SQLite database running on your local machine, or you might connect it to an Elasticsearch instance running as a service on the cloud. You can find the list of supported database solutions at our DocumentStore documentation page.
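To make this concrete, here's a minimal sketch of swapping backends behind the same interface. It assumes a Haystack 0.x install where SQLDocumentStore and ElasticsearchDocumentStore live under haystack.document_store (as in the examples later in this post); the file name and host below are placeholders.

from haystack.document_store import SQLDocumentStore, ElasticsearchDocumentStore

# A lightweight, file-based SQLite backend on your local machine (placeholder file name)
local_store = SQLDocumentStore(url="sqlite:///qa.db")

# An Elasticsearch backend, e.g. an instance running as a cloud service (placeholder host)
remote_store = ElasticsearchDocumentStore(host="my-elasticsearch-host", port=9200)

The rest of your QA pipeline code stays the same no matter which backend the DocumentStore talks to.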

Document Indexing Tools

Haystack provides several tools to help you index documents into your DocumentStore. We’ll cover some of the most commonly used tools below.

PreProcessor

Unprepared data can be detrimental to your QA system’s performance. As a solution, we created the PreProcessor for routine text preparation tasks like cleaning text, removing whitespace, and splitting lengthy documents into smaller, more manageable units. The PreProcessor ensures a unified Python dictionary format across all of your documents and allows your Retrievers and Readers to make the most of your data. You’ll always want to run your documents through a PreProcessor before loading them into a DocumentStore.
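For a sense of what that looks like in practice, here's a minimal sketch with a made-up document; the import path and parameters mirror the ones used later in this tutorial.

from haystack.preprocessor.preprocessor import PreProcessor

# A toy document in Haystack's dictionary format
doc = {"text": "A very long text about climate action. " * 200, "meta": {"name": "toy_doc"}}

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",       # split long documents into 100-word chunks
    split_length=100,
)
split_docs = preprocessor.process([doc])
print(len(split_docs))     # one long document becomes several smaller ones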

Web Crawler

If you’re looking to enrich your DocumentStore with additional documents, you can do so with our Crawler. The Crawler creates documents from URLs and saves the scraped information into a directory. The Crawler automatically converts the scraped documents into the format supported by DocumentStores. This means that you can load the documents directly into your DocumentStore, or you can run them through the PreProcessor for further preprocessing before loading.

File Converters

Documents come in many formats, so we created Converters to extract text from all of them. Whether you're working with .pdf, .txt, or .docx files, a Converter will parse them and transform them into the structure that's required for working with DocumentStores.
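As a rough sketch, converting a .txt and a .docx file might look like the following. The file names are placeholders, and we're assuming your Haystack version exposes TextConverter and DocxToTextConverter alongside the PDFToTextConverter used later in this post.

from haystack.file_converter import TextConverter, DocxToTextConverter

# Placeholder file names; substitute your own documents
txt_doc = TextConverter().convert(file_path="notes.txt", meta=None)
docx_doc = DocxToTextConverter().convert(file_path="report.docx", meta=None)

# Each converter returns a dictionary with the extracted text and metadata
print(txt_doc["text"][:100])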

How to Index Documents into Production Systems

The following sections cover some of the most common methods for indexing documents into a DocumentStore.

Method 1: Document indexing using Haystack scripts

Initialize a database

Before you can index your documents into a DocumentStore, you’ll need to create a database. The simplest way to initialize a database is by spinning up its Docker image. For example, to initialize an Elasticsearch database with Docker, we execute the following commands in the command line:

$ docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
$ docker run -d -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2

You can use Docker to launch Milvus and Weaviate instances in much the same way as the Elasticsearch instance above. Alternatively, you can call Haystack's utility functions such as launch_es() or launch_milvus(), which take care of launching the database's Docker container for you; the end result is the same. This requires only two lines of code:

from haystack.utils import launch_es
launch_es()

Keep in mind that databases run independently of Haystack. Further down in this tutorial, we’ll use Haystack scripts to interact with our Elasticsearch database. However, because the Elasticsearch instance runs on its own, it will continue running even after Haystack scripts finish executing. This means that your database will retain the documents you write into it until you explicitly delete them.

Connect your database to a DocumentStore

Once you’ve launched a database, you’ll need to connect it to a DocumentStore. As an example, you could connect a DocumentStore to an Elasticsearch instance using the following code:

from haystack.document_store import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore()

By default, ElasticsearchDocumentStore connects to localhost on port 9200, but you can adjust these parameters as necessary. Check out the documentation for more information on initialization options.
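For instance, if your Elasticsearch instance runs on a different host or requires authentication, you could pass the connection details explicitly. The parameter names below follow ElasticsearchDocumentStore's initializer; the values are placeholders.

from haystack.document_store import ElasticsearchDocumentStore

# Placeholder connection details; adjust them to match your own Elasticsearch deployment
document_store = ElasticsearchDocumentStore(
    host="my-elasticsearch-host",
    port=9200,
    username="elastic",
    password="changeme",
    index="document",   # the index Haystack writes documents into
)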

To initialize a DocumentStore and connect it to your Milvus instance, you could likewise execute the code below:

from haystack.document_store import MilvusDocumentStore
document_store = MilvusDocumentStore()

Check out our DocumentStore documentation page for more information on launching different databases and connecting them to your document store.

Index your documents into a DocumentStore

Once you've initialized a document store, you'll need to fill it with documents. This crucial step gives your QA system the knowledge base it needs to answer questions. To index documents into a DocumentStore, we'll use the Haystack tools introduced above: the Crawler, the Converters, and the PreProcessor.

We want to emulate a real-world scenario, where data is usually messy and comes in different formats and from multiple sources. Thus, we’ll create our own dataset by scraping data from Harvard and MIT websites. We’ll focus on the sections of the website that describe the universities’ climate action plans.

First, we'll use the Crawler to retrieve documents from the MIT climate action plan webpage.

We’ll start by creating a directory where the Crawler will store the scraped documents:

$ mkdir crawled_files

We'll store the webpage's URL in a variable called url and let the Crawler do all the work. We only need to pass it the path of the output directory, as well as the crawler_depth parameter, which tells the Crawler how many levels of sub-links it should follow from the initial list of URLs.

from haystack.connector import Crawler
url = "https://climate.mit.edu/climateaction/fastforward"
crawler = Crawler(output_dir="crawled_files", crawler_depth=1)
crawled_docs = crawler.crawl(urls=[url])

Let's have a quick look at the collected documents:

len(crawled_docs)

>>> 17

crawled_docs[:3]

>>> [PosixPath('crawled_files/climate.mit.edu_.json'),
PosixPath('crawled_files/climate.mit.edu_what-can-be-done-about-climate-change.json'),
PosixPath('crawled_files/climate.mit.edu_explainers.json')]

Our Crawler collected 17 documents, stored them as JSON files, and saved the file paths into a variable. The paths are PosixPath objects from Python's pathlib library, which makes it easier to work with files, file paths, and directories.

Seventeen files is not bad, but we should add more variety to our collection. Let's expand our dataset with the text from the Harvard climate action plan handout. We'll run the following command to download the PDF:

$ wget -O "Harvard_climate_action_plan.pdf" \
https://green.harvard.edu/sites/green.harvard.edu/files/Harvard%20climate%20action%20plan%20handout.pdf

Next, we'll run the Converter to extract the text from the PDF file. We'll set the remove_numeric_tables parameter to True to remove numeric rows from tables, a simple heuristic that improves the quality of the extracted text. Additionally, by setting the valid_languages parameter to "en", we instruct the Converter to extract only text in English.

from haystack.file_converter import PDFToTextConverter
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
pdf_doc = converter.convert(file_path="Harvard_climate_action_plan.pdf", meta=None)

Let’s inspect the converted document:

pdf_doc

>>> {'text': 'HARVARD UNIVERSITY CLIMATE ACTION PLAN\nA FOSSIL FUEL-FREE FUTURE\n\nHarvard researchers are tackling climate change by helping us better understand the scope of its effects and generating promising new solutions. On campus, the University community is also taking action.\nIn 2016, Harvard achieved its 10-year goal to reduce on-campus greenhouse gas emissions by...}

That looks right!

Our collection now consists of 18 documents. However, these documents come in two different formats: JSON and PDF. Before we can load them into a DocumentStore, we need to bring them into the unified Python dictionary format that DocumentStores require. The Converter already outputs this format; the crawled files require a bit more work.

As we saw earlier, the Crawler outputs a list of paths pointing to the documents that it collected. We’ll use these paths to load our JSON documents, convert them into Python dictionaries, and save them all into a Python list.

import json

# Each crawled_doc is a PosixPath pointing to one of the JSON files on disk
json_files = [json.loads(crawled_doc.read_text()) for crawled_doc in crawled_docs]

Now that both the Harvard and MIT files are in the same format, we’ll concatenate them into one list:

# Note that pdf_doc is not a list so we need to
# enclose it within square brackets
concatenated_docs = json_files + [pdf_doc]

We can now run our documents through the PreProcessor.

from haystack.preprocessor.preprocessor import PreProcessor

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True
)

docs = preprocessor.process(concatenated_docs)

At last, we can index our documents into a DocumentStore:

document_store.write_documents(docs)

Just like that, we’ve managed to create a collection of documents from disparate sources, transform them into a common format, and index them into a DocumentStore. 
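As a quick sanity check, you can ask the DocumentStore how many documents it now holds (the exact count depends on how the PreProcessor split your files). The get_document_count() and delete_all_documents() methods are available on Haystack 0.x DocumentStores:

# How many (split) documents ended up in the DocumentStore?
print(document_store.get_document_count())

# To start over, clear the index before re-indexing
# document_store.delete_all_documents(index="document")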

Method 2: Define an indexing pipeline through YAML

You can also define your indexing pipeline via your pipeline’s .yaml configuration file. In fact, YAML lets you define your pipeline’s behavior, from how components are ordered and traversed by queries, to how documents are indexed. For more information on working with YAML files, check out our tutorial on configuring Haystack QA pipelines with YAML.

As an example of a YAML indexing pipeline, we’ll create a new file called pipelines.yaml and define the PreProcessor and the Converters under the components key:

version: '0.9'
components:    # define all the building-blocks for the Pipelines
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
  - name: Retriever
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore    
      top_k: 5
  - name: Reader      
    type: FARMReader   
    params:
      model_name_or_path: deepset/roberta-base-squad2
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 100
  - name: FileTypeClassifier
    type: FileTypeClassifier

Our YAML indexing pipeline differs significantly from the system presented in Method 1 (document indexing using Haystack scripts) above. This is because the YAML pipeline imposes certain constraints:

  1. Our updated system does not include the Crawler. Currently, Haystack doesn’t support the Crawler for indexing pipelines, so you’ll have to run it outside of the pipeline.
  2. We included a FileTypeClassifier. This component classifies the files based on file extension and passes them on to the appropriate converters: PDF files go to PDFToTextConverter, txt files go to TextFileConverter, and so forth. This is crucial for working with different file types.

Now that we’ve defined the individual components, we can construct the indexing pipeline. Let’s take a look at the configuration below:

pipelines:
  - name: query    # an extractive-qa Pipeline
    type: Query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing      # an indexing Pipeline
    type: Indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFFileConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [PDFFileConverter, TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Note the sequential ordering of the components: all files that enter the pipeline first go through the FileTypeClassifier. The classifier assigns each file to the corresponding Converter, which in turn passes the converted files to the PreProcessor. Finally, when the PreProcessor finishes its task, it sends the files to the DocumentStore.

With our indexing pipeline fully defined, we can now load it:

from haystack.pipeline import Pipeline

indexing_pipeline = Pipeline.load_from_yaml("haystack/rest_api/pipeline/pipelines.yaml", pipeline_name="indexing")

To index the DocumentStore, simply call the run() method and pass it a list of file paths. Note that the file paths need to be of the pathlib.PosixPath type in order for the file type classifier to work.
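The pdf_file_paths, txt_file_paths, and docx_file_paths variables below stand for lists of paths to your own files; for example, you might collect them with pathlib from a placeholder directory. Keep in mind that the YAML above only wires up converters for text and PDF files, so routing .docx files would additionally require a DocxToTextConverter node.

from pathlib import Path

# Placeholder directory containing the files you want to index
doc_dir = Path("my_documents")
pdf_file_paths = list(doc_dir.glob("*.pdf"))
txt_file_paths = list(doc_dir.glob("*.txt"))
docx_file_paths = list(doc_dir.glob("*.docx"))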

indexing_pipeline.run(file_paths=pdf_file_paths)
indexing_pipeline.run(file_paths=txt_file_paths)
indexing_pipeline.run(file_paths=docx_file_paths)

And that’s it! We’ve successfully created an indexing pipeline through YAML.

Method 3: Use REST API for continuous indexing of documents

You can also index individual documents through the REST API by passing your files to the API's POST /file-upload endpoint. The endpoint sends the files to the indexing pipeline specified in the INDEXING_PIPELINE_NAME environment variable. Note that if you're using the YAML configuration from above, you'll need to adjust the docker-compose.yml file located in Haystack's installation folder to use that particular configuration:

# Mount custom YAML Pipeline
volumes:
    - ./rest_api/pipeline:/home/user/rest_api/pipeline

By default, the docker-compose.yml file uses a ready-to-query Elasticsearch DocumentStore based on Game of Thrones texts. You'll need to comment out or delete the line pointing to the Game of Thrones image and add a line that starts an empty Elasticsearch instance instead:

elasticsearch:
    # image: "deepset/elasticsearch-game-of-thrones"
    # This will start an empty Elasticsearch instance (so you have to add your documents yourself)
    image: "elasticsearch:7.9.2"

After launching the REST API, we can index the Harvard climate action plan handout that we downloaded earlier:

$ curl -X 'POST' \
  'http://127.0.0.1:8000/file-upload' \
  -H 'accept: application/json' \
  -H 'enctype: multipart/form-data' \
  -F 'files=@Harvard.pdf' \
  -F 'type=application/pdf'
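If you'd rather trigger indexing from Python, say from a scheduled job, a rough equivalent of the curl call above could look like this. It's a sketch using the requests library and assumes the API is listening at the same address:

import requests

# Upload the handout we downloaded earlier to the running Haystack REST API
with open("Harvard_climate_action_plan.pdf", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:8000/file-upload",
        files={"files": f},   # same form field as in the curl command above
    )
response.raise_for_status()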

Let’s try and query the pipeline:

$ curl --request POST --url 'http://127.0.0.1:8000/query' -H "Content-Type: application/json"  --data '{"query": "By when does Harvard plan to be carbon neutral?"}'

And here’s the answer:

{
  "query": "By when does Harvard plan to be carbon neutral?",
  "answers": [
    {
      "answer": "2026",
      "question": null,
      "score": 0.7068840265274048,
      "probability": null,
      "context": "2016\nEnergy-efficient buildings and rooftop solar\nFossil fuel-neutral by 2026\n100% renewable regional electric grid\nFossil fuel-free district energy s",
      "offset_start": 73,
      "offset_end": 77,
      "offset_start_in_doc": 5271,
      "offset_end_in_doc": 5275,
      "document_id": "6b715b6e4b36397aaeba0e9fb7011975",
      "meta": {
        "_split_id": 0,
        "name": "Harvard.pdf"
      }
...
}

Method 4: Index documents by directly adding them to the database

Lastly, you can index documents by directly interacting with the databases through their Python interface. However, this would require you to manually wrangle the data into the format required by a DocumentStore:

dicts = [
    {
        'text': DOCUMENT_TEXT_HERE,
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]

The process is unnecessarily laborious and error-prone. You’re better off going with one of the safer methods detailed above.

Deploy Haystack Question Answering System with Your Own Dataset

In this tutorial, you learned four methods for indexing documents in a production Haystack QA system. If you run into any difficulties implementing them, the Haystack team is more than happy to assist you. Reach out to us by joining our Discord community or starting a discussion on GitHub.

And if you enjoy using Haystack, please give our GitHub repository a star!