Automating Information Extraction with Question Answering

How to use Haystack-based semantic question answering systems to extract information from a collection of financial statements.

19.01.22

Thanks to Transformer-based language models and semantic search, modern natural language processing systems have an almost uncanny ability to answer questions. This lets developers interact with data in a highly intuitive way: by asking questions in natural language.

This article demonstrates how extracting information such as a company’s earnings amounts to posing a question like “What are the company’s earnings?” to a question answering (QA) system. Keep reading as we show you how to use the Haystack NLP framework to build an information extraction pipeline and apply it to a corpus of financial statements.

What Is Question Answering for Information Extraction?

Information extraction (IE), as the name suggests, refers to the process of distilling a large amount of unstructured text data into its most important components. Invoices, application forms, patient records, and many other types of documents all contain a lot of important information. But not all of it is useful, and manually combing through a document set to find one specific fact is labor-intensive, time-consuming, and error-prone. When it comes to automation, simple technologies like regular expressions and rule-based approaches that rely on filling predefined templates have been state-of-the-art for a long time. That is, until now.

Question answering lets us employ natural language to interact with our textual or tabular data. Rather than relying on complex regex patterns that may or may not capture the information that we’re looking for, QA allows us to directly ask a question like, “What’s the patient’s diagnosis?” and receive the relevant information. Interacting with data in this way is extremely flexible: you don’t need to match the exact words in the text, and differently worded questions can find the same information. Question answering is also powerful because virtually anyone can use it, regardless of technical background. Our open source framework Haystack provides all the tools to build flexible and scalable question answering systems.
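To see why hand-written patterns are brittle, consider the small sketch below. It uses only Python’s re module. The pattern and both sample sentences are invented for illustration and do not come from any real patient record.

```python
import re

# Hypothetical pattern written for one specific phrasing of a diagnosis.
DIAGNOSIS_PATTERN = re.compile(r"Diagnosis:\s*(?P<diagnosis>[^.]+)")

# Two invented sentences that state the same fact in different words.
records = [
    "Diagnosis: type 2 diabetes. Follow-up in six weeks.",
    "The patient was found to have type 2 diabetes.",
]


def extract_diagnosis(text):
    """Return the captured diagnosis, or None when the pattern does not match."""
    match = DIAGNOSIS_PATTERN.search(text)
    return match.group("diagnosis").strip() if match else None


results = [extract_diagnosis(record) for record in records]
print(results)
```

The pattern finds the diagnosis in the first sentence but returns None for the second, even though both contain the same fact. A QA system is asked the same question in both cases and does not depend on a fixed phrasing.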

Benefits of Automating Information Extraction

Most business processes generate data. As our digital footprints grow larger, it becomes increasingly difficult for companies to manage all of the accumulated data. Enter automated information extraction, which can significantly reduce the manual effort required to preserve, understand, and organize large text collections.

Ad hoc information extraction

You can use automated information extraction on an ad hoc basis. For instance, you might have a set of documents that you want to get a general idea of. Haystack’s Summarizer node can be used for a similar purpose, but question answering gives you more control over the information you receive. To use QA for information extraction, you simply need to come up with your questions, pass each document through a question answering pipeline, and retrieve the answers to quickly learn more about the documents.

Say you want to get approval for selling a product and you need to learn the compliance requirements. This would be a great one-off task for question answering. You would come up with a set of relevant questions to update you on the guidelines, feed those questions into your QA system, and receive a digest of the compliance documents.

Automated information extraction workflows

Automating your information extraction workflows is especially useful if you have to routinely review many documents to find answers to the same questions. For instance, you may have a large collection of invoices from which you need to extract supplier information and the amount charged. Depending on the complexity of questions, the tolerance for incorrect or missed answers, and the documents’ domain, you may be able to partly, if not fully, automate your information extraction workflow.

If you’re working with a highly specific domain, like medicine, and your questions turn out to be too challenging for your QA system, you can opt for partial automation. In this scenario, you automate the questions that your system processes correctly, and resort to manual answering for the ones that it misses. Even partial automation can save you time and provide a significant productivity boost.

On the other hand, if you’re working with plain language texts, you can reap all the benefits of question answering for information extraction by fully automating your IE workflow! The next section will show you one way of doing this as we work on a corpus of financial statements.

Information Extraction with Question Answering: A Practical Example

In this tutorial, you’ll be working with the Adidas 2020 annual report. The document consists of several hundred pages of information about the company’s financial, social, and environmental performance. The code examples in this tutorial rely on Haystack v1.0.0.

Problem setting

Let’s say you’re an investment analyst who has been tasked with investigating a company’s financial health and legal situation. In other words, you need to conduct due diligence on the company. You would ask questions like:

  • What are the company’s earnings?
  • What risks is the company exposed to?
  • What does the capital structure look like?

We’ll show you how easy it is to extract information with Haystack. All you’ll need to do is come up with the right set of questions and your question answering system will do the rest. But first, you’ll need to index your documents into a database.

Preprocessing and indexing documents

The report you will be working with is in the form of a PDF file, so start by creating a PDFToTextConverter. This node extracts the textual information from the PDF and saves it into a Python list. 

Set the remove_numeric_tables parameter to True, which uses heuristics to drop table rows that consist mostly of numbers. This is fine to do since the report does not contain many tables. Where this is not the case, you may want to use Haystack’s newly implemented AzureConverter, which is optimized for working with tabular data.

from haystack.nodes import PDFToTextConverter

pdf_converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])

converted = pdf_converter.convert(file_path="2020_Annual_report.pdf", meta={"company": "Company_1", "processed": False})

Note that you’re adding metadata to the converted document. Once you index this document (as well as all subsequent documents) into a database, the metadata will allow you to filter them. For instance, by filtering on the ‘company’ metadata, you can restrict a question to documents concerning ‘Company_1’ and ‘Company_3’ rather than all documents in the database.

Now you’ll build a PreProcessor to split your documents into chunks of 200 words with an overlap of ten words.

from haystack.nodes import PreProcessor

preprocessor = PreProcessor(split_by="word",
                            split_length=200,
                            split_overlap=10)

preprocessed = preprocessor.process(converted)
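To build some intuition for what word-based splitting with overlap does, here is a minimal pure-Python sketch. It is not Haystack’s implementation: the real PreProcessor also handles cleaning and other options, so treat the split_by_word function and the synthetic sample text as illustrative assumptions only.

```python
def split_by_word(text, split_length=200, split_overlap=10):
    """Split text into windows of split_length words that share split_overlap words."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic text of 450 distinct "words": w0, w1, ..., w449.
sample = " ".join(f"w{i}" for i in range(450))
chunks = split_by_word(sample)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk starts with the last ten words of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk when it is short enough.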

With this out of the way, you can now index your documents into an Elasticsearch database.

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
document_store.delete_all_documents()
document_store.write_documents(preprocessed)

You’re now ready to build a question answering pipeline.

Building a Question Answering Pipeline

Question answering pipelines consist of two main components — the Retriever and the Reader. You can initialize these with only a few lines of code:

from haystack.nodes import DensePassageRetriever, FARMReader

retriever = DensePassageRetriever(document_store=document_store)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

Once the retriever is initialized, you also need to update your database with dense vector embeddings. This takes a single line of code:

document_store.update_embeddings(retriever)

With both your retriever and reader ready, you can combine them in a pipeline.

from haystack.pipelines import ExtractiveQAPipeline

pipeline = ExtractiveQAPipeline(reader, retriever)

It’s time to put your QA system to use.

Information Extraction with Question Answering

Before you can use your pipeline, you’ll need to come up with some questions. Let’s create a few below:

questions = ["How high is shareholders equity?",
             "What are the major risks?",
             "What is the number of shares outstanding?",
             "How high is short term debt?"]

Next, iterate over the questions and feed them into your pipeline.

Set the top_k parameters to 50 and 1 for the retriever and the reader, respectively. A top_k value of 50 for the retriever is comparatively high and may slow down a question answering system with many active users. For a QA system in production, the higher speed achieved by decreasing the top_k parameter is generally worth a small loss in accuracy. However, because right now your information extraction system probably doesn’t have to serve hundreds of users and you’ll want the returned results to be as accurate as possible, you’re better served by turning the top_k value up. As for the reader, you’ll only be interested in the single highest-ranking answer, so set its top_k value to 1.

answers = []

for question in questions:
    prediction = pipeline.run(query=question,
                 params={"Retriever": {"top_k": 50},
                         "Reader": {"top_k": 1}})
    answers.append(prediction)

For each question passed through the pipeline, it returns a dictionary that contains both the query and the answers. The loop above collected these dictionaries in a list, so to print the results, simply iterate over the list and access the dictionary’s values via their corresponding keys.

for answer in answers:
    print("Q:", answer["query"])
    print("A:", answer["answers"][0].answer)
    print("\n")

>>> Q: How high is shareholders equity?
A: 6.454billion

Q: What are the major risks?
A: continuously overlooking new trends and failing to continuously introduce and successfully commercialize new product innovation

Q: What is the number of shares outstanding?
A: 195,066,060

Q: How high is short term debt?
A: 686 million

The results look right, and they have been validated for correctness. And just like that, you have extracted answers from a 300-page document in a matter of seconds. You can convert these results into a pandas dataframe, or store them in a database for further analysis and processing.
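As one possible way to keep the results for later analysis, the sketch below stores question and answer pairs in a SQLite database using Python’s built-in sqlite3 module. The table name, column names, and the in-memory database are assumptions made for this example; the four records are copied from the output shown above.

```python
import sqlite3

# Question/answer pairs copied from the output shown above.
records = [
    ("How high is shareholders equity?", "6.454billion"),
    ("What are the major risks?",
     "continuously overlooking new trends and failing to continuously "
     "introduce and successfully commercialize new product innovation"),
    ("What is the number of shares outstanding?", "195,066,060"),
    ("How high is short term debt?", "686 million"),
]

# In-memory database for the example; pass a file path to persist the data.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE answers (company TEXT, question TEXT, answer TEXT)")
connection.executemany(
    "INSERT INTO answers (company, question, answer) VALUES (?, ?, ?)",
    [("Company_1", question, answer) for question, answer in records],
)
connection.commit()

# Look up a single answer by its question.
row = connection.execute(
    "SELECT answer FROM answers WHERE question = ?",
    ("How high is short term debt?",),
).fetchone()
row_count = connection.execute("SELECT COUNT(*) FROM answers").fetchone()[0]
print(row[0], row_count)
```

Storing the company name next to each answer makes it straightforward to compare the same question across several reports later on.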

Metadata Filtering

If you’re working with an incoming stream of documents from which you regularly extract information, ensure that you don’t process the same document more than once. You probably don’t want to build a fresh document store for each new batch of documents that you receive; after all, you never know when you might want to revisit the documents you’ve already processed. At the same time, you don’t want to waste time processing and filtering out irrelevant documents. Luckily, you can select the relevant subset of documents with metadata filtering.

Earlier in the tutorial, during the preprocessing step, we added a piece of metadata (meta = {"company": "Company_1"}) to our document. By adding the company’s name to the documents’ metadata during indexing, you can query the documents pertaining to your company of interest. Creating a filter is as easy as defining a Python dictionary:

filter = {"company": ["Company_1", "Company_3", "Company_5"]}

To apply the filter, pass it to the retriever through the params argument of the pipeline’s run() method:

prediction = pipeline.run(query=question,
             params={"Retriever": {"top_k": 50, "filters": filter},
                     "Reader": {"top_k": 1}})

As the pipeline runs, it will only consider the documents whose ‘company’ metadata value passes the filter.

You can also make use of the ‘processed’ metadata value. Recall that you initially set it to False. When working with an incoming flow of documents, it’s useful to tag them after they have been analyzed. You can achieve this by changing the ‘processed’ value from False to True, and then exclude these documents from future analysis by adding the relevant condition to your filter.
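The selection logic behind such a filter can be sketched in plain Python. This is only an in-memory illustration of the idea, assuming that a document passes when each filtered field holds one of the allowed values. In Haystack the filtering happens inside the document store, and the passes helper and the sample documents below are hypothetical.

```python
# Invented sample documents with the same metadata fields used in this tutorial.
documents = [
    {"content": "Report of company one", "meta": {"company": "Company_1", "processed": False}},
    {"content": "Report of company two", "meta": {"company": "Company_2", "processed": False}},
    {"content": "Report of company three", "meta": {"company": "Company_3", "processed": True}},
]


def passes(meta, filters):
    """Return True when every filtered field holds one of its allowed values."""
    return all(meta.get(field) in allowed for field, allowed in filters.items())


# Only unprocessed documents of the companies of interest should be selected.
filters = {"company": ["Company_1", "Company_3", "Company_5"], "processed": [False]}
selected = [doc for doc in documents if passes(doc["meta"], filters)]
selected_companies = [doc["meta"]["company"] for doc in selected]

# Tag the selected documents so that the same filter skips them next time.
for doc in selected:
    doc["meta"]["processed"] = True

remaining = [doc for doc in documents if passes(doc["meta"], filters)]
print(selected_companies, len(remaining))
```

After the first pass, running the same filter again selects nothing, which is the behavior you want when new documents keep arriving in the same document store.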

Learn More About Haystack

In this article, you learned how to use Haystack’s question answering capabilities to aid information extraction workflows. Modern NLP systems contain many moving parts and building each from scratch is not easy.

Luckily, Haystack provides ready-made solutions for many NLP tasks, including question answering, semantic search, and even data labeling! Head over to our GitHub repository to learn more about Haystack. If you like what you see, don’t hesitate to give us a star!

If you get stuck using Haystack and need help, or want to share your Haystack-powered NLP system, join our Discord and connect with other Haystack users and the deepset team.