Question Answering on Structured Data

TLDR

Key Metrics:

Slowly but surely, natural language is becoming the primary interface to data. You can now do semantic question answering not only on unstructured text, but also on structured tabular data. Tabular question answering is a powerful tool for extracting and aggregating information from tables and mixed table-text formats, as you would find in financial statements or technical documentation.

‍

In this article, we introduce Haystack’s newest tools for question answering on tables. We’ll be covering some concepts and terminology, before moving on to practical code examples of the tools in action. Learn how to use the Haystack NLP framework to extract tables from documents, retrieve best results from a table corpus, and do question answering on tabular data.

What is structured data?

Structured data is presented in a standardized format. This makes structured data readily processable by computers. For instance, a two-dimensional table follows the format of columns on the x-axis, and rows, or records, on the y-axis. The columns normally represent features, while the records stand for individual data points. Consider this table of 100-meter dash world record times, ordered by gender and geographical region (from Wikipedia):

With a little practice, you can readily start looking up information in a table — simply pick the rows and columns that interest you. For example, if you wanted to know the Asian men’s record holder, you would simply retrieve the “Athlete” column from the row labeled “Asia” in the left half of the table.

‍

As another example, you might want to know the average time of all the regional women’s champions. To find out, you’ll need to perform an aggregation operation: Add the values in the “Time” column in the right half of the table and divide the result by the number of records.

‍

In a query language like SQL, such operations are formalized in statements. For example, to get the same information, you can write a query along the lines of:

‍

SELECT AVG(Time) FROM records_by_region WHERE gender = ‘Women’;

‍

Unstructured data, on the other hand, does not follow a standardized format and does not have any predefined input categories. Text is a prime example. A sentence can have any length, and its words draw from an infinite vocabulary. That is why attempts at formalizing natural language in a manner similar to SQL have been largely replaced by Transformer-based language models.

Question answering on structured data

To us humans, language is a very natural way of interacting with the world. Thanks to new Transformer-based models that have been trained specifically on tabular data, you can now pose questions in natural language to a collection of tables, or even a mixed corpus of tables and text documents. For example, you can ask, “Who’s the fastest Asian woman?” and get the answer “Li Xuemei.”

‍

The benefit of a Transformer-based approach over a query language like SQL is apparent. Not only is it more intuitive to ask a question in natural language, but the Transformer’s semantic approach to language also means that you don’t need to know the exact names of your columns and records; the model itself can infer the semantic relationship of a word pair like “fast” and “sprint.”

‍

This is great news for when you have documents that contain tables. You can now perform question answering on any tabular data, like CSV, JSON, and Excel files, and SQL exports and pandas dataframes. Thanks to Haystack’s newly implemented AzureConverter, you can even extract tables from your documents and pass them on to Haystack’s tabular data tools. In the following section, we’ll take a closer look at these tools.

Haystack tools for question answering on tables

Most semantic question answering systems have two main components: a retriever and a reader. The retriever extracts the most suitable documents from a database in response to a query. The reader then reads those documents and returns the most likely answers. The two modules are often combined in a pipeline to allow for a seamless question answering process on a large collection of documents. To process structured tabular data, you will need to work with retriever and reader models that have been trained specifically on datasets with structured tabular data.

TableTextRetriever: Do retrieval on a mixed corpus

Our new TableTextRetriever can do retrieval on a corpus of both tables and texts. The model uses three separate encoders for the query, texts, and tables, as we’ve found that this approach produces the best results. If you’re interested in the details, make sure to check out our paper “Multi-modal Retrieval of Tables and Texts Using Tri-Encoder Models” to learn more.

‍

All three of the TabelTextRetriever encoders rely on BERT language models. Since BERT is only equipped to work with one-dimensional data, tables have to be “flattened” before they’re encoded — all their components, like the title, caption and rows, are concatenated before they’re fed to the model. The retriever then matches the query embedding to the table and text embeddings in the corpus and returns the best candidate documents.

Example of the TableTextRetriever in action

We’ll be working with a subset from the “Open Table-and-Text Question Answering” (OTT-QA) dataset, which is made up of texts and tables from Wikipedia, as well as questions to query them. For this example, we have chosen a subset of 1000 tables from OTT-QA and have added them to our Elasticsearch document store (to see the full code, see our Table QA tutorial).

‍

The selection of tables in our sub-corpus is random and therefore quite varied. There is one table titled “List of shipwrecks in January 1917” and one called “Zendaya discography.” A lot of the tables however are competitive sport-related, similar to the one you saw earlier. We start up the retriever and use it to update the table embeddings in the data store:

‍

from haystack.nodes.retriever import TableTextRetriever

query_embedding_model=”deepset/bert-small-mm_retrieval-question_encoder”

passage_embedding_model=”deepset/bert-small-mm_retrieval-passage_encoder”

table_embedding_model=”deepset/bert-small-mm_retrieval-table_encoder”

retriever = TableTextRetriever(document_store, query_embedding_model, passage_embedding_model, table_embedding_model, embed_meta_fields=[“title”, “section_title”])

document_store.update_embeddings(retriever=retriever)

‍

We are now ready to ask the first question:

‍

retrieved_tables = retriever.retrieve(“Who is the fastest runner?”, top_k=5)

‍

The retriever returns the top five results for our query. Since printing out the full tables would take up too much space, let’s just look at the tables’ titles (open up the Table QA tutorial in Colab if you want to find out more about the documents):

‍

{i: doc.id for i, doc in enumerate(retrieved_tables)}

‍

This returns the following titles:

‍

1‍{0: “Europe’s_Strongest_Man_0”, 1: 
2“2001_World_Championships_in_Athletics_-_Men’s_110_metres_hurdles_2”, 2: 
3“Athletics_at_the_2010_Asian_Games_-_Women’s_1500_metres_2”, 3: 
4“Athletics_at_the_1999_Summer_Universiade_-_Men’s_110_metres_hurdles_0”, 4: 
5“Open_water_swimming_at_the_2009_World_Aquatics_Championships_-_Women’s_10_km_0”}

‍

For this example, we worked with a collection of tables. However, the TableTextRetriever has been designed to handle mixed collections of documents. In Haystack, you can use the AzureConverter to turn PDFs and other file formats into a corpus of tables and text passages. To work with the converter, which uses Azure’s Form Recognizer, you’ll need an Azure account. The converter returns a JSON file that contains the tables and the text from a document.

TableReader: Question answering on tables

The TableReader in Haystack is the node for doing question answering on tables. It accepts a QA model that has been trained on tabular data. By default, the TableReader uses Google’s TaPas model. TaPas is equipped to encode a table’s structure along with its content, providing separate embeddings for column, row and rank indices. To answer a query, the model selects the cells from the table that contribute to the answer and aggregates them automatically.

Example of the TableReader in action

The TableReader accepts a table and returns an answer, based on the TaPas model’s reading of the table. Let’s set up the TableReader and repeat our question from earlier, passing it one of the tables we retrieved above.

‍

‍from haystack.nodes import TableReader

reader = TableReader(model_name_or_path=”google/tapas-base-finetuned-wtq”, max_seq_len=512)

‍

From the results returned by the retriever, the third table looked the most promising, so we’ll pass it on to our reader:

‍

table = document_store.get_document_by_id(retrieved_tables[2].id)

prediction = reader.predict(query=”Who’s the fastest runner?”, documents=[table])‍prediction[‘answers’][0].answer

>>> Maryam Yusuf Jamal ( BRN )

‍

That is the correct answer. In addition to the answer, the model returns the positions of the cells it considered for the answers, and information on whether an aggregation function was used.

‍

Because of the underlying representation — TaPas, as a language model, works only with strings — we need to cast our time data in the right format before performing any aggregation function on it. In order for the next example to work, we converted the time strings to seconds using pandas’ to_timedelta() and back into strings before adding the table to the document store.

‍

Consider the following example:

‍

prediction = reader.predict(query=”How fast were the runners on average?”, documents=[table])‍

prediction[‘answers’][0].answer

>>> 257.99636363636364

‍

Remember that this answer is a string representation of the time value in seconds. How did the model arrive at this result? Let’s find out by printing out the answer’s metadata:

‍

prediction[‘answers’][0].meta

>>> {‘aggregation_operator’: ‘AVERAGE’,

‘answer_cells’: [ 

‘248.22’, 

‘249.58’, 

‘250.42’, 

‘251.76’, 

‘253.46’, 

‘256.42’, 

‘259.62’, 

‘259.99’, 

‘261.88’, 

‘267.73’, 

‘278.88’]}

‍

A quick look at the metadata reveals that yes, it aggregated the right cells, i.e., the entire “Time” column.

Setting up a pipeline for Table QA

We can combine TableTextRetriever and TableReader in a pipeline object to allow for question answering on a large collection of tables.

‍

from haystack import Pipeline

table_qa_pipeline = Pipeline()

table_qa_pipeline.add_node(component=retriever, name=”TableTextRetriever”, inputs=[“Query”])

table_qa_pipeline.add_node(component=reader, name=”TableReader”, inputs=[“TableTextRetriever”])

‍

Let’s run our pipeline with a new question:

‍

prediction = table_qa_pipeline.run(“Who is the best chess player?”)

prediction[‘answers’][0].answer

>>> Otto Löwenborg 3 — Anton Olson 2

‍

Given the limited amount of knowledge in our small database, this result seems to make sense — it comes from a table about the Swedish Chess Championship. Note that at the moment, the pipeline only works on tables and not on a mixed corpus of tables and text documents.

Get started with Haystack’s Table QA modules

Are you as excited as we are about the current state of table question answering? If you routinely look for answers from your tabular data, the answer is probably yes. Head over to our GitHub repository to try out our nodes for table retrieval and question answering for yourself. If you like what you see, please don’t hesitate to give us a star :)

‍

We also invite you to join our Discord community, where you can stay in contact with other Haystack users and the deepset engineering team.

Curious about building AI Apps and Agents?

Book Demo

meet the author

Isabelle Nguyen

Technical Content Writer

Table of Contents

What is metadata?

GET STARTED WITH A PERSONALIZED DEEPSET DEMO

Book demo