Metadata Filtering in Haystack

How to make use of metadata filtering to boost the quality of answers in Haystack.

This article is the fourth in our series on optimizing Haystack question answering systems. We’ll link to the other articles here as they go online.

Language data is tricky. It’s full of redundancies and omissions, which often render it ambiguous. But thanks to Transformer-based language models, computers can now process natural language pretty well. However, these models are rather slow and computationally costly. That’s why it’s crucial to pre-filter large text collections in order to avoid unnecessary computations.

Metadata can greatly assist our deep language models by quickly providing them with a preselected section of the data pool. In this article, we’ll look at how to make use of metadata filtering to boost the quality of your answers in Haystack.

What Is Metadata?

Metadata is literally “data about data.” It has great value in machine learning applications: it helps us structure, filter and make sense of our data. Some metadata is simply a by-product of the data creation process. For instance, when a news article is published online, it invariably comes with a timestamp attached.

Other types of metadata require more elaborate processes to create. For example, news articles can be manually annotated for topics. But whenever human labor is involved in the metadata creation process, it becomes expensive. That’s one reason why high-quality labeled data is so precious.

Metadata can take on many forms. It can be represented by the Booleans “True” and “False,” continuous values like integers and floating point numbers, or categorical data like the topic that an article belongs to. All of these formats have something in common: they’re all instances of “structured data,” meaning that they present data in a way that’s predefined and easily searchable.
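For illustration, here’s what such a metadata record might look like for the news-article example above. It’s only a sketch: the field names and values are made up:

article_meta = {
    'is_breaking_news': False,         # Boolean
    'word_count': 843,                 # integer
    'average_sentence_length': 17.4,   # floating point number
    'topic': 'business',               # categorical
}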

Why Is Metadata Useful in Question Answering?

Language, on the other hand, is a prime example of “unstructured data.” The same idea can be expressed in infinitely many ways, and it’s clearly impossible to predict every single one of them. Transformers do a pretty good job of capturing statistical correlations in language data and estimating meaning. Combine that with the additional signal provided by structured metadata — and you’ve got yourself a powerful QA system.

If our text data comes with metadata, we can easily exploit that fact in a question answering task — by passing a filter to our retriever-reader pipeline. The retriever then only preselects those documents that match our filter. As a result, we greatly reduce the search space for the rest of the pipeline.

A Metadata Filtering Use Case

Metadata filtering is particularly rewarding in a scenario in which we have a large collection of texts and/or know exactly which slice of the data contains our answers. Imagine that your company wants to implement a question answering system that allows it to do competitor analysis. So your team starts collecting reports on your competitors, which are then saved with names and dates attached.

In your analysis, you want to look at one competitor at a time. To that end, you pass the retriever a filter with a company’s name and the financial years you’d like to investigate. You can then word your questions to the system freely, without explicitly stating the year and competitor you’re interested in. Not only does this improve your search speed, it also increases the likelihood of high-quality answers, since the documents that are irrelevant to your search have been taken out of the equation.
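In Haystack, such a filter could look roughly like the sketch below. The field names 'company' and 'year' are hypothetical and depend on how your reports were annotated, and the pipeline is assumed to be a retriever-reader pipeline like the one we build later in this article:

# Hypothetical filter for the competitor-analysis scenario:
# only reports about one competitor and two financial years are retrieved.
filter = {'company': ['Competitor Inc.'], 'year': ['2019', '2020']}
answers = pipeline.run('Which products did they launch?',
                       top_k_retriever=30, top_k_reader=3, filters=filter)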

Metadata Filtering in Haystack: A Practical Example

For our example, we’ll be working with a subset of the Amazon Review Dataset that consists of around 50,000 reviews of items from Amazon’s office supplies category. Each review comes with some metadata, such as the reviewer’s name and ID, the time of publication, and a short review summary.

We’ll be working with the product ID that goes by the name “asin.” To see an example product from the dataset, follow this link. If you want to look at a different product, simply change the ID at the end of the URL.

Each product has multiple reviews, which can be bundled by filtering for a product ID. Thus, even though this is quite a big dataset, filters allow us to look at only a tiny subset of the data. Let’s read our dataset into an Elasticsearch database (check out this blog post to learn how to set up Haystack with Elasticsearch).

Note: Metadata filtering is only possible in combination with Elasticsearch and Weaviate document stores.

import pandas as pd

# Read the reviews from the JSON Lines file and pull out texts and product IDs
reviews = pd.read_json('Office_Products_5.json', lines=True)
texts = reviews.reviewText.values
ids = reviews.asin.values

We’ve extracted both the review texts and their ID metadata from the JSON file using the pandas library. We now have to convert this data into the format that our document store expects. All the data points are turned into dictionaries with a ‘text’ key and a ‘meta’ key. The latter is for our metadata: here, we specify a dictionary which so far contains only the item’s ID.

dicts = [{'text': text, 'meta':{'item_id': id_}} for text, id_ in zip(texts, ids)]

To see what the format looks like — and to get a feel for our data — let’s have a look at one of the dictionaries in our list:

import random

random.choice(dicts)

>>> {'text': "The first set of cartridges worked perfectly...as they should.  They aren't shipped in the retail box, but individually sealed in the same air-tight plastic packaging that's in the box when it's opened.Absolutely no complaints.  :>)", 'meta': {'item_id': 'B003J9XN76'}}

Since we’re going to work with the dense DPR retrieval method, we let the preprocessor split our reviews into chunks of 100 words, with an overlap of five words between chunks.

Note: for more information on splitting in the context of DPR, take a look at this article.

# Depending on your Haystack version, the PreProcessor is imported from
# haystack.preprocessor (older releases) or haystack.nodes (newer releases).
from haystack.preprocessor import PreProcessor

processor = PreProcessor(split_by='word',
                         split_length=100,
                         split_respect_sentence_boundary=False,
                         split_overlap=5)
docs = [processor.process(d) for d in dicts]
flattened_docs = [d for list_of_dicts in docs for d in list_of_dicts]

Note: The next Haystack release will include an amended process() function which takes in and returns a list of documents, shortening the last two lines to: docs = processor.process(dicts).

How do our documents look now?

random.choice(flattened_docs)

>>> {'text': "Done have much to say about it. It work fine. Don't see the different between this and any other mouse pad. It just look and feel like a plastic that is cut into the shape of a mouse pad.", 'meta': {'item_id': 'B0017D5Z40', '_split_id': 0}}

After splitting, the meta dictionary has received another key that specifies the split ID of a document. This is useful if we’re working with long documents that we want to reconstruct from the chunks later on. In general, the ‘meta’ dictionary can hold as much metadata as we think is useful. In the next step, we initialize our document store. We then delete any leftovers from the database and read in our flattened list of document dictionaries.

from haystack.document_store import ElasticsearchDocumentStore  # haystack.document_stores in newer releases

document_store = ElasticsearchDocumentStore(similarity="dot_product")
document_store.delete_documents()   # remove any leftovers from previous runs
document_store.write_documents(flattened_docs)

We then start up the retriever, index our database, and load a RoBERTa model fine-tuned on SQuAD as our Reader:

from haystack.retriever.dense import DensePassageRetriever  # haystack.nodes in newer releases
from haystack.reader.farm import FARMReader                 # haystack.nodes in newer releases

retriever = DensePassageRetriever(document_store=document_store)
document_store.update_embeddings(retriever)   # index the database with DPR embeddings
my_model = "deepset/roberta-base-squad2"
reader = FARMReader(model_name_or_path=my_model, use_gpu=True, return_no_answer=True)

Finally, we combine retriever and reader in a pipeline:

from haystack.pipeline import ExtractiveQAPipeline  # haystack.pipelines in newer releases
pipeline = ExtractiveQAPipeline(reader, retriever)

Now that our pipeline is up and running, let’s ask some questions. Let’s say that we want to learn more about this handy office supply item. We’ve chosen it because it’s the product with the most reviews:

reviews.groupby('asin').size().sort_values(ascending=False).head(1)

>>> asin
    B0010T3QT2    311
    dtype: int64

This envelope has 311 reviews — a decent-sized knowledge base. To let our system know that we’re only interested in reviews about this item, we write a filter that closely resembles the meta dictionaries in our database:

filter = {'item_id': ['B0010T3QT2']}

Note that the values in the filter dictionary need to be included in a list. It’s time to ask our question:

q = 'How well does this envelope stick?'
answers = pipeline.run(q, top_k_retriever=30, top_k_reader=3, filters=filter)

Our pipeline returns the result in the form of a dictionary which holds a lot of information. We use a custom print function which only prints out the answers and their probability values:

def short_answers(answers):
    for i, answer in enumerate(answers['answers']):
        print('Answer no. {}: {}, with {} percent probability'.format(i, answer['answer'], round(answer['probability'] * 100)))

Let’s look at our system’s answers:

short_answers(answers)

>>> Answer no. 0: They hold up well, with 66 percent probability
>>> Answer no. 1: seal very effectively, with 60 percent probability
>>> Answer no. 2: no-lick seal, with 60 percent probability

Even if our question wasn’t perfectly worded, we still got some useful information. If we want to obtain more information about the extracted answers, such as their context, we can use the print_answers() function. Since the output of that function is quite wordy, let’s just look at the first answer:

from haystack.utils import print_answers

print_answers(answers, details='all')

>>> {   'answers': [   {   'answer': 'They hold up well',
        'context': 'As envelopes these are fairly sturdy and '
            'good quality. They hold up well, the flap '
            'stays down, the adhesive stays sticky, and '
            "you don't have to lick t",
        'document_id': '88614629e0ff8bf2caf56332184cd18b',
        'meta': {'_split_id': 0, 'item_id': 'B0010T3QT2'},
        'offset_end': 72,
        'offset_end_in_doc': 72,
        'offset_start': 55,
        'offset_start_in_doc': 55,
        'probability': 0.6619654893875122,
        'score': 10.91701889038086},
...
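Since the result is just a nested dictionary, you can also pick out individual fields yourself rather than printing everything. For example, using the keys we saw in the output above:

top_answer = answers['answers'][0]          # best-scoring answer
print(top_answer['answer'])                 # 'They hold up well'
print(top_answer['meta']['item_id'])        # 'B0010T3QT2'
print(round(top_answer['probability'], 2))  # 0.66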

For comparison, we asked the same question of the entire database. The system took slightly longer to return its answers, but more importantly, we can’t be sure that they are really about the product we’re interested in:

answers = pipeline.run(q, top_k_retriever=30, top_k_reader=3)
short_answers(answers)

>>> Answer no. 0: firmly, with 60 percent probability
>>> Answer no. 1: quickly and smoothly, with 79 percent probability
>>> Answer no. 2: strong enough to adhere to an envelope without worries, with 79 percent probability

Did the system really understand our query? The third answer, for example, doesn’t really seem to answer our question. How does it look in context?

print_answers(answers, details='all')
>>> ...
        {   'answer': 'strong enough to adhere to an envelope '
                     'without worries',
              'context': 'These have never jammed my printer, are '
                    'easy to peal off, and are strong enough to '
                    'adhere to an envelope without worries.',
              'document_id': 'e86c6890c8ef0a542c48969cba7fd52a',
              'meta': {'_split_id': 0, 'item_id': 'B00004Z5SM'},
              'offset_end': 120,
              'offset_end_in_doc': 120,
              'offset_start': 66,
              'offset_start_in_doc': 66,
              'probability': 0.7904667854309082,
              'score': 10.183916091918945}],

Clearly, that answer was about the adhesive power of a different item. With filtering, we wouldn’t have had to worry about these issues. Now what if we were to ask about a general class of products? As we saw, we can pass our filter conditions as a list within a dictionary. That list can easily be expanded to contain more items. This time, however, let’s look at a different kind of office supply:

filter = {'item_id': ['B00006IBLJ', 'B000GHJM9C', 'B000CS787S']}
q = 'Can things still break when they\'re wrapped in bubble wrap?'
answers = pipeline.run(q, top_k_retriever=100, top_k_reader=3, filters=filter)
short_answers(answers)

>>> Answer no. 0: no breakages, with 81 percent probability
>>> Answer no. 1: It tears, with 72 percent probability
>>> Answer no. 2: They still managed to break some things somehow, with 34 percent probability

Bad news for bubble wrap fans? As you can see, this is quite a fun dataset. We encourage you to download it and play around with it to keep exploring the possibilities of filtering with Haystack. For instance, why not add the reviewers’ IDs to the meta dictionaries and investigate what particular people had to say about different office supply items?
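As a starting point for that exercise, here’s a rough sketch, assuming the dataset stores the reviewer ID in a column called reviewerID (the filter value below is just a placeholder):

reviewer_ids = reviews.reviewerID.values   # assumes a 'reviewerID' column
dicts = [{'text': text, 'meta': {'item_id': item_id, 'reviewer_id': reviewer_id}}
         for text, item_id, reviewer_id in zip(texts, ids, reviewer_ids)]

# After re-indexing, you could filter for a single reviewer, or combine fields:
filter = {'reviewer_id': ['SOME_REVIEWER_ID'], 'item_id': ['B0010T3QT2']}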

If you want to learn more about the item and reviewer IDs, there’s even an entire dataset about that. You can find it on the same page as the review dataset itself. Metadata about metadata!

Ask Questions About Your Data with Haystack

If your dataset contains metadata, why not take advantage of it? Filtering by metadata in Haystack is super easy. Check out the Haystack repository on GitHub to start building your own question answering systems. Last but not least, we’d greatly appreciate it if you gave us a star as well :)