Document Search: The Art of Finding

A friendly introduction to search, retrieval, and their role in the large language model (LLM) era 

Document search is one of those technologies easily taken for granted. Thanks to modern search engines, we’ve all become accustomed to receiving relevant information from the vast trove of data that is the internet in a matter of seconds.

These days, however, document search is receiving renewed interest as a technique for making generative artificial intelligence (AI) safer. As a retrieval module, it can ensure that a generative model only sees documents from a reliable and fact-checked database. This makes document search one of the most effective safeguards against hallucinations in large language models (LLMs).

But newbies who want to learn about document search are often bombarded with complex terminology and advanced mathematical concepts. We’ve written this high-level introduction to help more people understand the intuition behind document search and why it’s so good at solving so many of our problems.

What is document search?

Document search finds relevant documents in response to a query. In this context, the terms “query” and “documents” can both mean many things. To understand those meanings better, let’s look at a few examples from the real world that resemble document search:

  • A friend who knows that you enjoyed a certain book (the query) recommends more books like it (the documents).
  • A lawyer researches a case (the query) thoroughly and finds all the legal documents related to it (the documents).
  • A librarian retrieves the volumes (the documents) you need to study for an exam (the query).

Document search as a technology becomes necessary when manual searching is no longer sufficient: for example, when your collection of documents is too large to search manually or when you want to be able to perform many searches simultaneously and at any time. As a result, the vast majority of business applications that work with textual data require some form of document search.

A more traditional approach to the problem of search makes use of a document’s metadata: labels, which can easily be searched. This is common, for instance, in library systems, which categorize books by genre, author, time of writing, etc. This method has three main shortcomings, however: It requires manual work (labeling), it restricts search to a predefined set of labels, and it doesn’t capture a document’s content but just the categories associated with it at the time of labeling.
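To make the limitation concrete, here is a minimal sketch of label-based search. All data and function names are illustrative: each book carries manually assigned labels, and a search can only match those labels exactly, never the book's actual content.

```python
# Hypothetical library catalog with manually assigned metadata labels.
books = [
    {"title": "The Long Night", "genre": "mystery", "author": "A. Doe"},
    {"title": "Star Charts", "genre": "sci-fi", "author": "B. Roe"},
    {"title": "Quiet Rivers", "genre": "mystery", "author": "C. Poe"},
]

def search_by_label(collection, field, value):
    """Return all entries whose metadata field matches the value exactly."""
    return [book for book in collection if book.get(field) == value]

mysteries = search_by_label(books, "genre", "mystery")
print([b["title"] for b in mysteries])  # only predefined labels can be queried
```

A query for any label that wasn't assigned up front (say, a theme or a plot element) simply comes back empty, which is exactly the restriction described above.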

In order for a computer to search and compare text-based documents, they must be represented in a machine-readable format. This brings us to a fundamental concept in machine learning and natural language processing (NLP): the text embedding, or vectorized text.

Vectorized text = searchable text

In programming parlance, text is sometimes referred to as unstructured data. Such data is notoriously unpredictable with regard to its length and its content. To understand this better, let’s imagine we open a database of online news and pull out an article at random. Before looking at it, we can’t say how many characters it will have, which words it will use, or how long the paragraphs will be.

All these factors make textual data hard for a computer to handle. Vectors — a list-like data type in programming — offer a way of standardizing different pieces of text by transforming them into fixed-length lists of numbers. If this sounds complicated, it’s because it is! The good news is that we don’t need to understand the details of text vectorization. It’s enough to know that many different algorithms exist that can turn unstructured text into structured vectors.
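As a toy illustration of the core idea, here is a minimal bag-of-words vectorizer. Real embedding models are vastly more sophisticated, and the vocabulary and texts below are invented for the example, but the output shape is the same: text of any length in, fixed-length vector of numbers out.

```python
# A shared, fixed vocabulary: every text is mapped onto these positions.
vocabulary = ["dog", "cat", "food", "walk"]

def vectorize(text):
    """Turn a text of any length into a fixed-length list of word counts."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

v1 = vectorize("My dog loves dog food")
v2 = vectorize("Take the cat for a walk")
print(v1)  # [2, 0, 1, 0]
print(v2)  # [0, 1, 0, 1]
assert len(v1) == len(v2) == len(vocabulary)  # always the same length
```

However different the input texts are, the vectors always have the same length, which is what makes them easy for a computer to store and compare.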

The different methods for embedding text in vectors vary widely in their complexity, their theoretical underpinnings, the computational resources they require, and so on. But they all share the following properties:

  • They are able to transform the documents into vectors.
  • They measure how similar the query is to each document, either by vectorizing the query itself or by comparing the query's words against the document vectors.
  • They return the most similar documents in response to the query.
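The steps above can be sketched end to end with toy count vectors and cosine similarity. Everything here (vocabulary, documents, queries) is illustrative, and production systems use learned embeddings rather than raw word counts, but the three-step flow is the same: vectorize the documents, vectorize the query the same way, return the most similar documents.

```python
import math

vocabulary = ["dog", "cat", "chocolate", "walk"]

def vectorize(text):
    """Fixed-length word-count vector over a shared vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

def cosine_similarity(a, b):
    """Similarity of two vectors, from 0 (unrelated) to 1 (same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

documents = [
    "never feed your dog chocolate",
    "cats enjoy a quiet afternoon",
    "take your dog for a walk daily",
]

def search(query, docs, top_k=2):
    """Rank documents by their similarity to the vectorized query."""
    query_vec = vectorize(query)
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(query_vec, vectorize(d)),
        reverse=True,
    )
    return ranked[:top_k]

print(search("is chocolate safe for my dog", documents))
```

The document sharing the most vocabulary with the query ranks first; the one about cats, which shares nothing, ranks last.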

For a visually based introduction to this technology, check out our beginner’s guide to text embeddings.

Searching, finding, and changing the game

Now that we’ve gained a high-level understanding of what document search is and how it works, let’s turn to three of the most impactful applications of document search in today’s technical landscape.

Semantic site search

Nothing makes for a more frustrating user experience than not being able to find something when you know it is there. If you have a user-facing product based on text — such as a software documentation site, a platform for customer reviews, or an online database of scientific papers — offering a good search experience is the greatest service you can provide to your users.

[Illustration: a person sits at a desk in front of a screen that reads "Query: Can I give my dogs sweets?" and the reply "You should never, ever, ever feed your pet chocolate or candy."]

Semantic search is a specific type of document search that is able to encode semantic information — that is, the meaning of words and sentences rather than their lexical form. This makes for a more intuitive search experience. 

With semantic site search, your users don’t have to rack their brains to figure out how something is phrased within the database. Instead, they can ask their questions as they would in natural conversation — and receive answers based on their query’s meaning.

Document retrieval

Language models like GPT and BERT are experts at understanding the nuances of natural language. However, they aren’t able to read through vast amounts of text: most of these models are only able to ingest a few documents at once. How can we automate the step of selecting the right documents to pass on to the language model? 

You probably know the answer to that question by now. Any type of document search — whether rooted in semantic or lexical similarity — can precede the finer-grained LM-powered step. As a retrieval module, document search acts as the sieve to the language model’s fine mesh. 
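The sieve idea can be sketched as a two-stage pipeline. All names and data below are illustrative rather than any real library's API: a cheap keyword retriever narrows the collection down to a handful of documents, and only those reach the expensive language-model step (stubbed out here).

```python
documents = [
    "The warranty covers manufacturing defects for two years.",
    "Our offices are closed on public holidays.",
    "Warranty claims require the original receipt.",
]

def retrieve(query, docs, top_k=2):
    """Cheap first stage: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer_with_llm(query, context):
    """Stand-in for the expensive step; a real pipeline would call an LLM here."""
    return f"Answer grounded in {len(context)} retrieved documents."

context = retrieve("how do I make a warranty claim", documents)
print(answer_with_llm("how do I make a warranty claim", context))
```

Only the two warranty-related documents are handed to the language model; the irrelevant one about office hours is filtered out before the costly step ever runs.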

[Diagram: an example pipeline in which a query enters a retriever powered by document search and then a second node (an LLM or a summarizer), which generates an answer at the end.]

The vast majority of industry LLM applications use document retrieval as a preselection mechanism for use cases like generative AI, extractive question answering, summarization, and others. This way, they make sure the language model bases its answer on documents from a fact-checked database rather than returning outdated information or hallucinating.

File similarity

Sometimes one document isn’t enough — you need more, or all, like it. This is especially true in areas where thorough research is critical, such as the legal and compliance industries. File similarity leverages document search to find the documents that are most similar to a given document.
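File similarity reuses the same machinery as query-based search, except the "query" is itself a document vector, ranked against the rest of the collection. The vectors below are toy stand-ins for learned embeddings, and the file names are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors, from 0 (unrelated) to 1 (same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Pretend these are the embeddings of three legal files.
files = {
    "contract_A": [0.9, 0.1, 0.0],
    "contract_B": [0.8, 0.2, 0.1],
    "memo_C": [0.0, 0.1, 0.9],
}

def most_similar(name, collection):
    """Return the name of the file whose vector is closest to the given file's."""
    target = collection[name]
    others = [
        (other, cosine_similarity(target, vec))
        for other, vec in collection.items()
        if other != name
    ]
    return max(others, key=lambda pair: pair[1])[0]

print(most_similar("contract_A", files))  # contract_B
```

Because the two contract vectors point in nearly the same direction, the system recommends one contract given the other, while the memo stays out of the way.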

[Illustration: a person holding a book stands in front of a screen that reads: "Since you loved 'Purple Hibiscus,' here are a few other novels with a similar theme to check out: The House on Mango Street, Persepolis, My Brilliant Friend, ..."]

One of our clients, the Austrian publishing house Manz, has built a smart recommendation system that leverages file similarity for an improved and more intuitive user experience. Thanks to Manz’s NLP-driven product, lawyers and other legal professionals can identify useful documents faster and more systematically — just another testament to how document search makes text-based tasks not only more effective but also more enjoyable. 

Stay on top of the data revolution — with deepset

Document retrieval, semantic search, and file similarity are just a few examples of how NLP is revolutionizing how we interact with textual data. As data keeps accumulating in public and private warehouses, recent advances in language modeling have given us the tools to handle that data in novel and impactful ways.

At deepset, we want to empower everyone to build the search systems that best serve their use cases, driven by the impressive advances in large language models and NLP at large.

For more NLP insights and the latest on LLMs, check out the rest of this blog — and be sure to follow us on Twitter and LinkedIn. 🙂