The Beginner’s Guide to Text Embeddings

TLDR

Key Metrics:

Here at deepset, our goal is to empower everyone to benefit from the latest advances in natural language processing (NLP). The foundational technology for modern-day NLP is the text embedding, also known as a “vector”: without word and sentence vectors, there would be no cutting-edge translation apps, no automated summaries, and no semantic search engines as we know them. What’s more, embeddings can be used to represent other data types as well, like images and audio files.

‍

But while the basic idea behind embeddings is enticingly simple, it can be a bit hard to wrap your head around exactly how these data structures manage to represent words and longer text in a way that conveys their meaning, rather than just their lexical form. So in this blog post, we want to take a beginner-friendly approach to the technique known as text vectorization or text embeddings, and show how vectors power modern NLP technologies like semantic search and question answering.

Text embeddings and their uses

The term “vector,” in computation, refers to an ordered sequence of numbers — similar to a list or an array. By embedding a word or a longer text passage as a vector, it becomes manageable by computers, which can then, for example, compute how similar two pieces of text are to each other.

‍

The number of values in a text embedding — known as its “dimension” — depends on the embedding technique (the process of producing the vector), as well as how much information you want it to convey. In semantic search, for example, the vectors used to represent documents often have 768 dimensions.

There are two main types of embedding techniques for vectorizing text, known as “sparse” and “dense” respectively. Both consist of ordered numbers, but they differ both in their method of generation and in the type of text properties they can encode.

Sparse versus dense embedding techniques

Imagine you want to represent all the words in the famous line “to be or not to be” as computer-readable numbers. The simplest solution would be to assign each word to a number in alphabetical order. Thus, “be” would be represented as 1, “not” as 2, “or” as 3, and “to” as 4.

‍

The problem with this representation, however, is that it implies that the words can be put in order of size. But “to” isn’t in any way “bigger” than “be”. By encoding words as vectors instead of single numbers, NLP practitioners are able to represent all words in a uniform way. Each unique word is associated with a position in the vector, so that we can transform “be” to the vector [1, 0, 0, 0] and “not” to [0, 1, 0, 0].

Because these vectors consist mostly of zeros, they are known as “sparse” or “one-hot” encodings. Their dimension depends on the size of the vocabulary — the set of unique words in your text or text collection. Adapting this technique for larger units of text, we can represent the entire sentence “to be or not to be” by marking not just the occurrence of a word, but also how many times the word occurs in the passage.

And just like that, we have managed to encode our text as a vector (also known as a “bag of words” or “BoW” embedding — because it ignores the order of the words in the sentence).

‍

Over the years, researchers in NLP have come up with more elaborate techniques for producing sparse vectors — most notably, tf-idf and BM25. For a more in-depth look at these, have a look at our Guide to text vectorization. Rather than simply signaling the occurrence of a word, these advanced sparse techniques also compute a notion of how significant a word is. This helps to represent a text in terms of its characteristic words — the words it uses more often than other texts — rather than the words that occur most often overall. For example, in a fantasy novel, the characteristic words might be “dragon” and “wizard”, but the most frequent words might be “the” and “and”, just like in any other text.

‍

What all sparse word vectorization techniques have in common, however, is that

‍

their dimension depends on the size of the vocabulary
they can be computed quickly
they are context-free — there’s no way to encode which entity a pronoun refers to, for example
they can’t represent unknown words
they have no notion of semantics (the meanings of words and sentences).

‍

The last three properties are particularly indicative of these techniques’ shortcomings. Neural network-based embeddings like BERT address them by training a system that can turn words and larger units of texts into embeddings of a user-definable size. These vectors are dense rather than sparse, which means most of the values in them are non-zero.

‍

BERT and similar vectorization models produce vectors which

‍

have a fixed size
take a bit longer for the embedding process
have a notion of how words are related to each other — they encode the information that, for example, the pronoun he can refer to a male-presenting person
can represent previously unseen words through a technique known as “Byte Pair Embeddings”
are able to encode semantics.

‍

The dimensions of these vectors — usually a few hundred in number — represent different properties that words can have. These properties are a result of the training process, and, though abstract, likely imitate many of our own intuitions about the semantic and syntactic characteristics of words. (Once more, for a more detailed discussion of different dense embedding techniques, check out our text vectorization guide.) If all of this sounds a bit complicated, have no fear — the next sections provide some visual reference to help conceptualize the information contained in a dense vector.

Using dense vectors to plot meaning

To help think about how vectors can be visualized, consider the global coordinate system, where latitude is always in the first position and longitude in the second. In the context of the surface of the earth, there are only two axes to move across: north to south and east to west. Therefore, a vector of two dimensions, where we understand that the first number situates us north-south and the second east-west, allows us to represent any location on the planet, and to compute exactly how close they are to one another.

The same concept can be applied to text. Imagine that we have a collection of nouns that we’ve turned into word vectors, using a dense embedding technique. If we simplify these vectors with hundreds of dimensions to ones with only two dimensions(through mathematical dimensionality reduction techniques like t-SNE), we can plot them on a similarly designed two-dimensional grid. (For an interactive tool that lets you explore the relations between words in a three-dimensional space instead, check out TensorFlow’s Embedding Projector.)

Note how we’ve moved up one step — or rather, taken a big leap — on the abstraction ladder. The two axes of the coordinate system no longer represent familiar notions like north-south and east-west. But looking closely at the diagram, we can see how similar words form little clusters: there’s a group of vehicles in the lower left corner, and the upper right one seems to be about drinks. Note also how, within the clusters themselves, there are degrees of similarity: coffee and tea, both being hot drinks, are closer to each other than to lemonade.

As we’ve learned, text vectorization techniques are able to embed not only words as vectors, but even longer passages. Let’s say we have a corpus (that is, a collection of related texts) of dialogue scenes that we’ve turned into dense vectors. Just like with our noun vectors earlier, we can now reduce these high-dimensional vectors to two-dimensional ones, and plot them on a two-dimensional grid.

By plotting the dense vectors produced by our embedding algorithm on a two-dimensional grid, we can see how this technology is able to emulate our own linguistic intuition — the lines “I’ll have a cup of good, hot, black coffee” and “One, tea please!” are, while certainly not equivalent, much more similar to each other than to any of the other lines of dialogue. Can we now go from here to an exact metric of how similar or dissimilar two pieces of text are? Of course we can!

Measuring the results

When we compare two vectors, it doesn’t matter what they represent — pairs of coordinates, pieces of text, or even images. We can simply apply general metrics for calculating the distance between them.

‍

There are common geometrical distance metrics like cosine similarity or the dot product, which are all suitable for comparing vectors. Note that, when we calculate the distance between two vectors mathematically, we don’t have to reduce them to two values like we did for visualizing them. That’s because linear algebra can easily handle those vectors’ high dimensionality (provided that they have the same number of dimensions), and thus arrive at more precise statements about how similar — or dissimilar — two vectors are.

‍

If our methodology for creating the vectors generated real semantic insight, we have therefore achieved the initial task: create semantic, computer-readable representations of text that can be compared to one another. These representations are, finally, the basis for most modern NLP-powered technologies like semantic search, translation, question answering, and automatic summarization.

Text vectors as the basis for all modern NLP

Both sparse and dense vectors can be used to find matches from a collection of documents, based on a query. But while sparse vectors can only match keywords literally, dense vectors are able to match a natural-language query on the basis of its semantics.

‍

That’s why semantic search is often used to manage large collections of documents through an intuitive search function. It is also the basis for advanced question answering (QA) systems. Imagine, for example, that you’re looking for an answer from a corpus of technical manuals. You know that the answer to your question is somewhere in there — you just don’t know where exactly.

‍

Rather than forcing you to try out different combinations of keywords to match the exact terminology used in the manuals, a modern QA system will allow you to query the corpus by asking questions in natural language, such as “How do I fix my fridge light?” or “Where is the button to reset my device?”.

If the system is set up properly, it will provide the answers to your questions — even if the underlying documents don’t use the same words. As we’ve seen, all this is made possible by the technology of dense text vectors.

Diving deeper into text vectors

Text vectors are the foundation for most modern-day NLP tasks. If you want to cut to the chase and start using semantic search technology for your own applications, have a look at our Haystack framework for composable NLP.

‍

Head over to our GitHub repository or chat about semantic search with our community on Discord!