Semantic search with embeddings: index anything

Building scalable semantic retrieval from image, text, graph, and interaction data

Links Complete list of references

We live in a world with an explosion of information. There are millions of clothes, songs, movies, recipes, cars, houses, which one should you pick? Semantic search can find the right one for any taste and wish!

In this article, I will introduce what is semantic search, what can be built with it, and how to build it. For example, why do people look for clothes? They like the brand, the color, the shape, or the price. All these aspects can be used to find the best one.

The color and shape can be found using the image, and the price and brand are found in trends.

Images and trends can be represented as small vectors called embeddings. Embeddings are the core of semantic search: once items are encoded as vectors, it’s fast and efficient to search for the best items.

I will explain how semantic search works: encoding items as embeddings, indexing them, and using these indices for fast search in order to build semantic systems.

  1. Motivation: why search, why semantic search?
    1.1 What is search?
    1.2 What is semantic search and what can be built with it?

1. Motivation: why search, why semantic search?

1.1 What is search?

For thousands of years people have wanted to search amongst documents: think about the huge libraries containing millions of books. It was possible then to look in these books thanks to people carefully sorting them by author names, publication date,… Indices of the books were carefully built and it was possible to find books by asking experts.

30 years ago the internet became popular and with it, the number of documents to search from went from millions to billions. The speed at which these documents went from a few thousand every year to thousands every day, it wasn’t possible anymore to index everything by hand.

That’s when efficient retrieval systems were built. Using appropriate data structure it is possible to index billions of documents every day automatically and query them in milliseconds.

Search is about fulfilling an information need. Starting from using a query of any shape (questions, list of items, images, text documents,…), the search system provides a list of relevant items. Classical search systems build simple representations from text, image, and context and build efficient indices to search from them. Some descriptors and techniques include

Although these systems can be scaled up to very large amounts of content, they often suffer from difficulties to handle the meaning of the content and tend to stay at the surface level.

These classical retrieval techniques provide solid foundations for many services and applications. However, they cannot fully understand the content they are indexing and as such cannot answer in a relevant manner some queries over some documents. We will see in the next sections how embeddings can help.

1.2 What is semantic search and what can be built with it?

The main difference between classical search and semantic search is to use small vectors to represent items.

Visual search can be used to look for plants PlantNet, but also to look for clothes

Using embeddings is powerful: it can be used to build systems that can help users find items they like (music, product, videos, recipes, …) using many kinds of queries. It can not only work in explicit search systems (inputting a query in a search bar) but also in implicit ones (relevant products in retailer websites, personalized news in publishers, interesting posts on social platforms).

Many kinds of systems can be built on top of search.

  • A text search system takes as input a text query and returns results: searching for clothes, for songs, for news

It is possible to search for a variety of items, anything that has pictures, text, audio, or is available in a context. Popular examples of such systems are Google lens, Amazon recommendation, or newer ones for fashion search, plant search, …

On a smaller scale, it can be interesting to index your pictures, your messages, to find a tv show among many, to find actors in a tv show, …

2. Overall design: how to build semantic search?

A semantic search system is composed of two parts: an encoding pipeline that builds indices, and a search pipeline that lets the user use these indices to search for items.

3 Encoding pipeline: from items to vectors and indices

The first step to build a semantic retrieval system is to encode items into small vectors (hundreds of dimensions). This is possible for many items and can then be used to index them and search amongst them efficiently.

3.1 Extract relevant data

Retrieval systems can encode items from many different aspects, so it’s important to think about what to encode. Some examples of items to encode are clothes, toys, animals, news, songs, movies, recipes. Each of them has different characteristics: they can be expressed by how they look like, how they can be described, how they appear amongst other items.

All this information can be encoded as embeddings. A different axis to think about is how many items to encode? Are all the items unique or does it make more sense to group them by relevant characteristics? Are there some items that are more relevant and should be a priority? Making that choice early can have dramatic consequences on the rest of the system.

Star Wars character C-3PO
This Star Wars character C-3PO can be encoded with a picture of it, a description, how it appears in a graph (it’s in a star wars movie, appearing at the date of these movies, …), how popular it is but also that it appears along with R2-D2 often, and it has a robotic voice. All this information can be relevant. Which one to pick may affect the performance of the system a lot.

For a recommendation system, the co-occurrence information might work the best, but for a visual search system, the picture might be the most relevant.

3.2 Encode the items

3.2.1 Content

Items can be encoded based on their content. Clothes can be well represented with images. Sounds are identified by their audio content. News can be understood using their text. Deep learning models are very good at producing representations of content that have good retrieval properties. Images

Images can be encoded with ResNet

Images can be represented with embeddings (read an introduction on that in my previous blogpost). Networks like ResNet or EfficientNet are really good feature extractors for images, and many pre-trained networks are available.

It is also not only possible to represent the whole image, but it’s also possible to use segmentation or object detection before applying the image encoder.

  • Segmentation can be used to extract part of the image pixel by pixel, it can be relevant to extract shirts and pants from a fashion picture

An important difference in the various image encoders is which loss they are using. Convolutional networks are often trained using triple loss, cross-entropy, or more recently contrastive loss. Each loss can provide different characteristics for embeddings: triple loss and contrastive loss try to put together similar items whereas cross-entropy will put together items of the same class. A lot of pre-trained models are trained on ImageNet with cross-entropy for image classification, but self-supervised learning (simclr byol) is quickly changing this to make unsupervised training without classification possible. In the near future, the best encoders may not need labeled data. This video tutorial from CVPR2020 is really good to go into detail about image retrieval.

Being able to encode images as vectors makes it possible to build many applications: anything that can be seen and watched is something that can be encoded. Fashion visual search, plant search, product search are all possible. Searching in movies and other video content also becomes possible. Text

Text can also be represented with embeddings. Text can be in various forms and lengths: words, sentences, documents. Modern deep learning can now represent most of those in powerful representation.

  • Word2vec: encoding words is one of the most popular forms of embeddings. From the context of words in documents, it’s possible to infer which words are closer in meaning to others. Illustrated word2vec and word2vec explained introduce the concept and methods well

In practice, some good encoders for text can be:

  • Glove word embeddings. This small example word-knn repo I built can help to start quickly

Being able to encode text as vectors make it possible to search for articles, movie description, book titles, paragraphs of Wikipedia, … A lot of content is available as text, so using that information for a retrieval system might be one of the first steps to try. And all the other content

Beyond text and image, audio content can also be encoded as embeddings (think of applications like Shazam). Jina examples and vectorhub provide a lot of good examples on how to encode embeddings using content

Jina examples: a collection of end to end retrieval systems VectorHub: a collection of models to encode data into embedding Scaling it up

In order to encode not only a few hundred but billions of embeddings, batching jobs such as spark or Pyspark can be really helpful. Most image and text models will be able to encode thousands of samples per second. Encoding a billion samples in an hour would require about 300 executors.

Encoding items by their content works well and scales to billions of items. But the content is not the only available data, let’s see how else items can be encoded.

3.2.2 Distribution: trends

Items such as clothes, movies and news are often present in websites visited by many users. The users interact with the items, like or dislike them, some items are popular, some items are seen only by parts of the users, and related items are often seen together. All of that is interaction data. This interaction data can be used to encode items. This is particularly useful to define embeddings for which the distance metric is based on how people interact with these items without needing any information about their content. SVD

A first method to build such behavior embedding is SVD: Singular value decomposition.

In the context when for a set of items (news, products, restaurants,…) user ratings are available, it is possible to compute user and item embeddings. The first step is to compute a user-item similarity matrix and using matrix factorization (SVD), user embeddings and item embeddings are computed. User item svd is a simple introduction to this process.

Another setting appears when it’s possible to observe co-occurrences between items. An example could be products (clothes, houses, laptops, …) that are viewed or are bought together by a user. These co-occurrences can be expressed with their PMI and this item-item matrix can be factored with SVD into item embeddings. These two blog posts provide a good introduction.

The SVD algorithm can be scaled up to sparse matrices of billions of lines and columns thanks to an efficient spark RSVD implementation

This way to encode items into embeddings is particularly powerful to encode user preferences and user behavior regarding items without needing any knowledge about these items. It means it can work across languages and for items where no content is available. Graph embeddings

A second way to encode items using their distribution is graph embeddings.

Many datasets can be represented as graphs. A good example is a knowledge graph. Wikidata and DBpedia for example represent knowledge in the real world as entities such as people, companies, countries, movies,… and relations between them such as a spouse, president, nationality, actor.

Wikidata represents knowledge about the world entities
Wikidata represents knowledge about the world entities

This forms a graph: entities are nodes in the graph, and these nodes are linked by edges which are relations.

There are many interesting algorithms and recent papers on graph embeddings and graph neural networks in general (this telegram channel is great to follow the topic), but a simple and scalable one is Pytorch Big Graph. This helper I built with a colleague can help build large graph datasets for PBG and view some knn results.

This representation of data as a graph can be used to build embeddings for nodes and transformation for edges that make it possible to go from one node to the other. The idea is to learn how to map a node to another node using both node embeddings and a learnable transformation for the edge. Such a transformation can be a translation. This gives surprisingly good results to predict the next edge.

PBG makes it possible to learn transformation between billion of embeddings

Pytorch Big Graph’s contribution is to partition the nodes and edges so that it’s possible to apply this learning for hundreds of millions of nodes and billions of edges.

Graphs are very versatile and can not only represent knowledge graphs but also links between users, products, restaurants, movies, … Using graph embeddings can be a good way to use the distribution of items to encode them.

3.2.3 Composition and multimodal

We now have embeddings of items from various perspectives, and they may offer complementary information. How a piece of clothes looks like, how users interact with it, and how it is described may all be relevant.

How can these embeddings be combined into a single one?

  • Concatenation: concatenating the embeddings is a basic method that works surprisingly well. For example, concatenating text and image embeddings makes it possible to search for an item using either its text either its image, or both.
ImageBert cross modal model
ImageBert cross modal model

Composition is a powerful tool and can be used to have a complete view of the items to encode.

3.2.4 Popularity

An important topic to consider in retrieval and recommendation systems is the popularity of items. Showing unpopular items often results in non-relevant results.

A simple way to address this issue is adding a popularity term to the embedding. For example, the last component can be the inverse of the log of the number of views of that item. That way the L2 distance between a query with a 0 in the popularity component will rank first the most popular items. This may help remove some of the unpopular items from the top results, but this isn’t perfect as the trade-off between similarity and popularity must be set manually.

Training the embeddings for a particular objective is a better way to solve this.

3.2.5 Training

To encode items, pre-trained content models and distribution-based methods work well, but to get embeddings tuned for a given task, the best way is to train a new model for this.

Many tasks can be considered to train embeddings: content-based, distribution-based, and for more specific objectives such as engagement, clicks, or maybe even user happiness. Image specific training

Image embeddings can be trained with tasks such as classification, identification, segmentation. Groknet is a good example of a large system to learn image embeddings with specific objectives… It learns on many disparate datasets for many different tasks.

Groknet: using a vision trunk to train embeddings with many kinds of datasets and losses.
Groknet: using a vision trunk to train embeddings with many kinds of datasets and losses.

FaceNet is another simple way to train image embeddings beyond classification. Triple loss allows it to learn a specific kind of image embeddings: face embeddings. This can be reused for other examples such as training bear embeddings Text specific training

Bert is a great example of a model that can be fine-tuned and reused for various objectives. In particular huggingface transformers library and the sentence transformers library based on it are great to fine-tune a text model for a specific use case. Hundreds of different transformer architectures are available there with dozens of training setups. Distribution: recommendation specific training

Fine-tuning embeddings for recommendation
Fine-tuning embeddings for recommendation

Another setup is a training model for recommendation. This can work really well to go beyond SVD and train product and user embeddings. The criteo deepr library with its accompanying blogpost is a good introduction on this topic. Tensorflow recommenders is another good entry point. And so much more

Beyond these specific training settings, training embeddings is the core of deep learning and representation learning. It can be applied in many environments and for many objectives. Some interesting examples are:

  • StarSpace a facebook project to learn embeddings from images, text, graph, and distribution for various objectives

To go more in-depth into this topic of training neural networks for information retrieval, these slides from ecir2018 are pretty complete.

3.3 Build relevant indices

Once embeddings are built, we need a way to look among them quickly. This can be achieved thanks to the k nearest neighbor algorithm. The simple version of it consists of computing a distance between one vector and all the vectors of the dataset. This can be improved greatly by using approximate k nearest neighbors algorithms.

Using the right implementation of knn indices, it’s possible to look for the nearest neighbors of an embedding from a set of billions of vectors in milliseconds. Thanks to quantization techniques, this can fit in only a few GB of memory.

3.3.1 What to index, how many indices?

In terms of performance the important things to optimize for when building indices are:

  • Latency: how much time does it take for an index to return results?

In terms of search relevance, it might be important to partition the embeddings in the right dimensions. For example, items can be partitioned by broad categories (pants, T-shirts,…) for a fashion visual search app. This partitioning can have consequences when building the indices, so it’s better to decide early.

3.3.2 Approximate knn and libraries

To choose the best way to build indices, the number of embeddings is a good discriminator.

Some examples of appropriate algorithms can be:

  • For less than a thousand embeddings, a brute force search makes sense

The faiss documentation provides a good explanation about these trade-offs.

Some algorithms to compute approximate knn are:

  • A naive knn: that can be implemented in O(nlog(k)) with a priority queue or O(n) with quickselect or introselect. To be able to compute this, it’s needed to store all the embeddings in memory.

To learn more about all the kinds of indices, I recommend reading this page of the faiss documentation. This tutorial from CVPR2020 goes into depth about these algorithms, I advise watching it if you’re interested to understand the finer details.

Libraries that implement these indices include:

  • Faiss A very broad library that implements many algorithms and clean interfaces to build them and search from them

As approximate knn is at the core of modern retrieval, it is an active research field. Notably, these recent papers introduce new methods that beat some metrics.

  • Scann from Google is a new method that is state of the art, beating HNSW in speed and recall using anisotropic quantization

I advise to start from faiss for its flexibility and try other libraries for specific needs.

3.3.3 Scaling

To be able to scale to many embeddings, core techniques are:

  • Quantization: embeddings can be compressed into indices of size 1/100 and more

To scale in terms of speed, the speed of the index is really important (algorithms like HNSW can help a lot), but serving is also crucial. More details on that in the serving section.

In practical terms, it’s possible to build an index of 200M embeddings with only 15GB of ram and latencies in milliseconds. This unlocks cheap retrieval systems at the scale of a single server. It also means that at the scale of a few millions of embeddings, knn indices can fit in only hundreds of megabytes of memory, which can fit in desktop machines and even mobile devices.

3.3.4 Knn indices as a database component

Databases exist in all kinds: relational databases, key/value stores, graph databases, document stores,… Each kind has many implementations. These databases bring convenient and efficient ways to store information and look into it. Most of these databases provide ways to add new information and query it over the network and use APIs in many languages. These databases at their core are using indices to make it fast to query them. Relational databases at their core use storage engines (such as InnoDB) which themselves use adapted indices. Key/value stores implement shared and distributed hash-based indices.

What if knn indices could be integrated into database implementations as just one more kind of index?

This is what is proposed by projects such as

  • An elastic search integration of HNSW: they propose to add hnsw as part of the general elastic search database. That makes it possible to combine knn search with strict queries, text queries, and joins provided by elastic search

Beyond these specific systems, what is I think really interesting in this concept, is the emphasis on making building knn indices a simple process that can be retriggered easily:

  • The addition of new data automatically triggers re-indexing, or directly adding embeddings to the existing indices

4. Search pipeline: from indices to semantic search systems

The search pipeline is the part of the system that usually runs in an online and low latency setup. Its goal is to retrieve relevant results for a given query. It’s important that it returns results in seconds or milliseconds depending on the constraints and to take low amounts of memory.

It is composed of a way to extract relevant data from a query, an encoder to transform that data to embeddings, a search system that uses indices built in the encoding pipeline, and finally a post-filtering system that will select the best results.

It can run on servers, but for a smaller amount of items (millions), it can also run directly on the client-side (browsers and small devices).

4.1 Extract queries parts

The first part of the system consists in taking a query as input and extracting relevant data from it to be able to encode it as query embeddings.

Some interesting examples of queries include looking for similar clothes, looking for a plant from a photo, looking for similar songs from an audio record. Another example can be a list of items seen by users.

The query can take any shape: an image, text, a sequence of items, audio, …

To be able to encode it in the best way, various techniques can be used:

  • For an image segmentation or object detection can be relevant: extracting only clothes from the photo of a person for example

Segmentation of people to extract clothes

4.2 Encode the query

Once relevant data is extracted from the query, each of those elements can be encoded. The way to encode it is usually similar to the way the embeddings from the indices are built, but it’s possible to apply techniques that are only relevant for the query.

For example:

  • An average of several elements to get results that are relevant for a list of items

For the use case of recommendation, it’s possible to train directly a model that will produce the best queries for a given objective, see this blogpost from criteo as an example.

4.3 Search the right indices

Depending on the kind of query, it might be relevant to build not only one index but several. For example, if the query has a filter part for a given category of item, it can make sense to build a specific index for this subset of embeddings.

Here selecting the Toyota index made it possible to return only relevant products from this brand.

4.4 Post filtering

Building several indices is one way to introduce strict filtering in a system, but another way is to do a large knn query and post-filter the results.

This can be relevant to avoid building too many indices

4.5 Serving

Finally, building a serving application makes it possible to expose the features to users or other systems. Thanks to fast approximate k nearest neighbors libraries, it is possible to have latencies in milliseconds and thousands of queries per second.

There are many options to build this. Depending on the state and scope of the projects, different technologies make sense:

  • To experiment initial, building a simple flask application with faiss can be done in as few as 20 lines of code

4.6 Evaluation

The evaluation of a semantic search system will depend a lot on the actual use case: a recommender system or an information retrieval system might have very different metrics. Metrics can be broadly split into two categories: online metrics and offline metrics. Online metrics can be measured only from the usage of the system, often in an A/B test setting. For recommendation, in particular, the click-through rate or directly the revenue can be considered, this document explains some of them more in detail. Offline metrics can be computed from offline datasets, and require some labels. These labels can be either implicit based on how users interact with the system (did the user click on this result?), or explicit (annotators providing labels). Some offline metrics are general to all retrieval systems, the Wikipedia page on this is pretty complete. Often used metrics include the recall which measures the number of relevant documents that are retrieved and the discounted cumulative gain which accounts for the rank of the retrieved items.

Before doing quantitative analysis, building a visualization tool and looking at the result often provide useful insights.

5. Practical solutions to build this easily

5.1 End to end open source solutions

Another way to start building semantic search applications is to use pre-existing open source solutions. Recently several organizations and people have built them. They vary in goals, some of them are specific to one modality, some of them only handle the knn part, and a few try to implement everything in a semantic search system. Let’s go over them.

Jina is an end to end semantic search open-source project built by the company of the same name. It’s not a single service but instead provides good APIs in python to define how to create encoders and indexers, and a YAML configuration system to define encoding and searching flows. It proposes to encapsulate each part of the system in docker containers. Dozens of encoders are already available, and several indexers are also built within its system. It also provides tutorials and examples on how to build specific semantic search systems.

I recommend reading jina big blog post. Its yaml ‘syntax and flexibility is what makes it the most interesting and powerful.

Milvus is a semantic search service focused on indexing, using faiss and nmslib. It provides features such as filtering and adding new items on the fly. The encoding part is mostly left for the users to provide. By integrating several aknn libraries, it tries to be efficient.

Reading Milvus scenario and Milvus blog can be good places to start.

Elastic search is a classical indexing database, often used for indexing categories and text. It now has a hnsw integration which provides automatic indexing and using all other strict indices of elastic search. If latencies in seconds are acceptable, this can be a good choice.

Vectorhub provides many encoders (image, audio, text, …) and an easy to use python module to retrieve them. It has rich documentation on building semantic systems and this can be a good starting place to explore encoders and learn more about semantic systems.

Haystack is an end to end system for question answering that uses knn for paragraph semantic indexing. It integrates with many text models (huggingface transformers, DPR, …) and several indexers to provide a fully-featured and flexible question answering pipeline. This can serve as a good example of a modality-specific (text question answering) semantic search system.

These projects are great to start on this topic, but they all have drawbacks. It can be in terms of scalability, flexibility, or technology choice. To go beyond exploration and small projects and build a larger scale or custom projects it is often useful to create custom systems using the building blocks mentioned here.

5.2 From scratch

Writing a semantic search system can seem to be a huge task due to all the different parts that are needed. In practice, the versatility and ease of use of libraries for both encoding and indexing make it possible to create an end to end system in a few lines of code. The image embeddings repo I built can be a simple way to start building a system from scratch. You can also check the small word knn I built as a simple example. The PBG helper I built with a colleague can also help bootstrap the use of graph embeddings. This video from CVPR2020 is another good tutorial to start on this.

From all the components I presented in this post, many are optional: a simple system only needs an encoder, an indexer, and a simple serving service.

Writing a system from scratch can be useful for learning about it, for experimentation, but also to integrate such a system into an existing production environment. Targeting the cloud can be a good option, see this tutorial from google cloud. It is also possible to build this kind of system in any kind of production system.

6. Conclusion

6.1 Beyond search: representation learning

software 2.0
software 2.0

Beyond building semantic search systems, retrieval systems and embeddings are part of the wider field of representation learning. Representation learning is a new way to build software, sometimes called software 2.0. Embeddings are the core parts of deep neural networks: they represent data as vectors in many layers to eventually predict new information.

Representation learning provides embeddings for retrieval and semantic search but in some cases, retrieval can also help representation learning:

  • Using retrieval as part of training: instead of pre-generating negative examples (for a system using a triple loss for example), a retrieval system can be used directly in training (this can be done for example with the integration between faiss and PyTorch)

6.2 What’s next?

As we saw in this post, retrieval systems are easy to build and really powerful, I encourage you to play with them and think about how they could be used for many applications and environments.

Semantic search and retrieval are active research areas and many new things will appear in the next years:

  • New encoders: 3D content is quickly being developed, both with libraries such as PyTorch 3d and impressive papers like PIFuHD

Thank you for reading until this point, and I hope you now also think that semantic search is amazing!



Machine learning engineer interested in representation learning, computer vision, natural language processing and programming (distributed systems, algorithms)

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Romain Beaumont

Machine learning engineer interested in representation learning, computer vision, natural language processing and programming (distributed systems, algorithms)