Semantic search with embeddings: index anything

  1. Motivation: why search, why semantic search?
    1.1 What is search?
    1.2 What is semantic search and what can be built with it?
  2. Overall design: how to build semantic search?
  3. Encoding pipeline: from items to vectors and indices
    3.1 Extract relevant data
    3.2 Encode the items
    3.2.1 Content
    3.2.2 Distribution
    3.2.3 Composition
    3.2.4 Popularity
    3.2.5 Training
    3.3 Build relevant indices
    3.3.1 What to index, how many indices?
    3.3.2 Approximate knn and libraries
    3.3.3 Scaling
    3.3.4 Knn indices as a database component
  4. Search pipeline: from indices to semantic search systems
    4.1 Extract query parts
    4.2 Encode the query
    4.3 Search the right indices
    4.4 Post filtering
    4.5 Serving
    4.6 Evaluation
  5. Practical solutions to build this easily
    5.1 End to end open source solutions
    5.2 From scratch
  6. Conclusion
    6.1 Beyond search: representation learning
    6.2 What’s next?

1. Motivation: why search, why semantic search?

1.1 What is search?

For thousands of years people have wanted to search among documents: think of the huge libraries containing millions of books. It was possible to find things in these books thanks to people carefully sorting them by author name, publication date, and so on. Indices of the books were carefully built, and it was possible to find books by asking experts.

1.2 What is semantic search and what can be built with it?

The main difference between classical search and semantic search is that the latter represents items as small dense vectors (embeddings) and matches them by similarity in that vector space, rather than by exact keywords. Many systems can be built on top of this:

Visual search can be used to look for plants (PlantNet), but also to look for clothes
  • A text search system takes as input a text query and returns results: searching for clothes, for songs, for news
  • A visual search system takes as input an image and returns items similar to this picture.
  • A recommendation system takes as input some context, user information and returns similar items optimizing for a given objective: recommending movies, cars, houses
  • Social networks, advertisement networks, specialized search engines (product search) all use retrieval techniques to provide the best information

2. Overall design: how to build semantic search?

At a high level, a semantic search system is made of two pipelines: an encoding pipeline that turns items into vectors and groups them into indices, and a search pipeline that encodes an incoming query and looks it up in those indices. The next two sections go through each of them.

3. Encoding pipeline: from items to vectors and indices

The first step to build a semantic retrieval system is to encode items into small vectors (hundreds of dimensions). This is possible for many items and can then be used to index them and search amongst them efficiently.

3.1 Extract relevant data

Retrieval systems can encode items from many different aspects, so it’s important to think about what to encode. Some examples of items to encode are clothes, toys, animals, news, songs, movies, recipes. Each of them has different characteristics: they can be expressed by how they look, how they can be described, and how they appear among other items.

Star Wars character C-3PO
This Star Wars character, C-3PO, can be encoded with a picture of it, a description, how it appears in a graph (it’s in a Star Wars movie, appearing at the date of these movies, …), how popular it is, but also the fact that it often appears along with R2-D2, and that it has a robotic voice. All this information can be relevant. Which one to pick may affect the performance of the system a lot.

3.2 Encode the items

3.2.1 Content

Items can be encoded based on their content. Clothes can be well represented with images. Sounds are identified by their audio content. News can be understood using their text. Deep learning models are very good at producing representations of content that have good retrieval properties.

3.2.1.1 Images

Images can be encoded with convolutional networks such as ResNet (a minimal sketch follows this list)
  • Segmentation can be used to extract parts of the image pixel by pixel; for example, extracting the shirts and pants from a fashion picture
  • Detection is useful to extract rectangular zones from the images
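
As a minimal sketch of content encoding, here is how image embeddings could be extracted with a pretrained ResNet in PyTorch (the model and preprocessing choices are illustrative, not a prescription from this post):

    import torch
    import torchvision.models as models
    import torchvision.transforms as transforms
    from PIL import Image

    # Load a pretrained ResNet and drop the classification head so the
    # pooled features of the last layer serve as the embedding.
    resnet = models.resnet50(pretrained=True)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-1])
    encoder.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def embed_image(path: str) -> torch.Tensor:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            features = encoder(image)   # shape (1, 2048, 1, 1)
        return features.flatten(1).squeeze(0)  # 2048-dimensional embedding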

3.2.1.2 Text

Text can also be represented with embeddings. Text comes in various forms and lengths: words, sentences, documents. Modern deep learning can now turn most of these into powerful representations.

  • Word2vec: encoding words is one of the most popular forms of embeddings. From the context of words in documents, it’s possible to infer which words are closer in meaning to others. Illustrated word2vec and word2vec explained introduce the concept and methods well
  • Transformers are a newer method that makes it possible to encode whole sentences by better taking into account the dependencies between the many words of a sentence. A few years ago they became state of the art on many tasks. The illustrated transformer is an interesting introduction to them.
  • BERT architecture: finally, the BERT architecture is a special kind of transformer that can be trained well in a multitask setting. Its release was called the ImageNet moment of NLP. The illustrated BERT is a good introduction to it.
  • GloVe word embeddings: this small word-knn example repo I built can help to start quickly
  • The LaBSE model for sentence embeddings is a pre-trained BERT model which can encode embeddings from as many as 109 languages in a single space
  • Document embeddings can be represented as the average of their sentence embeddings (see the sketch after this list)
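
As a small sketch, encoding sentences with LaBSE through the sentence-transformers library, then averaging them for a document embedding (the model name and inputs are illustrative):

    from sentence_transformers import SentenceTransformer

    # LaBSE encodes sentences from over a hundred languages into a single space.
    model = SentenceTransformer("sentence-transformers/LaBSE")

    sentences = ["a red dress", "une robe rouge"]  # English and French
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # A simple document embedding: the average of its sentence embeddings.
    document_embedding = embeddings.mean(axis=0)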

3.2.1.3 And all the other content

Beyond text and images, audio content can also be encoded as embeddings (think of applications like Shazam). Jina examples and VectorHub provide a lot of good examples of how to encode content into embeddings.

Jina examples: a collection of end to end retrieval systems
VectorHub: a collection of models to encode data into embeddings

3.2.1.4 Scaling it up

In order to encode not just a few hundred embeddings but billions, batch jobs built with tools such as Spark or PySpark can be really helpful. Most image and text models can encode thousands of samples per second per executor; at that rate, encoding a billion samples in an hour requires about 300 executors.
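
A sketch of what such a batch job could look like with PySpark (the input parquet file, its column names, and the model choice are assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("encode-items").getOrCreate()
    items = spark.read.parquet("items.parquet")  # assumed: columns "id" and "text"

    def encode_partition(rows):
        # Load the model once per partition rather than once per row.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("sentence-transformers/LaBSE")
        rows = list(rows)
        embeddings = model.encode([row["text"] for row in rows], batch_size=256)
        for row, embedding in zip(rows, embeddings):
            yield (row["id"], embedding.tolist())

    embeddings = items.rdd.mapPartitions(encode_partition).toDF(["id", "embedding"])
    embeddings.write.parquet("embeddings.parquet")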

3.2.2 Distribution: trends

Items such as clothes, movies and news are often present on websites visited by many users. The users interact with the items, like or dislike them; some items are popular, some items are seen by only a subset of the users, and related items are often seen together. All of that is interaction data. This interaction data can be used to encode items. This is particularly useful to define embeddings for which the distance metric is based on how people interact with these items, without needing any information about their content.

3.2.2.1 SVD

A first method to build such behavior embeddings is SVD: singular value decomposition. Factorizing the user-item interaction matrix yields an embedding per item (and per user) whose distances reflect how users interact with the items.
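
A toy sketch of truncated SVD on a sparse interaction matrix with SciPy (the matrix here is random; in practice it would hold clicks, likes, or views):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Toy user-item interaction matrix: 1000 users x 500 items.
    interactions = sparse_random(1000, 500, density=0.01, format="csr")

    # Truncated SVD keeping the top 64 singular values.
    u, s, vt = svds(interactions, k=64)

    # Item embeddings: the right singular vectors scaled by the singular values.
    item_embeddings = vt.T * s  # shape (500, 64)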

3.2.2.2 Graph embeddings

A second way to encode items using their distribution is graph embeddings: the nodes of a large graph (items, but also related entities) are embedded so that connected nodes end up close in the embedding space.

Wikidata represents knowledge about world entities
PBG (PyTorch-BigGraph) makes it possible to learn transformations between billions of embeddings

3.2.3 Composition and multimodal

We now have embeddings of items from various perspectives, and they may offer complementary information. How a piece of clothing looks, how users interact with it, and how it is described may all be relevant.

  • Concatenation: concatenating the embeddings is a basic method that works surprisingly well. For example, concatenating text and image embeddings makes it possible to search for an item using either its text, its image, or both (see the sketch below)
  • Multimodal models: vision and language deep learning is becoming really popular, and many models (imagebert, vilbert, uniter, vl-bert, see this interesting demo) propose to learn from both vision and language to produce cross-modal representations
  • Fine-tuning a network for a specific task using several embeddings
ImageBert cross modal model
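
A minimal sketch of the concatenation approach, normalizing each modality first so that neither dominates the distance (the function name is illustrative):

    import numpy as np

    def compose(text_embedding: np.ndarray, image_embedding: np.ndarray) -> np.ndarray:
        # Normalize each modality, then concatenate: distances in the
        # combined space account for both text and image similarity.
        text = text_embedding / np.linalg.norm(text_embedding)
        image = image_embedding / np.linalg.norm(image_embedding)
        return np.concatenate([text, image])

With inner-product search, querying with only one modality can then be done by zeroing out the other part of the query vector.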

3.2.4 Popularity

An important topic to consider in retrieval and recommendation systems is the popularity of items: surfacing unpopular items often makes results feel irrelevant.

3.2.5 Training

To encode items, pre-trained content models and distribution-based methods work well, but to get embeddings tuned for a given task, the best way is to train a new model for this.

3.2.5.1 Image specific training

Image embeddings can be trained with tasks such as classification, identification, and segmentation. GrokNet is a good example of a large system built to learn image embeddings with specific objectives: it learns on many disparate datasets for many different tasks.

GrokNet: using a vision trunk to train embeddings with many kinds of datasets and losses.

3.2.5.2 Text specific training

BERT is a great example of a model that can be fine-tuned and reused for various objectives. In particular, the Hugging Face transformers library and the sentence-transformers library based on it are great for fine-tuning a text model for a specific use case. Hundreds of different transformer architectures are available there, with dozens of training setups.
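
A minimal fine-tuning sketch with sentence-transformers, using in-batch negatives (the model name and training pairs are illustrative, not from this post):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    # Hypothetical (query, relevant document) pairs for the target task.
    train_examples = [
        InputExample(texts=["red summer dress", "a light red dress for the summer"]),
        InputExample(texts=["running shoes", "lightweight shoes for jogging"]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

    # In-batch negatives: the other pairs in the batch act as negatives.
    train_loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)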

3.2.5.3 Distribution: recommendation specific training

Fine-tuning embeddings for recommendation

3.2.5.4 And so much more

Beyond these specific training settings, training embeddings is the core of deep learning and representation learning. It can be applied in many environments and for many objectives. Some interesting examples are:

  • StarSpace: a Facebook project to learn embeddings from images, text, graphs, and distributions for various objectives
  • Recipe embeddings: an example of a project to learn recipe embeddings from ingredients, instructions, and images

3.3 Build relevant indices

Once embeddings are built, we need a way to search among them quickly. This can be achieved with the k nearest neighbors (knn) algorithm. Its simple version consists of computing a distance between one vector and all the vectors of the dataset, which approximate k nearest neighbors algorithms improve on greatly.
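
The brute-force version is a few lines of NumPy (cosine similarity here; the choice of metric is up to the application):

    import numpy as np

    def knn(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
        # Normalize so that the dot product is the cosine similarity.
        query = query / np.linalg.norm(query)
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        scores = embeddings @ query
        # argpartition finds the top-k in O(n); only those k are then sorted.
        top_k = np.argpartition(-scores, k)[:k]
        return top_k[np.argsort(-scores[top_k])]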

3.3.1 What to index, how many indices?

In terms of performance the important things to optimize for when building indices are:

  • Latency: how much time does it take for an index to return results?
  • Recall: how many of the results of a brute force knn are found in the index result?
  • Memory: how big is the index, how much memory is needed to keep it in RAM?

3.3.2 Approximate knn and libraries

To choose the best way to build indices, the number of embeddings is a good discriminator.

  • For less than a thousand embeddings, a brute force search makes sense
  • For less than a million, a fast but not memory-efficient algorithm (such as HNSW) is appropriate
  • For less than a billion, quantization (using k-means and IVF) becomes important
  • At the trillion scale, the only solution is on-disk indices

The main algorithms behind these regimes are:

  • Naive knn: can be implemented in O(n log(k)) with a priority queue or O(n) with quickselect or introselect. To compute it, all the embeddings must be kept in memory
  • HNSW: an algorithm that builds a graph of neighbors. Search is O(log(n)) but not exact. It takes about twice the memory of the embeddings because it needs to store the graph
  • IVF: the inverted file algorithm splits the embedding space into several parts and uses k-means to approximate the embeddings. It’s slower than HNSW, but it allows decreasing the memory required by the index as much as needed

Popular libraries implementing these algorithms include:

  • Faiss: a very broad library that implements many algorithms and clean interfaces to build indices and search from them
  • Hnswlib: currently the fastest implementation of HNSW; highly specialized and optimized
  • Annoy: another knn algorithm, implemented by Spotify
  • ScaNN from Google: a new method that is state of the art, beating HNSW in speed and recall using anisotropic quantization
  • Catalyzer from Facebook: proposes to train the quantizer with a neural network for a specific task
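
As a sketch of how these libraries are used, here is a quantized faiss index (IVF with product quantization; the sizes and parameters are illustrative):

    import faiss
    import numpy as np

    dimension = 128
    embeddings = np.random.rand(100_000, dimension).astype("float32")

    # IVF with product quantization: 1024 coarse cells, each vector
    # compressed to 32 bytes (32 sub-quantizers of 8 bits each).
    quantizer = faiss.IndexFlatL2(dimension)
    index = faiss.IndexIVFPQ(quantizer, dimension, 1024, 32, 8)
    index.train(embeddings)
    index.add(embeddings)

    index.nprobe = 16  # number of cells visited at query time (recall/speed knob)
    distances, ids = index.search(embeddings[:5], 10)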

3.3.3 Scaling

To be able to scale to many embeddings, core techniques are:

  • Quantization: embeddings can be compressed into indices 100 times smaller or more
  • Sharding: partitioning the items along some dimension makes it possible to store the indices on different machines (see the sketch below)
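
Quantization was sketched above with IndexIVFPQ; here is a local sketch of sharding with faiss (in production each shard would live on a different machine; here they share a process for illustration):

    import faiss
    import numpy as np

    dimension = 128
    data = np.random.rand(100_000, dimension).astype("float32")

    # Two shards, each holding half the dataset with its original ids.
    shards = []
    for part in np.array_split(np.arange(len(data)), 2):
        shard = faiss.IndexIDMap(faiss.IndexFlatL2(dimension))
        shard.add_with_ids(data[part], part.astype("int64"))
        shards.append(shard)

    # Query every shard, then merge the partial results by distance.
    query = np.random.rand(1, dimension).astype("float32")
    results = []
    for shard in shards:
        distances, ids = shard.search(query, 10)
        results.extend(zip(distances[0], ids[0]))
    top10 = sorted(results)[:10]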

3.3.4 Knn indices as a database component

Databases come in many kinds: relational databases, key/value stores, graph databases, document stores, … Each kind has many implementations. These databases bring convenient and efficient ways to store information and look into it. Most of them provide ways to add new information and query it over the network, with APIs in many languages. At their core, these databases use indices to make querying fast. Relational databases use storage engines (such as InnoDB) which themselves use adapted indices; key/value stores implement sharded and distributed hash-based indices.

  • An Elasticsearch integration of HNSW: it adds hnsw to the general Elasticsearch database, which makes it possible to combine knn search with the strict queries, text queries, and joins provided by Elasticsearch
  • Unicorn: a private system from Facebook that integrates knn search in a graph database, so that queries in that graph can have parts using knn queries

A knn index fully integrated as a database component could also offer:

  • Automatic re-indexing, or direct addition of embeddings to the existing indices, when new data is added
  • Automatic choice of the right kind of knn index based on the specific constraints of the system

4. Search pipeline: from indices to semantic search systems

The search pipeline is the part of the system that usually runs in an online, low latency setup. Its goal is to retrieve relevant results for a given query. It’s important that it return results within seconds or milliseconds, depending on the constraints, while using a limited amount of memory.

4.1 Extract query parts

The first part of the system consists of taking a query as input and extracting relevant data from it, to be able to encode it as query embeddings.

  • For an image, segmentation or object detection can be relevant: extracting only the clothes from the photo of a person, for example
  • For text, it can be relevant to extract named entities from the query, and it could make sense to apply query expansion on it to add relevant terms or fix typos
  • For a list of items, clustering the items to select only a relevant subset could help

4.2 Encode the query

Once relevant data is extracted from the query, each of those elements can be encoded. The way to encode it is usually similar to the way the embeddings from the indices are built, but it’s possible to apply techniques that are only relevant for the query.

  • An average of several elements, to get results that are relevant for a list of items (see the sketch after this list)
  • Cluster the points of the query and pick a cluster as the query
  • Use more complex models to generate appropriate queries in the same space, using transformer models for question answering (see DPR), or transformations from graph embeddings (see PBG) for example
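
For example, the "list of items" case can be handled with a normalized average (a minimal sketch; the function name is illustrative):

    import numpy as np

    def encode_list_query(item_embeddings: np.ndarray) -> np.ndarray:
        # Average the embeddings of the items in the list, then normalize,
        # so that the query lives in the same space as the index.
        query = item_embeddings.mean(axis=0)
        return query / np.linalg.norm(query)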

4.3 Search the right indices

Depending on the kinds of queries, it might be relevant to build not one index but several. For example, if queries can include a filter for a given category of items, it can make sense to build a specific index for that subset of embeddings.

4.4 Post filtering

Building several indices is one way to introduce strict filtering in a system; another is to run a larger knn query and post-filter the results, as sketched below.
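
A sketch of post-filtering on top of a faiss-like index (the overshoot factor is a knob to tune; allowed_ids stands for whatever strict filter the application needs):

    import numpy as np

    def search_with_filter(index, query, allowed_ids, k=10, overshoot=10):
        # Ask the index for more results than needed, then keep only
        # the ones passing the strict filter.
        query = query.reshape(1, -1).astype("float32")
        distances, ids = index.search(query, k * overshoot)
        kept = [int(i) for i in ids[0] if int(i) in allowed_ids]
        return kept[:k]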

4.5 Serving

Finally, building a serving application makes it possible to expose the features to users or other systems. Thanks to fast approximate k nearest neighbors libraries, it is possible to have latencies in milliseconds and thousands of queries per second.

  • To experiment initially, building a simple Flask application with faiss can be done in as few as 20 lines of code (see the sketch after this list)
  • Serving Flask with a proper server such as gunicorn with gevent can be enough to reach millisecond latencies at thousands of qps
  • To get even more performance, a serving service can be built in a native language such as Rust or C++. The benefit of using a native language for this kind of application is avoiding GC costs: since the knn index itself is built in C++, only the serving code needs to be optimized
  • Aknn libraries are most often built in C++, but bindings can be written for many languages (Java, Python, C#), manually or with SWIG. For integration with an existing application, this can be the most relevant option in some cases
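
For reference, such a small Flask application could look like this (the index file name and request format are assumptions):

    import faiss
    import numpy as np
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    index = faiss.read_index("index.faiss")  # assumed prebuilt index file

    @app.route("/search", methods=["POST"])
    def search():
        query = np.array(request.json["embedding"], dtype="float32").reshape(1, -1)
        distances, ids = index.search(query, 10)
        return jsonify({"ids": ids[0].tolist(), "distances": distances[0].tolist()})

    if __name__ == "__main__":
        app.run(port=5000)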

4.6 Evaluation

The evaluation of a semantic search system depends a lot on the actual use case: a recommender system and an information retrieval system might call for very different metrics. Metrics can be broadly split into two categories: online metrics and offline metrics.

Online metrics can be measured only from the usage of the system, often in an A/B test setting. For recommendation in particular, the click-through rate or directly the revenue can be considered; this document explains some of them in more detail.

Offline metrics can be computed from offline datasets and require labels. These labels can be implicit, based on how users interact with the system (did the user click on this result?), or explicit (annotators providing labels). Some offline metrics are general to all retrieval systems; the Wikipedia page on this is pretty complete. Often used metrics include the recall, which measures the proportion of relevant documents that are retrieved, and the discounted cumulative gain (DCG), which accounts for the rank of the retrieved items.
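
For concreteness, recall@k and DCG can be computed in a few lines (a sketch; the relevance values are graded labels for the retrieved list):

    import numpy as np

    def recall_at_k(retrieved, relevant, k):
        # Fraction of the relevant items that appear in the top-k results.
        return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

    def dcg_at_k(relevances, k):
        # Discounted cumulative gain: relevance discounted by log2 of rank.
        relevances = np.asarray(relevances[:k], dtype=float)
        discounts = np.log2(np.arange(2, len(relevances) + 2))
        return float(np.sum(relevances / discounts))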

5. Practical solutions to build this easily

5.1 End to end open source solutions

Another way to start building semantic search applications is to use pre-existing open source solutions; several organizations and people have built them recently. They vary in goals: some are specific to one modality, some only handle the knn part, and a few try to implement everything a semantic search system needs.

5.2 From scratch

Writing a semantic search system can seem like a huge task because of all the different parts that are needed. In practice, the versatility and ease of use of libraries for both encoding and indexing make it possible to create an end to end system in a few lines of code. The image embeddings repo I built can be a simple way to start building a system from scratch. You can also check the small word-knn repo I built as a simple example. The PBG helper I built with a colleague can also help bootstrap the use of graph embeddings. This video from CVPR 2020 is another good tutorial to start on this.

6. Conclusion

6.1 Beyond search: representation learning

Software 2.0

Embeddings are not only useful for search: representation learning is at the core of modern deep learning, and retrieval systems can themselves become components of other machine learning pipelines, for example:
  • Using retrieval as part of training: instead of pre-generating negative examples (for a system using a triplet loss, for example), a retrieval system can be used directly during training (this can be done, for example, with the integration between faiss and PyTorch)
  • Using retrieval as a way to build datasets: similar examples can be retrieved as part of a data augmentation pipeline

6.2 What’s next?

As we saw in this post, retrieval systems are easy to build and really powerful. I encourage you to play with them and think about how they could be used in many applications and environments.

  • New encoders: 3D content is developing quickly, both with libraries such as PyTorch3D and impressive papers like PIFuHD
  • Quantization and indexing are also improving quickly, with papers such as ScaNN and Catalyzer
  • End to end training and multimodal representations are progressing fast, with vision and language seeing a lot of progress: are we heading towards a generalized way to build any representation?
  • Where can retrieval systems live? Until now they mostly lived in servers, but with the progress of aknn and quantization, what applications could they unlock on user devices?
  • Semantic search systems are progressing fast too: hundreds of companies are building them, and several good open-source systems are starting to emerge
