Semantic search with embeddings: index anything

  1. Motivation: why search, why semantic search?
    1.1 What is search?
    1.2 What is semantic search and what can be built with it?
  2. Overall design: how to build semantic search?
  3. Encoding pipeline: from items to vectors and indices
    3.1 Extract relevant data
    3.2 Encode the items
    3.2.1 Content
    3.2.2 Distribution
    3.2.3 Composition
    3.2.4 Popularity
    3.2.5 Training
    3.3 Build relevant indices
    3.3.1 What to index, how many indices?
    3.3.2 Approximate knn and libraries
    3.3.3 Scaling
    3.3.4 Knn indices as a database component
  4. Search pipeline: from indices to semantic search systems
    4.1 Extract query parts
    4.2 Encode the query
    4.3 Search the right indices
    4.4 Post filtering
    4.5 Serving
    4.6 Evaluation
  5. Practical solutions to build this easily
    5.1 End to end open source solutions
    5.2 From scratch
  6. Conclusion
    6.1 Beyond search: representation learning
    6.2 What’s next?

1. Motivation: why search, why semantic search?

1.1 What is search?

1.2 What is semantic search and what can be built with it?

Visual search can be used to look for plants (with PlantNet), but also to look for clothes
  • A text search system takes as input a text query and returns results: searching for clothes, for songs, for news
  • A visual search system takes as input an image and returns items that are similar to this picture.
  • A recommendation system takes as input some context and user information, and returns similar items optimized for a given objective: recommending movies, cars, houses
  • Social networks, advertisement networks, specialized search engines (product search) all use retrieval techniques to provide the best information

2. Overall design: how to build semantic search?

3. Encoding pipeline: from items to vectors and indices

3.1 Extract relevant data

Star Wars character C-3PO
This Star Wars character, C-3PO, can be encoded with a picture of it, a description, the way it appears in a knowledge graph (it appears in the Star Wars movies, released on specific dates, …), how popular it is, but also the fact that it often appears along with R2-D2 and that it has a robotic voice. All of this information can be relevant, and which parts you pick can greatly affect the performance of the system.

3.2 Encode the items

3.2.1 Content

3.2.1.1 Images

Images can be encoded with ResNet (a minimal sketch follows the list below)
  • Segmentation can be used to extract parts of the image pixel by pixel; for example, it can be relevant to extract shirts and pants from a fashion picture
  • Detection is useful to extract rectangular zones from the image
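To make this concrete, here is a minimal sketch of encoding an image into an embedding with a pretrained ResNet-50 from torchvision; the file path and the choice of ResNet-50 are illustrative assumptions, not a prescription.

```python
# Minimal sketch: turn an image into an embedding with a pretrained ResNet-50.
# The image path is a placeholder; newer torchvision versions use the weights= argument.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()  # drop the classification head, keep the 2048-d pooled features
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_image(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).squeeze(0)  # shape: (2048,)

embedding = encode_image("example.jpg")  # placeholder file name
```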

3.2.1.2 Text

  • Word2vec: encoding words is one of the most popular forms of embeddings. From the context of words in documents, it’s possible to infer which words are closer in meaning to others. Illustrated word2vec and word2vec explained introduce the concept and methods well
  • Transformers are a newer method that makes it possible to encode whole sentences by better taking into account the dependencies between the many words of a sentence. A few years ago they became state of the art in many tasks. The illustrated transformer is an interesting introduction to them.
  • BERT architecture: finally, the BERT architecture is a special kind of transformer that can be trained well in a multitask setting. It has been called the ImageNet moment of NLP. Illustrated bert is a good introduction to it.
BERT
  • GloVe word embeddings. This small example word-knn repo I built can help to start quickly
  • The LaBSE model for sentence embeddings is a pre-trained BERT model that can encode sentences from as many as 109 languages into a single space
  • Document embeddings can be represented as the average of their sentence embeddings (see the sketch after this list).
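As a concrete illustration, here is a minimal sketch of encoding sentences with LaBSE through the sentence-transformers library and averaging them into a document embedding; the model name is the published LaBSE checkpoint, and the example sentences are made up.

```python
# Minimal sketch: multilingual sentence embeddings with LaBSE (sentence-transformers),
# plus a document embedding built as the average of its sentence embeddings.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = ["a red dress with flowers", "une robe rouge à fleurs"]  # same meaning, two languages
embeddings = model.encode(sentences, normalize_embeddings=True)      # shape: (2, 768)

doc_embedding = np.mean(embeddings, axis=0)  # simple document representation
```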

3.2.1.3 And all the other content

  • Jina examples: a collection of end-to-end retrieval systems
  • VectorHub: a collection of models to encode data into embeddings

3.2.1.4 Scaling it up

3.2.2 Distribution: trends

3.2.2.1 SVD

3.2.2.2 Graph embeddings

Wikidata represents knowledge about the world’s entities
PBG makes it possible to learn transformations between billions of embeddings

3.2.3 Composition and multimodal

  • Concatenation: concatenating the embeddings is a basic method that works surprisingly well. For example, concatenating text and image embeddings makes it possible to search for an item using either its text, its image, or both (see the sketch below).
  • Multimodal models: vision and language deep learning is becoming really popular, and many models (imagebert, vilbert, uniter, vl-bert, see this interesting demo) propose to learn from both images and text, to produce cross-modal representations.
  • Fine-tuning a network for a specific task using several embeddings
ImageBERT cross-modal model
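Here is a minimal sketch of the concatenation approach, assuming the encode_image function and the sentence model from the earlier sketches; the file name and the text are placeholders.

```python
# Minimal sketch: compose a multimodal item embedding by concatenating
# L2-normalized image and text embeddings (encode_image and model come from
# the earlier sketches; the inputs are placeholders).
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v) + 1e-12)

image_emb = l2_normalize(encode_image("c3po.jpg").numpy())           # 2048-d image part
text_emb = l2_normalize(model.encode(["a gold protocol droid"])[0])  # 768-d text part

# At query time, the image part or the text part can be zeroed out,
# which is what makes it possible to search by text, by image, or by both.
item_emb = np.concatenate([image_emb, text_emb])
```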

3.2.4 Popularity

3.2.5 Training

3.2.5.1 Image specific training

Groknet: using a vision trunk to train embeddings with many kinds of datasets and losses.

3.2.5.2 Text specific training

3.2.5.3 Distribution: recommendation specific training

Fine-tuning embeddings for recommendation

3.2.5.4 And so much more

  • StarSpace: a Facebook project to learn embeddings from images, text, graphs, and distributions for various objectives
  • Recipe embeddings: an example of a project to learn recipe embeddings from ingredients, instructions, and images

3.3 Build relevant indices

3.3.1 What to index, how many indices?

  • Latency: how much time does it take for an index to return results?
  • Recall: how many of the results of a brute-force knn are found in the index results? (see the sketch after this list)
  • Memory: how big is the index, and how much memory is needed to keep it in RAM?
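For recall in particular, a small sketch of how it can be measured against brute-force knn; the two arrays of ids are placeholders for the results of your approximate index and of an exact search over the same queries.

```python
# Minimal sketch: recall@k of an approximate index against exact brute-force knn.
# approx_ids and exact_ids are arrays of neighbor ids returned by the two searches
# for the same queries (placeholders for your own index calls).
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int) -> float:
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return hits / (len(exact_ids) * k)  # fraction of true neighbors the index found
```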

3.3.2 Approximate knn and libraries

  • For less than a thousand embeddings, a brute force search makes sense
  • For less than a million, a fast but not memory-efficient algorithm (such as HNSW) is appropriate
  • For less than a billion, quantization (using k-means and IVF) becomes important
  • For a trillion examples, the only solution is on-disk indices
  • A naive knn can be implemented in O(n log(k)) with a priority queue, or O(n) with quickselect or introselect. To compute this, all the embeddings must be stored in memory.
  • HNSW: an algorithm that builds a graph of neighbors. It’s O(log(N)) in search but is not exact. It takes about twice the memory of the embeddings because it needs to store the graph
  • IVF: the inverted file algorithm consists of splitting the embedding space into several parts and using k-means to find an approximation of the embeddings. It’s slower than HNSW, but it makes it possible to decrease the memory required by the index as much as needed.
  • Faiss: a very broad library that implements many algorithms, with clean interfaces to build indices and search them (a minimal example follows this list)
  • Hnswlib is currently the fastest implementation of HNSW. It is highly specialized and optimized
  • Annoy is another knn algorithm, implemented by Spotify
  • Scann from Google is a new method that is state of the art, beating HNSW in speed and recall using anisotropic quantization
  • Catalyzer, from Facebook, proposes to train the quantizer with a neural network for a specific task
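As an illustration, a minimal sketch of building and querying an HNSW index with Faiss; the dimension, the dataset size, and the HNSW parameter are arbitrary assumptions.

```python
# Minimal sketch: approximate knn with a Faiss HNSW index.
# Dimension, dataset size, and parameters are illustrative assumptions.
import faiss
import numpy as np

d = 128
embeddings = np.random.rand(100_000, d).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(embeddings)  # on normalized vectors, L2 ranking matches cosine ranking

index = faiss.IndexHNSWFlat(d, 32)  # 32 neighbors per node in the HNSW graph
index.add(embeddings)

query = embeddings[:1]              # in practice: the encoded query
distances, ids = index.search(query, 10)
print(ids[0])                       # ids of the 10 approximate nearest neighbors
```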

3.3.3 Scaling

  • Quantization: embeddings can be compressed into indices 100 times smaller or more (see the sketch below)
  • Sharding: partitioning the items along some dimension makes it possible to store the indices on different machines
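A minimal sketch of what quantization looks like in practice with a Faiss IVF + product quantization index; the parameters (number of cells, code size, nprobe) are illustrative assumptions.

```python
# Minimal sketch: compress embeddings with IVF + product quantization in Faiss.
# With 16 sub-quantizers of 8 bits, each 128-d float32 vector (512 bytes)
# is stored as a 16-byte code.
import faiss
import numpy as np

d = 128
embeddings = np.random.rand(100_000, d).astype("float32")

nlist = 1024                                          # number of IVF cells
quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer assigning vectors to cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, 16, 8)  # 16 sub-quantizers, 8 bits each

index.train(embeddings)                               # learn the k-means cells and the PQ codebooks
index.add(embeddings)

index.nprobe = 16                                     # cells visited per query: speed/recall trade-off
distances, ids = index.search(embeddings[:1], 10)
```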

3.3.4 Knn indices as a database component

  • An Elasticsearch integration of HNSW: it proposes to add HNSW as part of the general Elasticsearch database. That makes it possible to combine knn search with the strict queries, text queries, and joins provided by Elasticsearch
  • Unicorn, a private system from Facebook, enables integrating knn search in a graph database. As a consequence, queries in that graph can have parts using knn queries.
  • The addition of new data automatically triggers re-indexing, or directly adding embeddings to the existing indices
  • Automatically choosing the right kind of knn index based on the specific constraint of the system

4. Search pipeline: from indices to semantic search systems

4.1 Extract query parts

  • For an image, segmentation or object detection can be relevant: extracting only the clothes from the photo of a person, for example
  • For text, it can be relevant to extract named entities from the query, and it could make sense to apply query expansion to add relevant terms or fix typos (see the sketch after this list)
  • For a list of items, clustering the items to select only a relevant subset could help
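For the named-entity part, a small sketch with spaCy; the model name is the standard small English model and the query is made up.

```python
# Minimal sketch: extract named entities from a text query with spaCy.
# "en_core_web_sm" is the standard small English model (an assumption about your setup).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("red nike running shoes for women")
entities = [(ent.text, ent.label_) for ent in doc.ents]  # e.g. brand-like entities
print(entities)
```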

4.2 Encode the query

  • An average of several elements, to get results that are relevant for a list of items (see the sketch after this list)
  • Cluster the points of the query and pick a cluster as the query
  • Use more complex models to generate appropriate queries in the same space, using transformer models for question answering (see DPR), or transformations from graph embeddings (see PBG) for example
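A minimal sketch of the averaging option, reusing the embeddings and index from the earlier sketches (both are assumptions).

```python
# Minimal sketch: encode a list of items as the (re-normalized) average of their
# embeddings and use it as the query. `embeddings` and `index` are assumed to be
# the ones built in the previous sections.
import numpy as np

def encode_list_query(item_embeddings: np.ndarray) -> np.ndarray:
    query = item_embeddings.mean(axis=0)
    return query / (np.linalg.norm(query) + 1e-12)  # keep the query on the unit sphere

query = encode_list_query(embeddings[:5]).astype("float32").reshape(1, -1)
distances, ids = index.search(query, 10)
```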

4.3 Search the right indices

4.4 Post filtering

4.5 Serving

  • To experiment initially, building a simple Flask application with Faiss can be done in as few as 20 lines of code (a minimal sketch follows this list)
  • Serving Flask with a proper server such as gunicorn with gevent can be enough to reach millisecond latencies at thousands of QPS
  • To get even more performance, the serving service can be built in a native language like Rust or C++. The benefit of using a native language for this kind of application is to avoid GC costs; since the knn index itself is built in C++, only the serving code needs to be optimized.
  • Aknn libraries are most often built in C++, but bindings can be written for many languages (Java, Python, C#), manually or with SWIG. For integration with an existing application, this can be the most relevant option in some cases.
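Along the lines of the "20 lines of code" mentioned above, a minimal sketch of a Flask endpoint in front of a knn index; encode and index are assumed to be the query encoder and the index built in the earlier sketches.

```python
# Minimal sketch of a Flask serving endpoint in front of a knn index.
# `encode` (query encoder) and `index` (Faiss index) are assumptions from earlier sections.
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    query_emb = encode(request.json["query"]).astype("float32").reshape(1, -1)
    distances, ids = index.search(query_emb, 10)
    return jsonify({"ids": ids[0].tolist(), "distances": distances[0].tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # in production, run behind gunicorn with gevent
```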

4.6 Evaluation

5. Practical solutions to build this easily

5.1 End to end open source solutions

5.2 From scratch

6. Conclusion

6.1 Beyond search: representation learning

software 2.0
  • Using retrieval as part of training: instead of pre-generating negative examples (for a system using a triplet loss, for example), a retrieval system can be used directly during training (this can be done for example with the integration between faiss and PyTorch); see the sketch after this list
  • Using retrieval as a way to build datasets: similar examples can be retrieved as part of a data augmentation pipeline
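As an illustration of using retrieval during training, a minimal sketch of mining hard negatives for a triplet loss with a Faiss index; the function and variable names are assumptions, not the exact faiss/PyTorch integration.

```python
# Minimal sketch: mine hard negatives during training by querying a Faiss index
# with the current batch embeddings (names and shapes are illustrative assumptions).
import faiss
import numpy as np
import torch

def mine_hard_negatives(anchor_embs: torch.Tensor, index: faiss.Index, k: int = 10) -> np.ndarray:
    # the nearest items that are not true matches make the hardest negatives
    queries = anchor_embs.detach().cpu().numpy().astype("float32")
    _, ids = index.search(queries, k)
    return ids  # candidate negative ids; filter out true positives before using them
```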

6.2 What’s next?

  • New encoders: 3D content is quickly being developed, both with libraries such as PyTorch3D and impressive papers like PIFuHD
  • Quantization and indexing is also improving quickly with papers such as Scann and Catalyzer
  • End to end training and multi-modal representation are progressing fast with vision and language having a lot of progress: towards a generalized way to build any representation?
  • Where can retrieval systems live? Up until now, they were mostly localized in servers, but with the progress of aknn and quantization, what applications can they unlock in user devices?
  • Semantic search systems are progressing fast too: hundreds of companies are building them, and several good open-source systems are starting to emerge
