Semantic search with embeddings: index anything

  1. Motivation: why search, why semantic search?
    1.1 What is search?
    1.2 What is semantic search and what can be built with it?
  2. Overall design: how to build semantic search?
  3. Encoding pipeline: from items to vectors and indices
    3.1 Extract relevant data
    3.2 Encode the items
    3.2.1 Content
    3.2.2 Distribution
    3.2.3 Composition
    3.2.4 Popularity
    3.2.5 Training
    3.3 Build relevant indices
    3.3.1 What to index, how many indices?
    3.3.2 Approximate knn and libraries
    3.3.3 Scaling
    3.3.4 Knn indices as a database component
  4. Search pipeline: from indices to semantic search systems
    4.1 Extract query parts
    4.2 Encode the query
    4.3 Search the right indices
    4.4 Post filtering
    4.5 Serving
    4.6 Evaluation
  5. Practical solutions to build this easily
    5.1 End to end open source solutions
    5.2 From scratch
  6. Conclusion
    6.1 Beyond search: representation learning
    6.2 What’s next?

1. Motivation: why search, why semantic search?

1.1 What is search?

For thousands of years people have wanted to search among documents: think of the huge libraries containing millions of books. It was possible to find things in these books thanks to people carefully sorting them by author name, publication date, and so on. Indices of the books were carefully built, and it was possible to find books by asking experts.

1.2 What is semantic search and what can be built with it?

The main difference between classical search and semantic search is that the latter represents items as small dense vectors (embeddings) and matches them by similarity in that vector space, rather than by exact keywords. Many systems can be built on top of this:

Visual search can be used to look for plants (PlantNet), but also to look for clothes
  • A text search system takes as input a text query and returns results: searching for clothes, for songs, for news
  • A visual search system takes as input an image and returns items similar to this picture.
  • A recommendation system takes as input some context, user information and returns similar items optimizing for a given objective: recommending movies, cars, houses
  • Social networks, advertisement networks, specialized search engines (product search) all use retrieval techniques to provide the best information

2. Overall design: how to build semantic search?

At a high level, a semantic search system is made of two pipelines: an encoding pipeline that turns items into vectors and groups them into indices, and a search pipeline that encodes an incoming query and looks it up in those indices. The next two sections go through each of them.

3. Encoding pipeline: from items to vectors and indices

The first step to build a semantic retrieval system is to encode items into small vectors (hundreds of dimensions). This is possible for many items and can then be used to index them and search amongst them efficiently.

3.1 Extract relevant data

Retrieval systems can encode items from many different aspects, so it’s important to think about what to encode. Some examples of items to encode are clothes, toys, animals, news, songs, movies, recipes. Each of them has different characteristics: they can be expressed by how they look, how they can be described, and how they appear among other items.

Star Wars character C-3PO
This Star Wars character, C-3PO, can be encoded with a picture of it, a description, how it appears in a graph (it’s in a Star Wars movie, appearing at the date of these movies, …), how popular it is, but also the fact that it often appears along with R2-D2, and that it has a robotic voice. All this information can be relevant. Which one to pick may affect the performance of the system a lot.

3.2 Encode the items

3.2.1 Content

Items can be encoded based on their content. Clothes can be well represented with images. Sounds are identified by their audio content. News can be understood using their text. Deep learning models are very good at producing representations of content that have good retrieval properties.

3.2.1.1 Images

Images can be encoded with convolutional networks such as ResNet (a minimal sketch follows this list)
  • Segmentation can be used to extract parts of the image pixel by pixel; for example, extracting the shirts and pants from a fashion picture
  • Detection is useful to extract rectangular zones from the images
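
As a minimal sketch of content encoding, here is how image embeddings could be extracted with a pretrained ResNet in PyTorch (the model and preprocessing choices are illustrative, not a prescription from this post):

    import torch
    import torchvision.models as models
    import torchvision.transforms as transforms
    from PIL import Image

    # Load a pretrained ResNet and drop the classification head so the
    # pooled features of the last layer serve as the embedding.
    resnet = models.resnet50(pretrained=True)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-1])
    encoder.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def embed_image(path: str) -> torch.Tensor:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            features = encoder(image)   # shape (1, 2048, 1, 1)
        return features.flatten(1).squeeze(0)  # 2048-dimensional embedding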

3.2.1.2 Text

Text can also be represented with embeddings. Text comes in various forms and lengths: words, sentences, documents. Modern deep learning can now turn most of these into powerful representations.

  • Word2vec: encoding words is one of the most popular forms of embeddings. From the context of words in documents, it’s possible to infer which words are closer in meaning to others. Illustrated word2vec and word2vec explained introduce the concept and methods well
  • Transformers are a newer method that makes it possible to encode whole sentences by better taking into account the dependencies between the many words of a sentence. A few years ago they became state of the art on many tasks. The illustrated transformer is an interesting introduction to them.
  • BERT architecture: finally, the BERT architecture is a special kind of transformer that can be trained well in a multitask setting. Its release was called the ImageNet moment of NLP. The illustrated BERT is a good introduction to it.
  • GloVe word embeddings: this small word-knn example repo I built can help to start quickly
  • The LaBSE model for sentence embeddings is a pre-trained BERT model which can encode embeddings from as many as 109 languages in a single space
  • Document embeddings can be represented as the average of their sentence embeddings (see the sketch after this list)
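
As a small sketch, encoding sentences with LaBSE through the sentence-transformers library, then averaging them for a document embedding (the model name and inputs are illustrative):

    from sentence_transformers import SentenceTransformer

    # LaBSE encodes sentences from over a hundred languages into a single space.
    model = SentenceTransformer("sentence-transformers/LaBSE")

    sentences = ["a red dress", "une robe rouge"]  # English and French
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # A simple document embedding: the average of its sentence embeddings.
    document_embedding = embeddings.mean(axis=0)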

3.2.1.3 And all the other content

Beyond text and images, audio content can also be encoded as embeddings (think of applications like Shazam). Jina examples and VectorHub provide a lot of good examples of how to encode content into embeddings.

Jina examples: a collection of end to end retrieval systems
VectorHub: a collection of models to encode data into embeddings

3.2.1.4 Scaling it up

In order to encode not just a few hundred embeddings but billions, batch jobs built with tools such as Spark or PySpark can be really helpful. Most image and text models can encode thousands of samples per second per executor; at that rate, encoding a billion samples in an hour requires about 300 executors.
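
A sketch of what such a batch job could look like with PySpark (the input parquet file, its column names, and the model choice are assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("encode-items").getOrCreate()
    items = spark.read.parquet("items.parquet")  # assumed: columns "id" and "text"

    def encode_partition(rows):
        # Load the model once per partition rather than once per row.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("sentence-transformers/LaBSE")
        rows = list(rows)
        embeddings = model.encode([row["text"] for row in rows], batch_size=256)
        for row, embedding in zip(rows, embeddings):
            yield (row["id"], embedding.tolist())

    embeddings = items.rdd.mapPartitions(encode_partition).toDF(["id", "embedding"])
    embeddings.write.parquet("embeddings.parquet")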

3.2.2 Distribution: trends

Items such as clothes, movies and news are often present on websites visited by many users. The users interact with the items, like or dislike them; some items are popular, some items are seen by only a subset of the users, and related items are often seen together. All of that is interaction data. This interaction data can be used to encode items. This is particularly useful to define embeddings for which the distance metric is based on how people interact with these items, without needing any information about their content.

3.2.2.1 SVD

A first method to build such behavior embeddings is SVD: singular value decomposition. Factorizing the user-item interaction matrix yields an embedding per item (and per user) whose distances reflect how users interact with the items.
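
A toy sketch of truncated SVD on a sparse interaction matrix with SciPy (the matrix here is random; in practice it would hold clicks, likes, or views):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Toy user-item interaction matrix: 1000 users x 500 items.
    interactions = sparse_random(1000, 500, density=0.01, format="csr")

    # Truncated SVD keeping the top 64 singular values.
    u, s, vt = svds(interactions, k=64)

    # Item embeddings: the right singular vectors scaled by the singular values.
    item_embeddings = vt.T * s  # shape (500, 64)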

3.2.2.2 Graph embeddings

A second way to encode items using their distribution is graph embeddings: the nodes of a large graph (items, but also related entities) are embedded so that connected nodes end up close in the embedding space.

Wikidata represents knowledge about world entities
PBG (PyTorch-BigGraph) makes it possible to learn transformations between billions of embeddings

3.2.3 Composition and multimodal

We now have embeddings of items from various perspectives, and they may offer complementary information. How a piece of clothing looks, how users interact with it, and how it is described may all be relevant.

  • Concatenation: concatenating the embeddings is a basic method that works surprisingly well. For example, concatenating text and image embeddings makes it possible to search for an item using either its text, its image, or both (see the sketch below)
  • Multimodal models: vision and language deep learning is becoming really popular, and many models (imagebert, vilbert, uniter, vl-bert, see this interesting demo) propose to learn from both vision and language to produce cross-modal representations
  • Fine-tuning a network for a specific task using several embeddings
ImageBert cross modal model
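
A minimal sketch of the concatenation approach, normalizing each modality first so that neither dominates the distance (the function name is illustrative):

    import numpy as np

    def compose(text_embedding: np.ndarray, image_embedding: np.ndarray) -> np.ndarray:
        # Normalize each modality, then concatenate: distances in the
        # combined space account for both text and image similarity.
        text = text_embedding / np.linalg.norm(text_embedding)
        image = image_embedding / np.linalg.norm(image_embedding)
        return np.concatenate([text, image])

With inner-product search, querying with only one modality can then be done by zeroing out the other part of the query vector.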

3.2.4 Popularity

An important topic to consider in retrieval and recommendation systems is the popularity of items: surfacing unpopular items often makes results feel irrelevant.

3.2.5 Training

To encode items, pre-trained content models and distribution-based methods work well, but to get embeddings tuned for a given task, the best way is to train a new model for this.

3.2.5.1 Image specific training

Image embeddings can be trained with tasks such as classification, identification, and segmentation. GrokNet is a good example of a large system built to learn image embeddings with specific objectives: it learns on many disparate datasets for many different tasks.

GrokNet: using a vision trunk to train embeddings with many kinds of datasets and losses.

3.2.5.2 Text specific training

BERT is a great example of a model that can be fine-tuned and reused for various objectives. In particular, the Hugging Face transformers library and the sentence-transformers library based on it are great for fine-tuning a text model for a specific use case. Hundreds of different transformer architectures are available there, with dozens of training setups.
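
A minimal fine-tuning sketch with sentence-transformers, using in-batch negatives (the model name and training pairs are illustrative, not from this post):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    # Hypothetical (query, relevant document) pairs for the target task.
    train_examples = [
        InputExample(texts=["red summer dress", "a light red dress for the summer"]),
        InputExample(texts=["running shoes", "lightweight shoes for jogging"]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

    # In-batch negatives: the other pairs in the batch act as negatives.
    train_loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)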

3.2.5.3 Distribution: recommendation specific training

Fine-tuning embeddings for recommendation

3.2.5.4 And so much more

Beyond these specific training settings, training embeddings is the core of deep learning and representation learning. It can be applied in many environments and for many objectives. Some interesting examples are:

  • StarSpace: a Facebook project to learn embeddings from images, text, graphs, and distributions for various objectives
  • Recipe embeddings: an example of a project to learn recipe embeddings from ingredients, instructions, and images

3.3 Build relevant indices

Once embeddings are built, we need a way to search among them quickly. This can be achieved with the k nearest neighbors (knn) algorithm. Its simple version consists of computing a distance between one vector and all the vectors of the dataset, which approximate k nearest neighbors algorithms improve on greatly.
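
The brute-force version is a few lines of NumPy (cosine similarity here; the choice of metric is up to the application):

    import numpy as np

    def knn(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
        # Normalize so that the dot product is the cosine similarity.
        query = query / np.linalg.norm(query)
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        scores = embeddings @ query
        # argpartition finds the top-k in O(n); only those k are then sorted.
        top_k = np.argpartition(-scores, k)[:k]
        return top_k[np.argsort(-scores[top_k])]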

3.3.1 What to index, how many indices?

In terms of performance the important things to optimize for when building indices are:

  • Latency: how much time does it take for an index to return results?
  • Recall: how many of the results of a brute force knn are found in the index result?
  • Memory: how big is the index, how much memory is needed to keep it in RAM?

3.3.2 Approximate knn and libraries

To choose the best way to build indices, the number of embeddings is a good discriminator.

  • For less than a thousand embeddings, a brute force search makes sense
  • For less than a million, a fast but not memory-efficient algorithm (such as HNSW) is appropriate
  • For less than a billion, quantization (using k-means and IVF) becomes important
  • At the trillion scale, the only solution is on-disk indices

The main algorithms behind these regimes are:

  • Naive knn: can be implemented in O(n log(k)) with a priority queue or O(n) with quickselect or introselect. To compute it, all the embeddings must be kept in memory
  • HNSW: an algorithm that builds a graph of neighbors. Search is O(log(n)) but not exact. It takes about twice the memory of the embeddings because it needs to store the graph
  • IVF: the inverted file algorithm splits the embedding space into several parts and uses k-means to approximate the embeddings. It’s slower than HNSW, but it allows decreasing the memory required by the index as much as needed

Popular libraries implementing these algorithms include:

  • Faiss: a very broad library that implements many algorithms and clean interfaces to build indices and search from them
  • Hnswlib: currently the fastest implementation of HNSW; highly specialized and optimized
  • Annoy: another knn algorithm, implemented by Spotify
  • ScaNN from Google: a new method that is state of the art, beating HNSW in speed and recall using anisotropic quantization
  • Catalyzer from Facebook: proposes to train the quantizer with a neural network for a specific task
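
As a sketch of how these libraries are used, here is a quantized faiss index (IVF with product quantization; the sizes and parameters are illustrative):

    import faiss
    import numpy as np

    dimension = 128
    embeddings = np.random.rand(100_000, dimension).astype("float32")

    # IVF with product quantization: 1024 coarse cells, each vector
    # compressed to 32 bytes (32 sub-quantizers of 8 bits each).
    quantizer = faiss.IndexFlatL2(dimension)
    index = faiss.IndexIVFPQ(quantizer, dimension, 1024, 32, 8)
    index.train(embeddings)
    index.add(embeddings)

    index.nprobe = 16  # number of cells visited at query time (recall/speed knob)
    distances, ids = index.search(embeddings[:5], 10)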

3.3.3 Scaling

To be able to scale to many embeddings, core techniques are:

  • Quantization: embeddings can be compressed into indices 100 times smaller or more
  • Sharding: partitioning the items along some dimension makes it possible to store the indices on different machines (see the sketch below)
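
Quantization was sketched above with IndexIVFPQ; here is a local sketch of sharding with faiss (in production each shard would live on a different machine; here they share a process for illustration):

    import faiss
    import numpy as np

    dimension = 128
    data = np.random.rand(100_000, dimension).astype("float32")

    # Two shards, each holding half the dataset with its original ids.
    shards = []
    for part in np.array_split(np.arange(len(data)), 2):
        shard = faiss.IndexIDMap(faiss.IndexFlatL2(dimension))
        shard.add_with_ids(data[part], part.astype("int64"))
        shards.append(shard)

    # Query every shard, then merge the partial results by distance.
    query = np.random.rand(1, dimension).astype("float32")
    results = []
    for shard in shards:
        distances, ids = shard.search(query, 10)
        results.extend(zip(distances[0], ids[0]))
    top10 = sorted(results)[:10]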

3.3.4 Knn indices as a database component

Databases come in many kinds: relational databases, key/value stores, graph databases, document stores, … Each kind has many implementations. These databases bring convenient and efficient ways to store information and look into it. Most of them provide ways to add new information and query it over the network, with APIs in many languages. At their core, these databases use indices to make querying fast. Relational databases use storage engines (such as InnoDB) which themselves use adapted indices; key/value stores implement sharded and distributed hash-based indices.

  • An Elasticsearch integration of HNSW: it adds hnsw to the general Elasticsearch database, which makes it possible to combine knn search with the strict queries, text queries, and joins provided by Elasticsearch
  • Unicorn: a private system from Facebook that integrates knn search in a graph database, so that queries in that graph can have parts using knn queries

A knn index fully integrated as a database component could also offer:

  • Automatic re-indexing, or direct addition of embeddings to the existing indices, when new data is added
  • Automatic choice of the right kind of knn index based on the specific constraints of the system

4. Search pipeline: from indices to semantic search systems

The search pipeline is the part of the system that usually runs in an online, low latency setup. Its goal is to retrieve relevant results for a given query. It’s important that it return results within seconds or milliseconds, depending on the constraints, while using a limited amount of memory.

4.1 Extract query parts

The first part of the system consists of taking a query as input and extracting relevant data from it, to be able to encode it as query embeddings.

  • For an image, segmentation or object detection can be relevant: extracting only the clothes from the photo of a person, for example
  • For text, it can be relevant to extract named entities from the query, and it could make sense to apply query expansion on it to add relevant terms or fix typos
  • For a list of items, clustering the items to select only a relevant subset could help

4.2 Encode the query

Once relevant data is extracted from the query, each of those elements can be encoded. The way to encode it is usually similar to the way the embeddings from the indices are built, but it’s possible to apply techniques that are only relevant for the query.

  • An average of several elements, to get results that are relevant for a list of items (see the sketch after this list)
  • Cluster the points of the query and pick a cluster as the query
  • Use more complex models to generate appropriate queries in the same space, using transformer models for question answering (see DPR), or transformations from graph embeddings (see PBG) for example
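
For example, the "list of items" case can be handled with a normalized average (a minimal sketch; the function name is illustrative):

    import numpy as np

    def encode_list_query(item_embeddings: np.ndarray) -> np.ndarray:
        # Average the embeddings of the items in the list, then normalize,
        # so that the query lives in the same space as the index.
        query = item_embeddings.mean(axis=0)
        return query / np.linalg.norm(query)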

4.3 Search the right indices

Depending on the kinds of queries, it might be relevant to build not one index but several. For example, if queries can include a filter for a given category of items, it can make sense to build a specific index for that subset of embeddings.

4.4 Post filtering

Building several indices is one way to introduce strict filtering in a system; another is to run a larger knn query and post-filter the results, as sketched below.
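
A sketch of post-filtering on top of a faiss-like index (the overshoot factor is a knob to tune; allowed_ids stands for whatever strict filter the application needs):

    import numpy as np

    def search_with_filter(index, query, allowed_ids, k=10, overshoot=10):
        # Ask the index for more results than needed, then keep only
        # the ones passing the strict filter.
        query = query.reshape(1, -1).astype("float32")
        distances, ids = index.search(query, k * overshoot)
        kept = [int(i) for i in ids[0] if int(i) in allowed_ids]
        return kept[:k]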

4.5 Serving

Finally, building a serving application makes it possible to expose the features to users or other systems. Thanks to fast approximate k nearest neighbors libraries, it is possible to have latencies in milliseconds and thousands of queries per second.

  • To experiment initially, building a simple Flask application with faiss can be done in as few as 20 lines of code (see the sketch after this list)
  • Serving Flask with a proper server such as gunicorn with gevent can be enough to reach millisecond latencies at thousands of qps
  • To get even more performance, a serving service can be built in a native language such as Rust or C++. The benefit of using a native language for this kind of application is avoiding GC costs: since the knn index itself is built in C++, only the serving code needs to be optimized
  • Aknn libraries are most often built in C++, but bindings can be written for many languages (Java, Python, C#), manually or with SWIG. For integration with an existing application, this can be the most relevant option in some cases
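
For reference, such a small Flask application could look like this (the index file name and request format are assumptions):

    import faiss
    import numpy as np
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    index = faiss.read_index("index.faiss")  # assumed prebuilt index file

    @app.route("/search", methods=["POST"])
    def search():
        query = np.array(request.json["embedding"], dtype="float32").reshape(1, -1)
        distances, ids = index.search(query, 10)
        return jsonify({"ids": ids[0].tolist(), "distances": distances[0].tolist()})

    if __name__ == "__main__":
        app.run(port=5000)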

4.6 Evaluation

The evaluation of a semantic search system depends a lot on the actual use case: a recommender system and an information retrieval system might call for very different metrics. Metrics can be broadly split into two categories: online metrics and offline metrics.

Online metrics can be measured only from the usage of the system, often in an A/B test setting. For recommendation in particular, the click-through rate or directly the revenue can be considered; this document explains some of them in more detail.

Offline metrics can be computed from offline datasets and require labels. These labels can be implicit, based on how users interact with the system (did the user click on this result?), or explicit (annotators providing labels). Some offline metrics are general to all retrieval systems; the Wikipedia page on this is pretty complete. Often used metrics include the recall, which measures the proportion of relevant documents that are retrieved, and the discounted cumulative gain (DCG), which accounts for the rank of the retrieved items.
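
For concreteness, recall@k and DCG can be computed in a few lines (a sketch; the relevance values are graded labels for the retrieved list):

    import numpy as np

    def recall_at_k(retrieved, relevant, k):
        # Fraction of the relevant items that appear in the top-k results.
        return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

    def dcg_at_k(relevances, k):
        # Discounted cumulative gain: relevance discounted by log2 of rank.
        relevances = np.asarray(relevances[:k], dtype=float)
        discounts = np.log2(np.arange(2, len(relevances) + 2))
        return float(np.sum(relevances / discounts))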

5. Practical solutions to build this easily

5.1 End to end open source solutions

Another way to start building semantic search applications is to use pre-existing open source solutions; several organizations and people have built them recently. They vary in goals: some are specific to one modality, some only handle the knn part, and a few try to implement everything a semantic search system needs.

5.2 From scratch

Writing a semantic search system can seem like a huge task because of all the different parts that are needed. In practice, the versatility and ease of use of libraries for both encoding and indexing make it possible to create an end to end system in a few lines of code. The image embeddings repo I built can be a simple way to start building a system from scratch. You can also check the small word-knn repo I built as a simple example. The PBG helper I built with a colleague can also help bootstrap the use of graph embeddings. This video from CVPR 2020 is another good tutorial to start on this.

6. Conclusion

6.1 Beyond search: representation learning

Software 2.0

Embeddings are not only useful for search: representation learning is at the core of modern deep learning, and retrieval systems can themselves become components of other machine learning pipelines, for example:
  • Using retrieval as part of training: instead of pre-generating negative examples (for a system using a triplet loss, for example), a retrieval system can be used directly during training (this can be done, for example, with the integration between faiss and PyTorch)
  • Using retrieval as a way to build datasets: similar examples can be retrieved as part of a data augmentation pipeline

6.2 What’s next?

As we saw in this post, retrieval systems are easy to build and really powerful. I encourage you to play with them and think about how they could be used in many applications and environments.

  • New encoders: 3D content is developing quickly, both with libraries such as PyTorch3D and impressive papers like PIFuHD
  • Quantization and indexing are also improving quickly, with papers such as ScaNN and Catalyzer
  • End to end training and multimodal representations are progressing fast, with vision and language seeing a lot of progress: are we heading towards a generalized way to build any representation?
  • Where can retrieval systems live? Until now they mostly lived in servers, but with the progress of aknn and quantization, what applications could they unlock on user devices?
  • Semantic search systems are progressing fast too: hundreds of companies are building them, and several good open-source systems are starting to emerge
