Semantic search at billions scale

  • Img2dataset: distributed downloading of 5B samples into 240TB of images
  • Clip inference: distributed inference of 240TB of images into 18TB of embeddings
  • Autofaiss : distributed indexing of 9TB of embeddings into 800GB of index
  • Clip back : serving a semantic search engine using a single 16GB of ram machine


Img2dataset is a tool that is able to download and resize images from urls very fast.

  • Input: url and text files (800GB)
  • Output: tar files containing images and text (240TB)
  • Splitting the input into shards, save those to disk with
  • Using a generic distributor implementation to support both pyspark and multiprocessing
  • A downloader that takes a shard of url and metadata, downloads it and produces an image shard
  • A writer to save each output shard with various formats
  • Pyspark to distribute the task in 10 nodes
  • Wandb for metric tracking
  • Fsspec for saving to/from any file system
  • Webdataset/pyarrow/pandas to handle reading and writing
  • Computing exif for billions of items eventually reach some weird edge cases, but you can just fix it
  • Building pex automatically on release using github actions makes it easy to distribute the code to many machines
  • There are so many ways to resize images, thanks ross wightman for helping improve their support
  • Supporting tfrecord can be helpful for reading in jax and tensorflow, thanks boris dayman for this
  • Using output files with a zero padding numbering format makes it convenient to sort the files and to list them
  • Fsspec was really useful to support writing on s3, local or gcs

Clip inference

Clip inference is a tool that takes image and text data and produces image and text embeddings.

  • Input: tar files containing images and text (240TB)
  • Output: numpy and parquet files (18TB)
  • A generic distributor to handle both pyspark and sequential strategies and uses the pyspark 3 handling of resources to pick the right gpu
  • A runner which calls the reader the mapper and the writer inside an executor
  • The sampler which decides which files to read from the input
  • The reader which uses a sampler to read part of the data as webdataset
  • The mapper which takes an image shard and output an embedding shard
  • The writer takes an embedding shard and writes it to the file system using fsspec
  • The logger which logs metrics in all executors using a folder in the output file system and wandb
  • The main which build the config for all of those and start the inference


Autofaiss is a tool that takes clip embeddings and produces knn indices.

  • Input: numpy files (9TB)
  • Output: indices (800GB)
  • Split the embeddings collection in N parts: for example a 9TB embedding set is transformer into 100 parts of 90GB
  • Add each part to a pretrained index, that leaves us with 100 indices parts
  • Merge each of these N parts back to M index, for example 10 indices
  • The distributed on disk faiss guide is really good to explain the various options for distributed indexing
  • Using fsspec here again allowed to support all filesystems, eg s3 or hdfs

Clip back

Clip back allows to plug the index and metadata and search among results.

  • Input: indices and metadata (1.6TB)
  • Output: a rest api taking image or test and returning similar url and text
  • In order to map the result from id to metadata, arrow is good for memmapping see ArrowMetadataProvider and pyarrow ipc
  • Merge on disk of faiss allows merging 800GB of index at no ram usage
  • Faiss allows to load an index with memory mapping, and hence use a 800GB index at no ram usage
  • classify an image into safe/unsafe
  • Deduplicate results by computing dot products and keeping only one sample per connected components
  • External sort works well on hdd ; but however spark sorting works much better on ssd nvme
  • Memory mapping for kv works much better on ssd nvme
  • On disk knn are much faster on ssd nvme

Limitations and possible improvements

There are some clear limitations on what was built here:

  • All the content is static: real search engines need to update the content regularly
  • Only batch processing: combining real time / stream processing and batch processing allows a much higher reactivity while keeping costs low
  • Semantic and strict filtering are not combined
  • Using tools such as apache beam to combine batch and streaming processing
  • Using a constantly updated source such as a custom crawler or following commoncrawl updates
  • Improving faiss by handling one stage filtering (see this excellent post by pinecone on the topic search filtering )


Thanks for reading!



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Romain Beaumont

Romain Beaumont


Machine learning engineer interested in representation learning, computer vision, natural language processing and programming (distributed systems, algorithms)