Vector Databases

Our goal is to come up with a robust, scalable, general-purpose recommendation solution that relies only on vector/matrix lookups to get the top-k candidates.

Right now, the most promising candidate is Elasticsearch. Please refer to its section below.

Why not PostgreSQL?

  • Even with C extensions we only get sequential execution. Parallelization might be possible, but we are working with high-dimensional data (4096 and more dimensions) and cannot use any batching, only vector-to-vector operations
  • Storing floating-point data is possible as an array, but no indexing is possible. This means we can only parallelize sequential operations, which does not scale

We have implemented the Hellinger distance as a PostgreSQL extension:

Git: https://git.picalike.corpex-kunden.de/incubator/similarity-postgresql

Wiki: http://dokuwiki.picalike.corpex-kunden.de/psql_sim_db
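
For reference, the extension computes (roughly) the following; this NumPy sketch is only an illustration, the authoritative implementation lives in the Git repo above:

<code python>
import numpy as np

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    # Hellinger distance between two non-negative, L1-normalized vectors:
    # H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2))
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

print(hellinger(np.array([0.2, 0.3, 0.5]), np.array([0.1, 0.4, 0.5])))
</code>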

A commercial solution: https://www.pinecone.io/learn/hnsw/

It uses the indexing strategy (HNSW) described here: https://arxiv.org/abs/1603.09320

Requirements

Solutions like Margo provide only a document interface for the import, which is a problem since we use our own models and feature extraction happens outside the service. Thus a service needs to provide at least the following functionality:

  • [MUST] ability to provide features explicitly per document, as a list of floats or a numpy array / tensor
  • [MAY] ability to use brute-force search instead of approximate nearest neighbor, because a metric like cosine makes no sense for our binary 512-bit features; L2 distance is okay
  • [MUST] rich query language that provides and/or/not-style searches as well as range queries
  • [SHOULD] since we have multiple categories per product, a “$query_cat in product_cats” query needs to be either directly supported or at least we need to model it somehow (see the sketch after this list)
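
For Elasticsearch (our current favorite, see below), the last requirement maps naturally onto a multi-valued keyword field: a term query against such a field matches if any of the stored values equals the query value. A minimal sketch, assuming hypothetical field names product_cats and price:

<code python>
# "$query_cat in product_cats" style filter: matches every product whose
# product_cats array contains the query category. The bool/filter context
# also covers the and/or/not and range requirements.
query = {
    "bool": {
        "filter": [
            {"term": {"product_cats": "dresses"}},  # $query_cat in product_cats
            {"range": {"price": {"lte": 100}}},     # hypothetical range filter
        ]
    }
}
</code>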

Alternatives

None of these had the necessary features or the required performance. They are listed here for documentation purposes.

Weaviate: https://weaviate.io
An open-source vector database that can also apply filtering based on metadata.

Milvus: https://milvus.io/docs/v2.1.x/install_standalone-docker.md

Candidates for evaluation

Solr

With version 9, native “Dense Retrieval” is supported:
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html
Docker image: https://hub.docker.com/_/solr
Docker hints: https://solr.apache.org/guide/solr/latest/deployment-guide/solr-in-docker.html
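
A minimal query sketch, assuming a hypothetical core “products” with a DenseVectorField named “features” (4 dimensions here only for brevity):

<code python>
import requests

# The {!knn} query parser returns the topK documents nearest to the given vector.
vector = "[0.12, 0.34, 0.56, 0.78]"
resp = requests.get(
    "http://localhost:8983/solr/products/select",
    params={"q": f"{{!knn f=features topK=10}}{vector}"},
)
print(resp.json()["response"]["docs"])
</code>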

Without “cloud” mode, no sharding is supported, which is why the Docker image is limited. Also, without sharding, the concurrency level seems suboptimal.

If you modify a schema, the data needs to be re-indexed:
https://solr.apache.org/guide/solr/latest/indexing-guide/reindexing.html
This was very painful, since you also have to delete orphaned segments manually.

Elasticsearch

Git: https://git.picalike.corpex-kunden.de/incubator/vector-db

kNN vector search is a newly introduced feature:
https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html#dense-vector-similarity

Issues:

  • Brute-force exact match is too expensive, however HNSW-based search is still in technical preview: “This functionality is in technical preview and may be changed or removed in a future release. Elastic will apply best effort to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.”
  • Use of dot_product to simulate Hamming distance isn't possible due to a sanity check: only unit-length vectors are allowed (see the sketch below for why dot_product would work). This happens inside a single function and could be changed, but it requires rebuilding the library and repackaging a Docker container, which might not be straightforward.
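
For context, why dot_product can emulate Hamming distance, assuming a ±1 encoding of the binary features: for d-dimensional ±1 vectors, dot(a, b) = d - 2 * hamming(a, b), so ranking by dot product is equivalent to ranking by Hamming distance. A quick NumPy check:

<code python>
import numpy as np

d = 512
a = np.where(np.random.rand(d) < 0.5, -1.0, 1.0)  # ±1-encoded binary features
b = np.where(np.random.rand(d) < 0.5, -1.0, 1.0)

hamming = int(np.count_nonzero(a != b))
# agreements minus disagreements: (d - hamming) - hamming = d - 2 * hamming
assert int(a @ b) == d - 2 * hamming
</code>

Such vectors have L2 norm sqrt(d) rather than 1, which is exactly what the sanity check rejects.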

Like Solr, it does not support changes to an index (called a schema in Solr); a new index must be created and all data reindexed. It is possible to add and update documents; reindexing happens in the background.

Speed: no sharding is required, as each search is single-threaded within a shard, but each shard can handle concurrent requests using more CPUs. Therefore it's very fast (faster than Solr) even under high concurrency.

The documentation, in fact, suggests avoiding too many shards in favor of larger ones:
https://www.elastic.co/guide/en/elasticsearch/reference/8.4/size-your-shards.html

Speed is also proportional to the total available memory. Elasticsearch uses nearly *all* available memory if unlimited and does clever indexing and caching behind the scenes.

It's recommended to run Elasticsearch-only servers with all their memory dedicated to it. The “size your shards” guide above gives general guidelines about required and recommended heap sizes.

We need distances for all filtered products, for both the shape and the color vector. This is possible by defining a custom score function:
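
A minimal sketch of such a query, assuming hypothetical field names (shape_vector, color_vector, product_cats); script_score must return a non-negative, higher-is-better value, hence the inversion of the summed L2 distances:

<code python>
# Mapping sketch: dense_vector fields with "index": False are not indexed
# (no ANN structure), so distances are computed exactly at query time.
mappings = {
    "properties": {
        "shape_vector": {"type": "dense_vector", "dims": 512, "index": False},
        "color_vector": {"type": "dense_vector", "dims": 512, "index": False},
        "product_cats": {"type": "keyword"},
    }
}

shape_q = [0.1] * 512  # query-side features from our own extraction (placeholders)
color_q = [0.2] * 512

query = {
    "script_score": {
        # the metadata filter runs first; only the survivors are scored
        "query": {"term": {"product_cats": "dresses"}},
        "script": {
            "source": (
                "1 / (1 + l2norm(params.shape_q, 'shape_vector')"
                " + l2norm(params.color_q, 'color_vector'))"
            ),
            "params": {"shape_q": shape_q, "color_q": color_q},
        },
    }
}
# hits = es.search(index="products", query=query)  # "es": a connected client
</code>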

This way, the vectors are not indexed and exact distances are calculated at query time. This approach yields very good performance: <50 ms for small shops, and it stays <100 ms 99% of the time for bigger shops (1M products), provided there is enough memory.

Directly access the vector data (dimensions):
https://www.elastic.co/guide/en/elasticsearch/reference/8.4/query-dsl-script-score-query.html#vector-functions

Multi-fields, which we do not use, since we only have 'keyword' (non-analyzed) fields and long fields for prices, etc.:
https://www.elastic.co/guide/en/elasticsearch/reference/8.4/multi-fields.html
https://www.elastic.co/guide/en/elasticsearch/reference/8.4/keyword.html

Hints on how to get rid of the password; the certificate is still needed:
https://discuss.elastic.co/t/set-password-and-user-with-docker-compose/225075
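
A minimal connection sketch with the 8.x Python client (host, certificate path, and credentials are placeholders):

<code python>
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://localhost:9200",
    ca_certs="./http_ca.crt",              # the certificate is still needed
    basic_auth=("elastic", "<password>"),  # placeholder credentials
)
print(es.info())
</code>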

Python package:
https://elasticsearch-py.readthedocs.io/en/v8.4.3/api.html

Docker:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

Issues:

  • when using a Python thread pool with concurrency >10, the first batch of n tasks (n = max_concurrency - 10) hangs until *all* other queries finish, effectively taking almost the entire runtime of the script to complete. Neither CPU nor memory usage is at 100%, but disk IO is pushed to its limits. However, if we sleep after the first batch of max_concurrency requests, the second batch hangs instead. All other timings are still very good.
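
A rough sketch of the pattern that triggers the hang (index name and credentials are placeholders; the query itself does not seem to matter):

<code python>
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", ca_certs="./http_ca.crt",
                   basic_auth=("elastic", "<password>"))

max_concurrency = 32  # anything > 10 shows the effect

def run_query(i: int):
    return es.search(index="products", query={"match_all": {}})

with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
    results = list(pool.map(run_query, range(1000)))
</code>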

Multi-node setup docs:

Blog post about how Tinder approached complex search requirements:

Libraries

We decided that the most versatile library is indeed faiss, which also supports the HNSW index mentioned above:
https://github.com/facebookresearch/faiss
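
A minimal HNSW sketch with random data (all sizes and parameters are placeholders):

<code python>
import numpy as np
import faiss  # pip install faiss-cpu

dim = 512
db_vectors = np.random.rand(10000, dim).astype("float32")  # hypothetical corpus
queries = np.random.rand(5, dim).astype("float32")

# IndexHNSWFlat builds the HNSW graph from arXiv:1603.09320; the second
# argument is M, the number of graph neighbors per node (L2 metric by default).
index = faiss.IndexHNSWFlat(dim, 32)
index.add(db_vectors)

distances, ids = index.search(queries, 10)  # top-10 candidates per query
print(ids[0])
</code>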
