Consolidation of the existing reviews of Vector DBs

Pradeep Bansal
5 min read · Jun 20, 2023

Semantic search returns more accurate and contextually relevant results by understanding the meaning behind words and phrases. One of its critical components is the efficient storage and retrieval of vector embeddings, which represent the semantic features of textual data. Vector databases make this possible: they store the embeddings and retrieve them with efficient algorithms such as Approximate Nearest Neighbor (ANN) search. Many vector databases are available, so we went through the existing review blogs and articles about them and consolidated their findings here. Let's get started.
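To make the retrieval step concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search by cosine similarity in plain Python. It is not taken from any of the reviewed articles: the three-dimensional toy vectors, the document ids, and the `top_k` helper are all made up for illustration. ANN indexes aim to return approximately the same ranking without scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones come from an embedding model
# and usually have hundreds of dimensions.
toy_embeddings = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_cars": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.2, 0.0], toy_embeddings, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

The full scan costs time proportional to the number of stored vectors for every query, which is the cost that ANN algorithms trade a little accuracy to avoid.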

Look at the vector database market

The article provides a table of vector databases including Pinecone, Weaviate, MarqoAI, Qdrant, Chroma, Vespa, Vald and Milvus. The table does not go into technical depth; for each database it lists whether it is open source, the language it is developed in, a brief summary, the indexing type, the consistency level and a few other details.

Vector Library vs Vector Database

The article first differentiates between vector libraries, including FAISS, ScaNN, ANNOY, NMSLIB and HNSWLIB, and vector databases such as Weaviate or Pinecone. Both are used to search efficiently through vector embeddings. Vector libraries store only the vector embeddings and are suitable for static data, while vector databases can store and update both the vectors and the associated objects, making them suitable for dynamic data. Vector databases offer features like CRUD support, real-time search, and structured filters, while vector libraries are faster and more focused on in-memory similarity search. Weaviate is an open-source vector database that combines the speed of ANN algorithms with database features.

Key differences between vector libraries and vector databases

  • Vector libraries store vector embeddings only, while vector databases can store both vectors and the associated objects.
  • Vector libraries have immutable indexes, while vector databases allow for updates and modifications to the index.
  • Most vector libraries require importing all data objects before building the index, while vector databases allow querying and modifying data during the import process.
  • Vector databases offer features like filtering, CRUD support, real-time search, persistence, replication, backups, and sharding, which are not available in vector libraries.
  • Vector databases provide SDKs and language clients, while vector libraries often have Python bindings.
  • Vector databases have a deployment ecosystem (Docker, K8s, Helm, SaaS) and support multi-tenancy, while vector libraries require building these features.
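The differences listed above can be illustrated with a small sketch. The `ToyVectorStore` class below is hypothetical and is not the API of Weaviate, Pinecone, or any library named here. It only shows the database-style behaviour the article describes: vectors stored together with their objects, updates and deletes at any time, and a structured filter applied during search. It scans every item, so it does not show the ANN indexing a real system would use.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyVectorStore:
    """Hypothetical in-memory store: vectors plus objects, with CRUD and filters."""

    def __init__(self):
        self._items = {}  # item_id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata):
        # Create or update: the store stays queryable while data changes.
        self._items[item_id] = (vector, metadata)

    def delete(self, item_id):
        self._items.pop(item_id, None)

    def search(self, query, k=3, where=None):
        # `where` is an optional predicate on the metadata (a structured filter).
        candidates = []
        for item_id, (vector, metadata) in self._items.items():
            if where is not None and not where(metadata):
                continue
            candidates.append((item_id, _cosine(query, vector)))
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        return [item_id for item_id, _ in candidates[:k]]


store = ToyVectorStore()
store.upsert("shoe-1", [0.9, 0.1], {"category": "shoes", "in_stock": True})
store.upsert("shoe-2", [0.8, 0.2], {"category": "shoes", "in_stock": False})
store.upsert("hat-1", [0.1, 0.9], {"category": "hats", "in_stock": True})

# Filtered search: only items that are currently in stock.
in_stock = store.search([1.0, 0.0], k=2, where=lambda m: m["in_stock"])

# Delete an item and query again without rebuilding anything.
store.delete("shoe-1")
after_delete = store.search([1.0, 0.0], k=1)
```

A vector library, by contrast, would typically hold only the vectors, leave the mapping to objects and the filtering to the caller, and often require rebuilding the index after changes.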

Use cases

  • Vector libraries are suitable for applications with static data, such as academic information retrieval benchmarks.
  • Vector databases are ideal for applications with constantly changing data, such as e-commerce recommendations, image search, and semantic similarity.

Weaviate

  • Weaviate is an open-source vector database that combines ANN algorithm speed with database features like backups, real-time queries, persistence, and replication.
  • Weaviate supports GraphQL, REST, and client libraries in multiple programming languages.
  • Weaviate is suitable for various use cases, and it recently introduced a new feature for representing a user’s interests through cross-references.
  • Weaviate provides a feature comparison with vector libraries, highlighting the differences in filtering, updating capability, incremental importing, object and vector storage, speed, performance optimization, durability, persistence, sharding, replication, backups, deployment ecosystem, SDKs, multi-tenancy, and module ecosystem.

Benchmarking of selected vector databases

The Qdrant benchmarking article discusses how to benchmark vector search engines and presents an open benchmark for comparing their performance. The benchmarks focus on relative numbers, so that different engines can be compared with equal resources. The article also explains the use of a Python client and the importance of factors such as search precision, speed, and resource requirements. The results show that Qdrant and Milvus perform well in terms of indexing time, while Qdrant consistently achieves high requests per second (RPS) and low latencies. The article also covers the challenges of filtered search and gives insights into how the engines perform in various scenarios. The benchmarks are open source and contributions are welcome. Keep in mind that the benchmark is published by Qdrant itself, so some bias is likely; the authors acknowledge that they did not try every configuration of the other databases.

Key points from the benchmarking article by Qdrant:

  • The article discusses benchmarking vector search engines and the need for a unified open benchmark in the world of vector databases.
  • All tests run on the same machine, an average one that can easily be rented from any cloud provider, which keeps the benchmarks affordable and reproducible.
  • The focus is on relative numbers, allowing performance to be compared across engines with equal resources.
  • The benchmark list includes upload and search speed on a single node, a filtered search benchmark, a memory consumption benchmark, and a cluster mode benchmark.
  • Experiment design decisions are described in the FAQ section, and suggestions for testing variants can be made in the Discord channel.
  • The benchmarks cover various configurations of engines and datasets, including different vector dimensionality and distance functions.
  • The Python client is used for the benchmarks, because Python is commonly used to generate embeddings and interact with vector databases, particularly in deep learning applications.
  • The results are presented in interactive charts, allowing users to select datasets, search threads, and metrics to compare engine performance.
  • Different patterns are observed in the filtered search benchmarks, including speed boost, speed downturn, and accuracy collapse.
  • Qdrant and Milvus are highlighted as the fastest engines in terms of indexing time, while Qdrant achieves high RPS and low latencies in most scenarios.
  • Redis performs well with a single thread but is limited in scalability by its architectural constraints.
  • Elasticsearch is generally slower than the other engines across different datasets and metrics.
  • The complexity of filtering search results and the challenges faced by different engines are discussed.
  • The benchmark is open source, and contributions are welcome to improve accuracy and address potential biases.
  • Factors considered in choosing a database include search precision, speed, and resource requirements.
  • FAISS and Annoy are not compared, as the article does not consider these libraries suitable for production environments and notes that they lack certain features.
  • The benchmark primarily focuses on open-source vector databases to ensure fair comparisons and reproducibility.
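The metrics mentioned above (search precision, latency, RPS) can be sketched in a few lines of Python. This is not Qdrant's benchmark code. It is a toy harness under stated assumptions: random 8-dimensional vectors, a single thread, recall@k as the precision measure, and a hypothetical `sampled_search` that scans only a fixed random half of the data as a crude stand-in for an approximate index.

```python
import random
import time


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(query, vectors, k):
    """Ground truth: indices of the k closest vectors, found by a full scan."""
    order = sorted(range(len(vectors)), key=lambda i: squared_distance(query, vectors[i]))
    return order[:k]


def sampled_search(query, vectors, k, sample_ids):
    """Stand-in for ANN (an assumption): scan only a fixed subset of the data."""
    order = sorted(sample_ids, key=lambda i: squared_distance(query, vectors[i]))
    return order[:k]


def recall_at_k(found, truth):
    """Fraction of the true top-k neighbours that the search returned."""
    return len(set(found) & set(truth)) / len(truth)


def benchmark(search_fn, queries, truths):
    latencies, recalls = [], []
    for query, truth in zip(queries, truths):
        start = time.perf_counter()
        found = search_fn(query)
        latencies.append(time.perf_counter() - start)
        recalls.append(recall_at_k(found, truth))
    total = sum(latencies)
    return {
        "mean_recall": sum(recalls) / len(recalls),
        "mean_latency_s": total / len(latencies),
        "rps": len(latencies) / total if total > 0 else float("inf"),
    }


rng = random.Random(0)  # fixed seed so the toy data is repeatable
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
queries = [[rng.random() for _ in range(8)] for _ in range(20)]
K = 5
truths = [exact_search(q, vectors, K) for q in queries]
sample_ids = rng.sample(range(len(vectors)), 250)

exact_stats = benchmark(lambda q: exact_search(q, vectors, K), queries, truths)
approx_stats = benchmark(lambda q: sampled_search(q, vectors, K, sample_ids), queries, truths)
print("exact :", exact_stats)
print("approx:", approx_stats)
```

The point of the sketch is the trade-off the benchmark charts show: speed figures are only meaningful when read together with the precision achieved, and only relative numbers on the same machine are comparable.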

The landscape of Vector Databases

The article includes a complete presentation that gives a brief overview of the vector databases, compares them, and lists factors and suggestions for choosing one. The list of vector databases is quite exhaustive, and one of the slides also touches on options such as AWS OpenSearch and Azure Cognitive Search. The author, Dmitry Kan, also walks through the slides in a 20-minute talk. He gives pointers on how to select a vector database both in the talk and in a short companion article. The main factors when choosing a vector database include (but are not limited to):

  • whether you have an engineering team to host the database or need a fully managed one,
  • whether you already have the embeddings or need the vector database to generate them,
  • latency requirements, such as batch or online,
  • the developers' experience in the team, given the learning curve involved,
  • reliability, cost and security.

Source: Dmitry Kan

Written by Pradeep Bansal

Staff ML Engineer, MS IISc, Ex-Entrepreneur, ML Consultant, Health Expert https://www.linkedin.com/in/pradeepud/