Hybrid search is a technique that combines multiple search algorithms to improve the accuracy and relevance of search results. It uses the best features of both keyword-based search algorithms with vector search techniques. By leveraging the strengths of different algorithms, it provides a more effective search experience for users.
The Hybrid search feature was introduced in Weaviate 1.17. It uses sparse and dense vectors to represent the semantic meaning and context of search queries and documents.
In this blog post, you will learn about the implementation of hybrid search in Weaviate and how to use it.
Sparse and Dense Vectors
Sparse and dense vectors are calculated with distinct algorithms. Sparse vectors have mostly zero values with only a few non-zero values, while dense vectors mostly contain non-zero values. Sparse embeddings are generated from algorithms like BM25 and SPLADE. Dense embeddings are generated from machine learning models like GloVe and Transformers.
Note, the current implementation of hybrid search in Weaviate uses BM25/BM25F and vector search.
If you’re interested to learn about how dense vector indexes are built and optimized in Weaviate, check out this article.
BM25
BM25 builds on the keyword scoring method TF-IDF (Term-Frequency Inverse-Document Frequency) by taking the Binary Independence Model from the IDF calculation and adding a normalization penalty that weighs a document’s length relative to the average length of all the documents in the database.
The image below presents the scoring calculation of BM25:
The score of the document, query pair is determined by weighing the uniqueness of each keyword in the query relative to the collection of texts. BM25 contains additional static parameters, k1 and b that may help calibrate performance to particular datasets.
BM25F
BM25F was also implemented in Weaviate 1.17
. BM25F is a variant of BM25 that allows multiple text fields per object to be given different weights in the ranking calculation. These weights are important for when fields in a document are more important than others. For example, a title may be given more weight than the abstract, since the title is sometimes more informative and concise. This type of weighting makes BM25F more flexible and customizable than BM25.
Dense Vectors
Dense vectors represent information stored in a database; this includes text, images, and other types of data. These embeddings are generated from machine learning models that convert data to vectors. The vectors are densely packed with information and are mostly made up of non-zero values. The meaning of each value in the vector depends on the machine learning model that you used.
Vector databases, like Weaviate, store these embeddings and calculate the distance between the two vectors. Distance metrics show how similar or dissimilar two vector embeddings are. The search query is converted to a vector, similar to the data vectors, and the distance value determines how close the vectors are.
Hybrid Search Explained
Hybrid search merges dense and sparse vectors together to deliver the best of both search methods. Generally speaking, dense vectors excel at understanding the context of the query, whereas sparse vectors excel at keyword matches. Consider the query, “How to catch an Alaskan Pollock”. The dense vector representation is able to disambiguate “catch” as meaning fishing rather than baseball or sickness. The sparse vector search will match the phrase “Alaskan Pollock” only. This example query shows where hybrid search combines the best of both sparse and dense vectors.
Reciprocal Rank Fusion (RRF)
While researching hybrid search, we needed a way to combine the results of BM25 and dense vector search into a single ranked list. We came across a paper from Benham and Culpepper exploring rank fusion techniques. This paper analyzed seven strategies for combining the ranked results of two lists into a single ranking. We decided to start with the RRF score. The RRF score is calculated by taking the sum of the reciprocal rankings that is given from each list. By putting the rank of the document in the denominator, it penalizes the documents that are ranked lower in the list.
Let's look at an example of this. We have three documents labeled A
, B
, and C
and have run a BM25 and Dense search. In this example, we have set the constant k to 0.
BM25 Ranking | Dense Ranking | Results |
---|---|---|
A | B | A: 1/1 + 1/3 = 1.3 |
B | C | B: 1/2 + 1/1 = 1.5 |
C | A | C: 1/3 + 1/2 = 0.83 |
The above table shows the ranked order of the BM25 and Dense search. To fuse the two lists together, we need to take the sum of the reciprocal ranks. Based on the results, the top document is Document B
with a ranking of 1.5, then Document A
at 1.3, and Document C
at 0.83.
Re-ranking is a necessary step when implementing hybrid search. The alpha parameter dictates the weighting of each algorithm and determines the re-ranking of the results.
For a deep dive into how Weaviate's hybrid search combines sparse and dense vectors, you can read more on these fusion algorithms in our blog post.
When to Use Hybrid Search?
Hybrid search queries are ideal for a search system that wants to leverage the power of semantic search capabilities but still rely on exact keyword matches. For example, the example search query "How to catch an Alaskan Pollock" from before would lead to better results with a hybrid search query than with a common keyword search or even a semantic search query.
A Simple Hybrid Search Pipeline in Weaviate
To use hybrid search in Weaviate, you only need to confirm that you’re using Weaviate v1.17
or a later version. You can run the hybrid queries in GraphQL or the other various client programming languages.
There are five parameters needed to run the hybrid search query (some are optional):
hybrid
: shows that you want to use a hybrid search queryquery
: search queryalpha
(optional): weighting for each search algorithm (alpha = 0 (sparse, keyword search), alpha = 1 (dense, vector search), alpha = 0.5 (equal weight for sparse and dense vectors))vector
(optional): optional to supply your own vectorscore
(optional): additional information on how much the sparse and dense methods contributed to the result
With just a few lines of code, you can start using hybrid search. You can run a test query in the Weaviate console using GraphQL. The query is, “Fisherman that catches salmon” (similar to the example above). When we set the alpha to 0.5 it is equally weighing the dense and sparse vector results.
{
Get {
Article (
hybrid: {
query: "Fisherman that catches salmon"
alpha: 0.5
})
{
title
summary
_additional {score}
}
}
}
Summary
This article explained the concept of hybrid search. In the context of the Weaviate vector database, hybrid search is a combination of keyword search and semantic search. By combining these two types of searches, hybrid search queries can lead to more relevant results in use cases where you want to leverage the power of semantic search but also rely on exact matches of certain keywords, such as product or brand names.
Check out the documentation for more information on hybrid search!
Ready to start building?
Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).
Don't want to miss another blog post?
Sign up for our bi-weekly newsletter to stay updated!
By submitting, I agree to the Terms of Service and Privacy Policy.