Binary Passage Retrieval
BPR support in Weaviate is under development.
Overview
Binary Passage Retrieval, or learning-to-hash, is the idea of producing a lightweight encoding (or hash) of each vector and using the hash instead of the vector. The term Binary Passage Retrieval (BPR for short) originates from a 2021 research paper by Yamada et al. It is also called learning-to-hash because the models used for BPR are trained with a combined loss function that considers both the cosine/dot distance between the regular vectors and the Hamming distance between their binary hash representations.
Since vectors are typically held in memory for optimal performance, reducing the size of a large vector can greatly reduce the overall memory requirements for running Weaviate with larger datasets.
Up to 32x memory reduction
The above example shows a reduction factor of 32x, which is based on the original research paper. Instead of searching on the full vectors, we search on a hash representing each vector. In this example, a hash is built from the sign of each dimension of the original vector. Each dimension would previously occupy a float32 variable, which is 4 bytes (32 bits) large. Since the sign can only ever be positive or negative, we only need two possible options, i.e. a single bit per dimension. Thus, the memory footprint is reduced 32-fold.
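As an illustration of this hashing step (a minimal numpy sketch, not Weaviate's internal implementation), the following snippet packs the sign bits of a 768-dimensional float32 vector and compares the memory footprints:

```python
import numpy as np

# A 768-dimensional float32 vector, as produced by a typical
# Sentence-BERT-style model.
vec = np.random.randn(768).astype(np.float32)

# Sign-based hash: one bit per dimension (1 if positive, else 0),
# packed into bytes.
bits = (vec > 0).astype(np.uint8)
hashed = np.packbits(bits)

print(vec.nbytes)     # 3072 bytes (768 dims * 4 bytes)
print(hashed.nbytes)  # 96 bytes   (768 dims / 8) -> 32x smaller
```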
No accuracy loss thanks to two-step retrieval
Since only the sign is kept in the hash, two vectors with different magnitudes but the same sign on each dimension produce identical hashes. The original vectors could, however, have vastly different positions in the vector space. For example, take the two vectors [0.13, 0.97] and [0.87, 0.02]. Since all dimensions are positive, the hashes are identical. However, the cosine distance between the two vectors is non-zero (roughly 0.84).
To overcome this potential accuracy loss with BPR, the original paper suggests a two-step ranking process: the hashes are used to produce suitable candidates, and the original vectors are then used to put those candidates into the correct order. For example, to retrieve the top 50 objects, we would query 1,000 candidates and then re-rank them according to their cosine distance to the query vector (see the sketch after this list). Although this may seem like additional work, the process is still very efficient, because:
- Calculating the Hamming distances on the hashes is considerably cheaper than calculating the cosine distances, so the initial lookup is cheaper than a pure cosine-based search. (This holds true regardless of whether a vector index is used; more on this below.)
- The number of candidates is small enough (e.g. 1,000) that a brute-force re-ranking based on the original vectors requires a negligible amount of compute time.
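The following is a minimal sketch of the two-step process in plain numpy (not Weaviate's implementation): candidates are generated by Hamming distance on the packed hashes, then re-ranked by exact cosine similarity on the original vectors.

```python
import numpy as np

def sign_hash(vectors):
    # One bit per dimension, packed into bytes (see previous sketch).
    return np.packbits((vectors > 0).astype(np.uint8), axis=-1)

def two_step_search(query, vectors, hashes, n_candidates=1000, top_k=50):
    # Step 1: cheap Hamming-distance lookup on the packed hashes.
    # XOR marks differing bits; unpackbits + sum counts them.
    xor = np.bitwise_xor(hashes, sign_hash(query))
    hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)
    candidates = np.argsort(hamming)[:n_candidates]

    # Step 2: exact cosine re-ranking on the original float vectors,
    # but only for the small candidate set.
    cand_vecs = vectors[candidates]
    sims = cand_vecs @ query / (
        np.linalg.norm(cand_vecs, axis=-1) * np.linalg.norm(query)
    )
    return candidates[np.argsort(-sims)[:top_k]]

vectors = np.random.randn(100_000, 768).astype(np.float32)
hashes = sign_hash(vectors)
query = np.random.randn(768).astype(np.float32)
top_50 = two_step_search(query, vectors, hashes)
```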
BPR in Weaviate
BPR support in Weaviate is currently under development.
Weaviate's BPR integration aims to provide the best possible user experience. Since BPR can mostly be seen as an optimization, our view is to support native BPR integration which can be toggled on or off with a simple flag. BPR-supported Weaviate modules can advertise their BPR capabilities and default to BPR if desired. For users of Weaviate text2vec-* modules, using BPR within Weaviate will feel exactly the same. For users importing their own vectors, a simple flag will activate BPR mode if your custom model is compatible with BPR.
BPR-enabled Weaviate text2vec-modules
BPR makes use of specially trained neural models. Instead of optimizing for a cosine or dot-product-based loss, BPR-enabled models optimize for a combined cosine/dot and binary loss. As a result, only neural-network-based models (e.g. those used with text2vec-transformers) can be used with BPR. BPR will not be available with the text2vec-contextionary module.
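As a rough illustration of what such a combined objective can look like, here is a simplified PyTorch sketch (not the exact formulation from Yamada et al.): a ranking loss is applied both to the raw vectors and to a differentiable stand-in for their hashes, so that the model learns to produce good vectors and good hashes at the same time.

```python
import torch
import torch.nn.functional as F

def combined_bpr_style_loss(query, pos, neg, beta=1.0):
    # Continuous ranking loss on the raw vectors: the positive
    # passage should score higher (dot product) than the negative.
    cont = F.relu(1.0 - (query * pos).sum(-1) + (query * neg).sum(-1)).mean()

    # Binary ranking loss on "soft hashes": tanh() approximates the
    # non-differentiable sign() used at inference time, keeping
    # gradients flowing during training.
    q, p, n = torch.tanh(query), torch.tanh(pos), torch.tanh(neg)
    binary = F.relu(1.0 - (q * p).sum(-1) + (q * n).sum(-1)).mean()

    # beta balances vector quality against hash quality.
    return cont + beta * binary
```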
BPR-enabled out-of-the-box models
Long-term, we aim to provide BPR support for the most commonly used text2vec-transformers models (typically Sentence-BERT models). When launching the BPR feature, at least one popular general-purpose model will be supported.
BPR FAQ
What vector distances are used with BPR?
To generate candidates (step 1), the Hamming distance is used on the binary hashes. To re-rank the candidates and produce the final result set (step 2), the distance metric is the same as without BPR, typically cosine distance or dot product on the raw vector representations.
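Part of what makes step 1 so cheap is that the Hamming distance between two packed hashes reduces to an XOR followed by a population count on machine words, as this small sketch shows (plain Python, requires 3.10+ for int.bit_count):

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR marks the differing bits; bit_count() (popcount) counts
    # them. This is a handful of CPU instructions, versus hundreds
    # of float multiply-adds for a cosine distance.
    return (a ^ b).bit_count()

# Two 8-bit example hashes: they differ in two bit positions.
print(hamming_distance(0b10110100, 0b10010110))  # -> 2
```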
Do we still need a (HNSW or similar) vector index with BPR?
The benefits of an efficient Approximate Nearest Neighbor (ANN) vector index still apply to BPR. Since hashes are much more compact and calculating the Hamming distance is considerably cheaper than calculating the cosine distance, the dataset size at which brute-force search is still feasible is raised considerably. So for some cases, you may consider skipping vector indexing entirely and relying on a flat search. However, when optimizing either for low latency or high throughput, a vector index will produce considerably better results. We therefore expect BPR to be used with our existing indices (currently HNSW) in most cases.
Will my memory requirements go down by a factor of 32?
No. The 32x figure is the theoretical reduction of the vectors themselves. However, the vector index also requires memory, which grows with the dataset size. The great news is that since calculating the Hamming distance on the binary hashes is much faster, we can offload a lot of the computation to query time without sacrificing latency. As a result, we can build a "weaker" (binary) vector index which will consume less memory. As of now, we are still running experiments to figure out the final memory requirements with BPR. The requirements will be drastically lower, albeit not by a factor of 32.
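Back-of-the-envelope numbers for the vector portion alone (illustrative figures, not Weaviate benchmarks) show where the savings come from and why the index overhead keeps the total below 32x:

```python
# Illustrative figures only, not Weaviate benchmarks.
n_vectors = 100_000_000  # 100M objects
dims = 768

raw_bytes = n_vectors * dims * 4    # float32: 4 bytes per dimension
hash_bytes = n_vectors * dims // 8  # 1 bit per dimension

print(raw_bytes / 1e9)   # ~307 GB for the raw vectors
print(hash_bytes / 1e9)  # ~9.6 GB for the hashes
# The index structure (e.g. HNSW graph links) adds memory on top of
# either figure, which is why the total reduction stays below 32x.
```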
Which media types can be used with BPR?
There is no limitation on media types. As long as the models used can be (re-)trained to include a binary loss, any model can be used with BPR. We will roll out support for text-based models first, but other media types (e.g. images, via CLIP-like models) can also benefit from BPR support.
The name "binary passage retrieval" implies text passages, but the technique used is not in any way specific to text-based media.
Can BPR be used with custom vectors or only with Weaviate modules?
BPR support will be provided for all use cases, including your own vectors produced by custom models outside of Weaviate. However, in this case, it is your responsibility to make sure that the model provides vectors which can be hashed without too much accuracy loss. Typically this is done by including the binary loss as part of the loss function during the training process.
Can I use a non-BPR-specific model in BPR mode?
From a purely technical perspective, there is no impediment to using any model with BPR. Since the hashing function is very simple, any existing vector can be hashed. However, it is unlikely that your hashes will hold a lot of semantic meaning if the model has not been specifically trained to produce good hashes alongside good vectors. Typically this is done by including the binary loss as part of the loss function during the training process.