
More efficient multi-vector embeddings with MUVERA

Roberto Esposito · 16 min read

Weaviate 1.31 implements the MUVERA encoding algorithm for multi-vector embeddings. In this blog, we dive into the algorithm in detail, covering what MUVERA is, how it works, and whether it might make sense for you.

Let's start by reviewing what multi-vector models are, and the challenges that MUVERA looks to solve.

Quick Takeaways
  • MUVERA converts multi-vector embeddings (like ColBERT/ColPali) into single fixed-size vectors, dramatically reducing memory and computational costs
  • On our tests (LoTTE dataset):
    • Memory footprint reduced by ~70%
    • Import times improved from 20+ minutes to 3-6 minutes
  • Key tradeoffs:
    • Some loss in recall quality (can be mitigated by increasing HNSW ef values)
    • Higher ef values reduce query throughput
  • Best suited for:
    • Large-scale deployments where memory costs are significant
    • Use cases that can tolerate slight recall degradation
    • Applications requiring faster indexing speeds
  • Available in Weaviate 1.31+ with simple configuration options

Challenges with multi-vector embeddings

State-of-the-art multi-vector models can dramatically improve retrieval performance by capturing more semantic information than single-vector models. ColBERT models preserve token-level meanings in text, while ColPali/ColQwen models identify and preserve information from different parts of an image, like figures in PDFs as well as textual information.

Single vector to multi-vector comparison

These advantages make multi-vector models a great fit for many use cases. However, multi-vector embeddings carry two potential disadvantages over their single-vector cousins, owing to their size and relative complexity.

Challenge 1: Memory footprint

Multi-vector embeddings comprise multiple vectors, each one representing a part of the object, such as a token (text) or a patch (image). Although each vector in a multi-vector embedding is smaller, the whole embedding tends to be larger than a typical single-vector embedding.

Multi-vector embeddings memory comparison

This can lead to a higher memory footprint in use, as many vector search systems use in-memory indexes such as HNSW.

How much larger? Well, as you can see from the image above, the memory footprint of a multi-vector index will be larger than that of a single-vector index by roughly average_vectors_per_embedding / (ratio_of_vector_lengths). As multi-vector embeddings can comprise hundreds to thousands of vectors per document, this can be much larger.

If we embed a million documents of ~100 tokens per document, a single-vector embedding model (producing 768-dimensional vectors of single-precision, 32-bit, floating point numbers) may require 768 * 1M * 4 bytes = ~3.1GB of memory. On the other hand, a multi-vector embedding model (producing 96-dimensional token vectors) may require a whopping 96 * 100 * 1M * 4 bytes = ~40GB!
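
If you'd like to sanity-check that arithmetic, here it is as a short Python sketch (the document counts and dimensionalities are just the illustrative figures from above):

```python
# Rough memory estimate for 1M documents stored as float32 (4 bytes per value).
n_docs = 1_000_000
bytes_per_float = 4

# Single-vector model: one 768-dimensional vector per document.
single_vector_bytes = 768 * n_docs * bytes_per_float
print(f"single-vector: {single_vector_bytes / 1e9:.1f} GB")  # ~3.1 GB

# Multi-vector model: ~100 token vectors of 96 dimensions per document.
multi_vector_bytes = 96 * 100 * n_docs * bytes_per_float
print(f"multi-vector:  {multi_vector_bytes / 1e9:.1f} GB")   # ~38 GB, i.e. the ~40 GB ballpark above
```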

Single vs multi-vector memory usage

And of course, higher memory use also means higher costs, whether it is your own hardware or via cloud infrastructure such as Weaviate Cloud.

Challenge 2: Speed

A system using multi-vector embeddings can also suffer from slower import and search speeds due to the increased size and complexity.

Vector search involves finding the most relevant embedding(s) from a sea of embeddings. HNSW speeds this up by building a multi-layered graph of vectors, making it faster to traverse to the right region of the vector space and retrieve the most relevant ones.

Multi-vector embeddings introduce additional complexities to this. Multiple vectors must be indexed into the graph per object at ingestion time, and at query time more comparisons are required to retrieve vectors and to calculate their overall similarity.

Given a document embedding D and a query embedding Q, a maxSim operator is typically used to compute the similarity:

$$sim(D,Q)=\sum_{q \in Q} \max_{d \in D} q \cdot d$$

This is a non-linear operation which loops over each query token and computes the similarity between that query token and all the document tokens, choosing the most similar one. In other words, MaxSim searches for the "best match" for each query term among all the document terms.
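
To make this concrete, here is a minimal NumPy sketch of the MaxSim computation between a query embedding Q and a document embedding D, each represented simply as an array of token vectors (the shapes are made up for illustration):

```python
import numpy as np

def max_sim(Q: np.ndarray, D: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching document token,
    then sum those best matches over all query tokens."""
    scores = Q @ D.T                          # all pairwise dot products, shape (|Q|, |D|)
    return float(scores.max(axis=1).sum())    # best match per query token, summed

# Example with random token embeddings: 32 query tokens, 100 document tokens, 96 dimensions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(32, 96))
D = rng.normal(size=(100, 96))
print(max_sim(Q, D))
```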

While an elegant method, this calculation is another overhead that is unique to multi-vector embeddings.

To improve the performance of multi-vector embeddings, Google's research group proposed MUVERA (Multi-Vector Retrieval via Fixed Dimensional Encodings), which aims to reduce the memory required to store multi-vector embeddings by using fixed dimensional encodings.

MUVERA to the rescue

MUVERA encodes multi-vector embeddings into single vector embeddings by building fixed dimensional encodings (FDE) whose length is independent of the number of vectors in the original multi-vector embedding. This reduces the size of the embedding, and improves efficiency of calculations.

But how does MUVERA do this while minimizing losses in recall?

The key idea

MUVERA takes as input a multi-vector embedding ($D$) and transforms it into a single-vector embedding ($d_{single}$):

$$encode(x_{multi}) \implies x_{single}, \quad \quad x \in \{D, Q\}$$

In order to say whether this function is "good enough", we want the following two measures to be as close as possible:

  • the similarity of the encoded (single-vector) embeddings $x_{single}$, and
  • the similarity of the original multi-vector embeddings $x_{multi}$

The more similar these two are, the better $x_{single}$ is as an approximation of $x_{multi}$. Ideally, we want:

$$maxSim(D,Q)\approx d_{single} \cdot q_{single}$$
$$\text{where} \quad d_{single} = encode(D) \quad \land \quad q_{single}=encode(Q)$$

This transform reduces the approximate nearest neighbor search problem for multi-vector embeddings to one involving only single-vector embeddings.

MUVERA high level overview

Recall the above scenario with a million documents and 100 vectors per document. Without the MUVERA encoding we would have 100M vectors to index. Using the FDE vectors instead, we work with only 1M vectors, leading to an HNSW graph with only 1% of the nodes!

So how does MUVERA achieve this goal?

How MUVERA works

Recall that the main goal here is to build an FDE $d_{single}$ which approximates the semantics of the multi-vector embedding $D$.

MUVERA does this in four steps:

  1. Space partitioning
  2. Dimensionality reduction
  3. Multiple repetitions
  4. Final projection

Space partitioning

The first step consists of partitioning the vector space into $B$ "buckets". This can be achieved using k-means clustering or using locality sensitive hashing (LSH) functions.

Ideally, we want a partitioning function $\varphi$ that, given an input vector $x$, returns a bucket id $k=\varphi(x)$ where $k=1,\dots,B$. Here is an illustration, assuming a set of 8 clusters:

MUVERA steps 1 - space partitioning

MUVERA steps 2 - fill empty clusters

As shown in the figure, each vector is assigned to one of the clusters. Two further calculations are done - one to derive a representative (e.g. average or centroid) sub-vector, and another to fill empty clusters.

Which function to use for space partitioning?

As we said, a function is needed to map from $\mathbb{R}^{dim}$ into $[B]$.

The first approach could be to use a clustering algorithm like K-Means, which clusters data by assigning points to the nearest centroid. This algorithm is widely adopted, but it has two drawbacks:

  • It requires some data to train on
  • It's data-dependent: its quality depends on the data distribution, which may not be known a priori and may drift over time (e.g. over the life of the downstream application)

To make the partition function data-oblivious, we can instead use a function from the family of Locality Sensitive Hashing (LSH) functions, in particular SimHash.

The SimHash algorithm relies on a few simple steps:

  1. Sample $k_{sim}$ (a parameter chosen by the user) Gaussian vectors $g_1,\dots,g_{k_{sim}} \in \mathbb{R}^{dim}$

  2. To compute the bucket for a vector $x$, compute the dot product between $x$ and each of the $k_{sim}$ Gaussian vectors. From these $k_{sim}$ dot products, set the $i$-th bit ($i=1,\dots,k_{sim}$) to 1 if the corresponding dot product is greater than 0, and to 0 otherwise.

    Mathematically speaking, we have:

    $$\varphi(x)=\left(1(g_1 \cdot x), \dots, 1(g_{k_{sim}} \cdot x)\right)$$
    $$\text{where} \quad 1(g_i \cdot x) = 1 \quad \text{if} \quad g_i \cdot x > 0, \quad \text{and} \quad 1(g_i \cdot x) = 0 \quad \text{otherwise}$$

    The value $\varphi(x)$ will be a number composed of $k_{sim}$ bits. The total number of buckets we can have is $B=2^{k_{sim}}$.

From this point on, we will use SimHash as the partitioning function in the description of the MUVERA algorithm.
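
Here is a minimal NumPy sketch of this SimHash partitioning step. The only ingredients are the $k_{sim}$ Gaussian vectors; the bucket id is simply the integer formed by the sign bits:

```python
import numpy as np

def simhash_partition(X: np.ndarray, gaussians: np.ndarray) -> np.ndarray:
    """Map each row of X to a bucket id in [0, 2**k_sim) using SimHash.

    X:         (n_tokens, dim) token embeddings
    gaussians: (k_sim, dim) sampled Gaussian vectors g_1, ..., g_{k_sim}
    """
    bits = (X @ gaussians.T > 0).astype(int)     # (n_tokens, k_sim) sign bits
    powers = 2 ** np.arange(gaussians.shape[0])  # read the bits as an integer
    return bits @ powers                         # one bucket id per token

rng = np.random.default_rng(0)
k_sim, dim = 4, 96
gaussians = rng.normal(size=(k_sim, dim))        # step 1: sample the Gaussian vectors

D = rng.normal(size=(100, dim))                  # 100 token embeddings for one document
print(simhash_partition(D, gaussians)[:10])      # bucket ids in [0, 16), since B = 2**4
```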

Now that the partitioning function is ready, we can create a sub-vector $d_k$ for each bucket $k=1,\dots,B$ as follows:

$$d_k = \frac{1}{|D \cap \varphi^{-1}(k)|} \sum_{d \in D,\ \varphi(d)=k} d$$

The vector $d_k$ contains the sum of all the token embeddings $d \in D$ that belong to cluster $k$ (i.e. $\varphi(d)=k$). In addition, we have a normalization factor which counts how many token embeddings from $D$ have been mapped to cluster $k$.

The space partitioning step is the same for the query encoding, except for the normalization factor:

$$q_k = \sum_{q \in Q,\ \varphi(q)=k} q$$

We create one sub-vector $d_k$ for each bucket, so we end up with $B$ sub-vectors $d_1,\dots,d_B$. To get a single vector $d_{single}$, we concatenate all the $d_k$ sub-vectors:

$$d_{single} = (d_1, d_2, \dots, d_B)$$

Let's now consider the dimensionality of the result. Assuming the dimensionality of the original token embeddings is $dim$, the length of $d_{single}$ will be $B \cdot dim$.

Where no vectors end up in a particular cluster, the nearest vector is assigned to it to calculate the sub-vector, so that no cluster ends up empty.
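
Continuing the sketch from above, the per-bucket sub-vectors can be built roughly like this. The document side averages within each bucket, the query side only sums, and empty document buckets borrow the nearest token vector; one reasonable choice for "nearest" here is Hamming distance between bucket codes:

```python
import numpy as np

def doc_subvectors(D: np.ndarray, buckets: np.ndarray, B: int) -> list:
    """One sub-vector d_k per bucket: the average of the token embeddings assigned to it.
    Empty buckets are filled with the token whose bucket code is closest in Hamming distance."""
    subs = []
    for k in range(B):
        mask = buckets == k
        if mask.any():
            subs.append(D[mask].mean(axis=0))
        else:
            hamming = np.array([bin(int(b) ^ k).count("1") for b in buckets])
            subs.append(D[hamming.argmin()])     # fill the empty bucket with the nearest token
    return subs

def query_subvectors(Q: np.ndarray, buckets: np.ndarray, B: int) -> list:
    """Query side: sum (not average) the token embeddings per bucket; empty buckets stay zero."""
    dim = Q.shape[1]
    return [Q[buckets == k].sum(axis=0) if (buckets == k).any() else np.zeros(dim)
            for k in range(B)]
```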

Dimensionality reduction

The next step reduces the dependency on the dimensionality of the token embeddings. A random linear projection is applied to make the resulting sub-vectors $d_k$ smaller, which helps to manage the length of the final encoded vector.

Given the parameter $d_{proj}$, we create a random matrix $S \in \mathbb{R}^{d_{proj} \times dim}$ with entries in $\{\pm 1\}$, which is used to reduce the dimensionality of each sub-vector.

For each $d_k$ we compute a matrix-vector multiplication:

$$\psi(d_k)=\frac{1}{\sqrt{d_{proj}}} \cdot S \cdot d_k$$

The resulting sub-vector will have length $d_{proj}$.

Now the FDE vector will become:

$$d_{single}=(\psi(d_1), \psi(d_2),\dots,\psi(d_B))$$

Notice the dimensionality will no longer be $B \cdot dim$, but $B \cdot d_{proj}$, where $d_{proj} \ll dim$.

MUVERA steps 3 - dimensionality reduction

Applying a random projection may seem... random! However, this is part of the secret sauce of making MUVERA work. The original authors chose the particular distribution of the random matrices following the Johnson-Lindenstrauss lemma, which preserves the dot products between vectors that we care about.
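
Sketching the projection step itself is short; all we need is a random ±1 matrix of shape (d_proj, dim):

```python
import numpy as np

def random_projection(d_k: np.ndarray, S: np.ndarray) -> np.ndarray:
    """psi(d_k) = (1 / sqrt(d_proj)) * S @ d_k, with S a random +/-1 matrix of shape (d_proj, dim)."""
    d_proj = S.shape[0]
    return (S @ d_k) / np.sqrt(d_proj)

rng = np.random.default_rng(0)
dim, d_proj = 96, 16
S = rng.choice([-1.0, 1.0], size=(d_proj, dim))   # entries drawn uniformly from {-1, +1}
d_k = rng.normal(size=dim)                        # one sub-vector from the partitioning step
print(random_projection(d_k, S).shape)            # (16,)
```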

Multiple Repetitions

To increase the accuracy of the approximation steps, we can repeat the process above (space partitioning + dimensionality reduction) multiple times, concatenating all the resulting vectors together.

Suppose we repeat the partitioning and dimensionality reduction steps $R_{reps}$ times, obtaining $R_{reps}$ different single vectors. We can then concatenate them into one single vector:

$$d_{single}=(d_{single_1}, d_{single_2}, \dots, d_{single_{R_{reps}}})$$

The resulting vector will have dimensionality $R_{reps} \cdot B \cdot d_{proj}$.
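
Putting the pieces together, here is a rough end-to-end sketch of the document-side encoding, re-using the helper functions from the earlier sketches. Each repetition samples its own Gaussian vectors and projection matrix:

```python
import numpy as np

def encode_document_fde(D: np.ndarray, k_sim: int = 4, d_proj: int = 16,
                        r_reps: int = 10, seed: int = 0) -> np.ndarray:
    """Encode a (n_tokens, dim) multi-vector embedding into one FDE of length r_reps * 2**k_sim * d_proj."""
    rng = np.random.default_rng(seed)
    dim, B = D.shape[1], 2 ** k_sim
    parts = []
    for _ in range(r_reps):
        gaussians = rng.normal(size=(k_sim, dim))        # fresh partition per repetition
        S = rng.choice([-1.0, 1.0], size=(d_proj, dim))  # fresh projection per repetition
        buckets = simhash_partition(D, gaussians)
        for d_k in doc_subvectors(D, buckets, B):
            parts.append(random_projection(d_k, S))
    return np.concatenate(parts)

D = np.random.default_rng(1).normal(size=(130, 128))     # ~130 token vectors of 128 dims, ColBERT-style
print(encode_document_fde(D).shape)                      # (2560,) = 10 * 16 * 16
```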

Final Projection

The vector obtained by concatenating all the single vectors can be quite long. To reduce its dimensionality, we can apply a final random projection, of the same kind described in step 2, to the overall vector, using a random matrix $S_{final} \in \mathbb{R}^{d_{final} \times (R_{reps} \cdot B \cdot d_{proj})}$ with entries in $\{\pm 1\}$, where $d_{final}$ is a parameter:

$$FDE(D)=\psi_{final}(d_{single})$$

This will reduce the final FDE dimensionality yet again.
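
For completeness, here is a sketch of that final projection; it is the same ±1 projection from step 2, just applied once more to the whole concatenated vector (as noted below, the Weaviate implementation skips this step):

```python
import numpy as np

def final_projection(d_single: np.ndarray, d_final: int, seed: int = 0) -> np.ndarray:
    """Project the concatenated FDE (length r_reps * B * d_proj) down to d_final dimensions."""
    rng = np.random.default_rng(seed)
    S_final = rng.choice([-1.0, 1.0], size=(d_final, d_single.shape[0]))
    return (S_final @ d_single) / np.sqrt(d_final)

fde = np.random.default_rng(1).normal(size=2560)   # e.g. the 2560-dimensional FDE from the sketch above
print(final_projection(fde, d_final=1024).shape)   # (1024,)
```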

MUVERA parameters

During the description of the algorithm we have seen four parameters:

  • $k_{sim}$: the number of Gaussian vectors sampled; the number of buckets will be $B=2^{k_{sim}}$
  • $d_{proj}$: the dimensionality of the sub-vectors representing a bucket in the space
  • $R_{reps}$: the number of times the partitioning and dimensionality reduction steps are executed
  • $d_{final}$: the final dimensionality of the FDE vector, $FDE(D)\in\mathbb{R}^{d_{final}}$

At the end of the day, these are the key tunable parameters from the user perspective. In terms of the Weaviate implementation - we've chosen what we see as sensible defaults for the majority of use cases.

But as always, you can tune these to suit your particular needs. Before you do that, though, here are some real-life results from our internal testing.

note

In the Weaviate implementation, we do not implement the final projection, as we found it to produce limited benefits compared to the added complexity. So the user-selectable parameters are k_sim, d_proj and r_reps.

Impact of MUVERA

So what are the effects of MUVERA? As in - what's in it for you?

To evaluate this, we used the LoTTE benchmark, in particular lotte-lifestyle which is made up of around 119k documents. Each document was encoded using colbertv2.0 which produced an average of 130 vectors per document.

This produces around 15M vectors of dimensionality 128, meaning roughly 1.9B floating point numbers stored, or around 8 GB of memory. That might not seem that extreme, until you realize that this is a small-ish dataset, and that's even before we add the HNSW graph itself!

On the other hand, if we enable MUVERA with the following parameters:

  • $k_{sim}=4$
  • $d_{proj}=16$
  • $R_{reps}=10$

We would encode each document using one vector of dimensionality $2560 = 2^4 \cdot 16 \cdot 10$, ending up with roughly 304M floating point numbers stored: an instant saving of over 80% in memory footprint.
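
Here is that arithmetic again as a quick sketch, using the approximate document and vector counts quoted above:

```python
# FDE length from the MUVERA parameters used in this test
k_sim, d_proj, r_reps = 4, 16, 10
fde_dim = (2 ** k_sim) * d_proj * r_reps                 # 16 * 16 * 10 = 2560

n_docs, vecs_per_doc, token_dim = 119_000, 130, 128
bytes_per_float = 4

baseline_floats = n_docs * vecs_per_doc * token_dim      # ~1.98B floats
muvera_floats = n_docs * fde_dim                         # ~305M floats

print(f"baseline: {baseline_floats * bytes_per_float / 1e9:.1f} GB of raw vectors")  # ~7.9 GB
print(f"MUVERA:   {muvera_floats * bytes_per_float / 1e9:.1f} GB of raw vectors")    # ~1.2 GB
```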

In addition, when enabling MUVERA we would have an HNSW (Hierarchical Navigable Small World) graph with 119k nodes, whereas without it, the graph would have 15M nodes, each node having many (e.g. 32-128) edges to other nodes. At a maxConnections value of 128, for example, this could save tens of gigabytes of memory.

Take a look at the example outputs below from our experimentation. We compare the following scenarios:

  • Raw multi-vector embeddings + SQ
  • MUVERA with scalar quantization (SQ)
Heap allocation without using MUVERA + SQ vs. MUVERA + SQ

Import time without using MUVERA + SQ vs. MUVERA + SQ

QPS without using MUVERA + SQ vs. MUVERA + SQ

Recall without using MUVERA + SQ vs. MUVERA + SQ

Test setup

The experiments have been run on the following machine:

  • CPU/GPU: AMD Ryzen 7 PRO 8700GE w/ Radeon 780M Graphics
  • RAM: 64GB

Clear benefits: Memory & ingestion speed

As expected, there is a significant reduction in memory footprint, going from a baseline of around 12 GB to under 1GB for MUVERA.

What does that mean in terms of dollar amounts? Well, the difference in compute costs at a hyperscaler for a server with 12 GB of memory and another with 1 GB would be many tens of thousands of dollars per year, if not hundreds of thousands.

If your dataset size ranges in the tens of millions or more, as many of our users' datasets do, this may already be a strong motivator for MUVERA.

Not only that, the time taken to import the data is hugely different here. By that, we refer to the time taken for the object data to be added to Weaviate and the vectors to be added to the HNSW index. Since adding a multi-vector embedding requires tens or hundreds of vectors to be added to the index, this adds significant overhead.

With MUVERA, this is reduced significantly once again. Adding the ~110k objects in the LoTTE dataset required over 20 minutes (or, only around 100 objects/s) in the baseline case, but it was down to around 3-6 minutes in the various MUVERA scenarios.

Again, think about working at scale: a ~3-hour job to import a million objects may or may not be viable for you. If not, that may be another strong incentive to consider MUVERA.

It should be noted that at the same ef value, enabling MUVERA increases query throughput as well, owing to only needing to deal with one vector per object rather than multiple. But - that's not quite an apples-to-apples comparison. And here's why.

Costs: Recall & query throughput

As we can see in the charts, the main drawback of MUVERA is a loss in recall. This downside may seem particularly challenging, since multi-vector models achieve such high recall values in the first place.

However, some of this can be mitigated through the HNSW search settings. As the graphs show, setting a higher ef value in the query settings (e.g. >512) can increase recall to 80%+, and over 90% at 2048.

Since a higher ef increases the size of the retrieved candidate set, it has the knock-on effect of decreasing query throughput (measured in queries per second, or QPS).

In other words - the main tradeoffs of enabling MUVERA are going to be the reduction in recall, and associated reduction in throughput by needing to use higher ef values.

Nothing's ever so simple, right? 😅 What's clear from these charts is that there is a definite case to be made for using MUVERA. The specifics, however, would very much depend on your priorities.

Wrap-up

MUVERA compared

In summary, MUVERA offers a compelling path forward for working with multi-vector models at scale.

By transforming multi-vector representations into fixed dimensional encodings, it delivers significant memory savings and improved query speeds while maintaining relatively strong retrieval quality.

Weaviate's implementation of MUVERA enables the ability to further compress these encodings with quantization for production deployments at scale, dramatically reducing the cost and overhead needed for multi-vector embeddings.

As always, using the Weaviate implementation (available from Weaviate 1.31 onwards) may be the easiest part of all this. You may be surprised to hear that just a couple of lines of code is all you need to enable MUVERA in a Weaviate collection.

If your use case could benefit from multi-vector embeddings and involves a non-trivial dataset size, MUVERA could be part of the solution for you. We encourage you to check it out.

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).
