Identify needs & compile candidates
Identify needs
A systematic approach to model selection starts with clearly identifying requirements. Organizing these requirements into categories can help ensure you consider all relevant factors when evaluating embedding models.
Here are some of our key considerations:

Data Characteristics
Factor | Key Questions | Why It Matters |
---|---|---|
Modality | Are you dealing with text, images, audio, or multimodal data? | Models are built for specific modality/modalities. |
Language | Which languages must be supported? | Models are trained & optimized for specific language(s), leading to trade-offs in performance. |
Domain | Is your data general or domain-specific (legal, medical, technical)? | Domain-specific models (e.g. medical) understand specialized vocabulary and concepts. |
Length | What's the typical length of your documents and queries? | Maximum input lengths vary between models, from as few as 256 tokens to 8,192 tokens or more. However, longer inputs come at a cost: compute and latency grow steeply (roughly quadratically for transformer self-attention) with input length. |
Asymmetry | Will your queries differ significantly from your documents? | Some models are built for asymmetric query-to-document comparisons, so a short query like "laptop won't turn on" can still match a document like "Troubleshooting Power Issues: If your device fails to boot...". See the sketch below this table. |
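As a brief illustration, here's a minimal sketch of asymmetric encoding, assuming an E5-style model that expects `query: ` and `passage: ` prefixes (the exact convention varies by model, so check the model card):

```python
# Minimal sketch: asymmetric query-to-document retrieval.
# Assumes an E5-style model that expects "query: " / "passage: " prefixes;
# other models use different conventions (or dedicated query/document encoders).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

query = "query: laptop won't turn on"
document = (
    "passage: Troubleshooting Power Issues: If your device fails to boot, "
    "check the power adapter and battery connections first."
)

q_emb = model.encode(query, normalize_embeddings=True)
d_emb = model.encode(document, normalize_embeddings=True)

print(util.cos_sim(q_emb, d_emb))  # high similarity despite very different phrasing
```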
Performance Needs
Factor | Key Questions | Why It Matters |
---|---|---|
Accuracy (recall) | How critical is it that all relevant results are retrieved? | Higher accuracy requirements may justify more expensive or resource-intensive models. |
Latency | How quickly must queries be processed? | Larger models with better performance often have slower inference times. For hosted inference services, lower latency typically costs more. |
Throughput | What query volume do you anticipate? Will there be traffic spikes? | Larger models with better performance often have lower processing capacity. For inference services, increased throughput will increase costs. |
Volume | How many documents will you process? | Larger embedding dimensions increase memory requirements for your vector store, which drives resource requirements and costs at scale (see the sketch below this table). |
Task type | Is retrieval the only use case? Or will it also involve others (e.g. clustering or classification) ? | Models have strengths and weaknesses; a model excellent at retrieval might not excel at clustering. This will drive your evaluation & selection criteria. |
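To make the volume factor concrete, here's a rough, back-of-the-envelope sketch for estimating raw vector storage; the corpus size and dimensionality below are illustrative assumptions:

```python
# Back-of-the-envelope sketch: raw vector storage for a corpus.
# The numbers (10M chunks, 1024-dim float32 vectors) are illustrative assumptions;
# real vector indexes (e.g. HNSW) add further memory overhead.
def vector_store_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    return num_vectors * dims * bytes_per_value

size_gb = vector_store_bytes(num_vectors=10_000_000, dims=1024) / 1024**3
print(f"~{size_gb:.1f} GB of raw float32 vectors")  # ~38.1 GB before index overhead
```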
Operational Factors
Factor | Key Questions | Why It Matters |
---|---|---|
Hardware limitations | What computational resources are available for hosting & inference? | Hardware availability (costs, GPU/TPU availability) will significantly affect your range of choices. |
API rate limits | If using a hosted model, what are the provider's limits? | Rate limits can bottleneck applications, or limit potential growth. |
Deployment & maintenance | What technical expertise and resources are required? | Self-hosting demands in-house expertise for deployment, monitoring and updates; API-based hosted options shift that burden to the provider. |
Business Requirements
Factor | Key Questions | Why It Matters |
---|---|---|
Hosting options | Do you need self-hosting capabilities, or is a cloud API acceptable? | Self-hosting ➡️ more control but higher operational complexity; APIs ➡️ lower friction but greater dependency on the provider. |
Licensing | What are the licensing restrictions for commercial applications? | Some model licenses or restrictions may prohibit certain use cases. |
Long-term support | What guarantees exist for the model's continued availability? | If a model or business is abandoned, downstream applications may need significant reworking. |
Budget | What are your cost limits, and do you prefer upfront or ongoing costs? | API-based embedding costs add up with usage, while self-hosting incurs higher upfront costs. |
Privacy & Compliance | Are there data privacy requirements or industry regulations to consider? | Some industries restrict which models or providers may be used, and privacy requirements may dictate where and how models are hosted. |
Documenting these requirements creates a clear profile of your ideal embedding model, which will guide your selection process and help you make informed trade-offs.
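One lightweight way to document them is as a simple structured profile. The example below is purely hypothetical; the field names and values are illustrative, not a required schema:

```python
# Hypothetical requirements profile; every field name and value here is illustrative.
requirements = {
    "data": {
        "modality": "text",
        "languages": ["en", "de"],
        "domain": "legal",
        "max_doc_tokens": 2048,
        "asymmetric_queries": True,
    },
    "performance": {
        "recall_at_10": ">= 0.85",
        "p95_latency_ms": "<= 100",
        "corpus_size": 5_000_000,
    },
    "operational": {"self_hosting": "preferred", "gpu_available": True},
    "business": {
        "license": "commercial use allowed",
        "monthly_budget_usd": 500,
        "data_residency": "EU",
    },
}
```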
Compile candidate models
After identifying your needs, create a list of potential embedding models to evaluate. This process helps focus your detailed evaluation on the most promising candidates.
There are hundreds of embedding models available today, with new ones being released regularly. With so many models, even a simple screening of each one would be too time-consuming.
As a result, we suggest identifying an initial list of models with a simple set of heuristics, such as these:
Account for model modality
This is a critical, first-step filter. A model can only support the modality/modalities that it is designed and trained for.
Some models (e.g. Cohere's `embed-english-v3.0`) are multimodal, while others (e.g. Snowflake's `snowflake-arctic-embed-l-v2.0`) are unimodal.
No matter how good a model is, a text-only model such as `snowflake-arctic-embed-l-v2.0` will not be able to perform image retrieval. Similarly, a `ColQwen` model cannot be used for plain-text retrieval.
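To illustrate the difference, here's a minimal sketch using the multimodal `clip-ViT-B-32` checkpoint available through sentence-transformers; the image filename is a hypothetical local file, and a text-only model simply has no equivalent image pathway:

```python
# Minimal sketch: a multimodal (CLIP) model embeds text and images into the same space.
# "broken_laptop.jpg" is a hypothetical local file used only for illustration.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")

text_emb = clip.encode("a photo of a laptop that won't turn on")
image_emb = clip.encode(Image.open("broken_laptop.jpg"))

print(util.cos_sim(text_emb, image_emb))  # text and image are directly comparable
```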
Favor models already available
If your organization already uses embedding models for other applications, these are great starting points. They have likely already been screened, evaluated and approved for use, with accounts and billing configured. For local models, the necessary infrastructure is already in place.
This also extends to models available through your other service providers.
You may already be using generative AI models through providers such as Cohere, Mistral or OpenAI, or you may work with hyperscaler partners such as AWS, Microsoft Azure or Google Cloud.
In many cases, these providers also offer embedding models, which will be easier to adopt than those from an entirely new vendor.
Try well-known models
Generally, well-known models are popular for a reason.
Industry leaders in AI such as Alibaba, Cohere, Google, NVIDIA and OpenAI all produce embedding models for different modalities, languages and sizes. Here are a few samples of their available model families:
Provider | Model families |
---|---|
Alibaba | `gte`, `Qwen` |
Cohere | `embed-english`, `embed-multilingual` |
Google | `gemini-embedding`, `text-embedding` |
NVIDIA | `NV-embed` |
OpenAI | `text-embedding`, `ada` |
There are also other families of models that you can consider.
For example, the `ColPali` family of models for image embeddings and the `CLIP` / `SigLIP` family of models for multimodal (image and text) embeddings are well-known in their respective domains. And `nomic`, `snowflake-arctic`, `MiniLM` and `bge` models are some examples of well-known text retrieval models.
These popular models tend to be well-documented, widely discussed and broadly supported.
As a result, they tend to be easier than more obscure models to use, evaluate and troubleshoot.
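For instance, a widely used open model such as `all-MiniLM-L6-v2` (from the MiniLM family above) can be loaded and queried in a few lines with sentence-transformers; the snippet below is just a sketch of how low the barrier tends to be:

```python
# Sketch: well-known open models are typically a pip install and a few lines away.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "Troubleshooting Power Issues: If your device fails to boot...",
    "How to reset your email password",
]
query = "laptop won't turn on"

doc_embs = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

print(util.cos_sim(query_emb, doc_embs))  # the power-issues doc should score higher
```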
Benchmark leaders
Models that perform well on standard benchmarks may be worth considering. Resources like MTEB Leaderboard can help identify high-performing models.
As an example, the screenshot below shows models on the MTEB leaderboard with fewer than 1 billion parameters, sorted by their `retrieval` performance.

It shows some models that we've already discussed, such as the `snowflake-arctic`, Alibaba's `gte`, or BAAI's `bge` models.
But you can also already see a number of high-performing models that we hadn't discussed. Microsoft Research's `intfloat/multilingual-e5-large-instruct` and JinaAI's `jinaai/jina-embeddings-v3` models are both easily discoverable here.
Note that as of 2025, MTEB contains different benchmarks to assess different capabilities, such as specific languages or modalities.
When viewing benchmarks, make sure to view the right set of benchmarks, and the appropriate columns. In the example below, note that the page shows results for MIEB (image retrieval), with results sorted by Any to Any Retrieval.

The MTEB leaderboard is filterable and sortable by various metrics, so you can arrange it to suit your needs and add models to your list as you see fit.
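And once a leaderboard model makes your shortlist, you can sanity-check its reported numbers by re-running an individual task locally with the `mteb` package. The sketch below assumes `mteb` and `sentence-transformers` are installed, and uses one small retrieval task purely as an example:

```python
# Sketch: re-run a single small MTEB retrieval task for a shortlisted model.
# Assumes `pip install mteb sentence-transformers`; the task choice is illustrative.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

tasks = mteb.get_tasks(tasks=["NFCorpus"])  # one small BEIR-style retrieval task
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")
```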
You should be able to compile a manageable list of models relatively quickly using these techniques. This list can then be manually reviewed for detailed screening.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.