Strategies to improve search results

Overview

In addition to selecting the right search types, there are also strategies you can employ to improve the quality of your search results.

Let's explore some of these strategies.

The key to improving vector search is to make sure that each object's vector representation captures the aspects of the object that matter for your search.

Vectorizer selection

Unless you are inserting data with your own vectors, you will be using a Weaviate vectorizer module, and a model within that module, to generate vectors for your data.

The choice of vectorizer module and model is important, as it will determine what aspects of the data are captured in the vector representation, and how well the model is able to "understand" the data.

First and foremost, you should select a vectorizer module that suits your data type. For example, if you are working with text data, you should use one of the text2vec modules, and if you are working with image or multi-modal data, you should likely use one of the multi2vec modules.

We will cover vectorizer selection in another unit. But, if you are not sure where to start, try:

  • text2vec-cohere, or text2vec-openai for text data (API-based)
    • Cohere offers a multi-lingual model that can be used with over 100 languages.
  • multi2vec-clip for image or image and text data.

If you are working with text and prefer to run a local inference container, try text2vec-transformers, with a popular model such as sentence-transformers/all-MiniLM-L12-v2.

Try a re-ranker

Re-ranker modules are a great way to improve the quality of your search results.

A re-ranker module takes the results of a first-stage search, such as a vector search, and re-orders them based on additional criteria or a different model. This allows a higher-quality (but slower) model to be applied to a small set of candidates, while still benefiting from the fast first-stage search.

For example, you can use the text2vec-cohere module to perform a vector search, and then use the reranker-cohere module to re-rank the results using a different model.
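
The two-stage idea can be sketched in plain Python. This is a toy illustration only, not how the Weaviate modules work internally: `first_stage`, `rerank`, and `phrase_score` are hypothetical helpers, the word-overlap score stands in for a fast vector search, and the phrase check stands in for a slower, higher-quality re-ranker model.

```python
def first_stage(query: str, docs: list[str], limit: int) -> list[str]:
    """Fast, rough retrieval: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:limit]


def phrase_score(query: str, doc: str) -> float:
    """Stand-in for a slower model: reward documents containing the exact phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


def rerank(query: str, candidates: list[str], score_fn) -> list[str]:
    """Re-order only the first-stage candidates using the second scorer."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)


docs = [
    "the bank of the river was muddy",
    "river bank erosion control methods",
    "open a bank account online",
    "mountain hiking trails",
]
query = "river bank"

candidates = first_stage(query, docs, limit=3)  # cheap pass over all documents
reranked = rerank(query, candidates, phrase_score)  # costly pass over 3 candidates
print(reranked[0])
```

The second scorer only ever sees the candidates returned by the first stage, which is why a slower model remains affordable.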

Property selection

Vectorization captures the "meaning" of the object. Accordingly, if a property is not relevant to the criteria to be applied for search, it should be excluded from the vectorization process.

As an example, if a product object includes metadata such as its manufacturing process or location, and the vector search is intended to be based on the product's features, then the properties for manufacturing process and location should be excluded from the vectorization process.

You can do this by setting skip_vectorization on a property, as shown below. Similarly, you can control whether the collection name and each property name are included in the vectorization.

import weaviate.classes as wvc

# Assumes `client` is an already-connected Weaviate client
products = client.collections.create(
    name="Product",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        vectorize_collection_name=True  # Include the collection name
    ),
    properties=[
        wvc.config.Property(
            name="name",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=True,  # Include the property name
        ),
        wvc.config.Property(
            name="description",
            data_type=wvc.config.DataType.TEXT,
        ),
        wvc.config.Property(
            name="manufacturing_process",
            data_type=wvc.config.DataType.TEXT,
            skip_vectorization=True,  # Skip unwanted property
        ),
    ],
)
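
To see why these settings matter, it can help to picture the text that ends up being embedded. The sketch below is a deliberate simplification and not Weaviate's actual implementation: `text_to_vectorize` is a hypothetical helper, the product data is made up for illustration, and the way names and values are joined is an assumption.

```python
def text_to_vectorize(
    collection_name: str,
    obj: dict[str, str],
    skip: frozenset[str] = frozenset(),
    with_property_name: frozenset[str] = frozenset(),
    vectorize_collection_name: bool = True,
) -> str:
    """Build the string that a text vectorizer would receive (simplified)."""
    parts = []
    if vectorize_collection_name:
        parts.append(collection_name)
    for prop_name, value in obj.items():
        if prop_name in skip:
            continue  # skipped properties never reach the vectorizer
        if prop_name in with_property_name:
            parts.append(prop_name)
        parts.append(str(value))
    return " ".join(parts)


# Hypothetical product object, for illustration only
product = {
    "name": "Trail Runner 2",
    "description": "Lightweight waterproof running shoe",
    "manufacturing_process": "Injection molded in Plant 7",
}

embedded_text = text_to_vectorize(
    "Product",
    product,
    skip=frozenset({"manufacturing_process"}),
    with_property_name=frozenset({"name"}),
)
print(embedded_text)
```

With the manufacturing details skipped, a query about product features is compared against a vector built from feature-related text only.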

Chunking

Chunking refers to the process of splitting a text into smaller chunks, and vectorizing each chunk separately. This is very important, as it defines how much information each vector contains.

As a rule of thumb, the more granular the search needs, the smaller the chunk size should be. For example, if you are searching for specific concepts and ideas, you should chunk data into smaller units such as sentences or small windows of text. Alternatively, if you are searching for broader concepts, such as finding relevant chapters or books, you might chunk text accordingly.

Read more about it in the chunking unit of Weaviate Academy.
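
As a minimal sketch of one common approach, the function below splits text into fixed-size windows of words with an optional overlap. The `chunk_words` helper and its parameters are hypothetical names chosen for this example. Other strategies, such as splitting on sentences or paragraphs, may suit your data better.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words, so that context is not lost
    at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and less than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"

small_chunks = chunk_words(text, chunk_size=4, overlap=1)  # granular search
large_chunks = chunk_words(text, chunk_size=10)  # broader search

for chunk in small_chunks:
    print(chunk)
```

Each chunk would then be stored and vectorized as its own object, so a smaller `chunk_size` yields more vectors that each cover less text.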

Tokenization

Although we refer to BM25 search as a "keyword" search, in reality the exact matches are for "tokens" rather than words. This tokenization is separate from the one used to generate vector embeddings. It is used to build the inverted index for BM25 searches and filtering.

Accordingly, the tokenization process is very important, as it determines what tokens are used for matching.

The available options are: word, lowercase, whitespace, field, trigram, and gse. The default (word) might be sufficient for prose, but for text where exact matches including case and symbols are important, something like whitespace might be more appropriate.

Available tokenization options:

| Tokenization method | Explanation | Indexed tokens for "Hello, (beautiful) world" |
|---|---|---|
| word (default) | Keep only alpha-numeric characters, lowercase them, and split by whitespace. | "hello", "beautiful", "world" |
| lowercase | Lowercase the entire text and split on whitespace. | "hello,", "(beautiful)", "world" |
| whitespace | Split the text on whitespace. Searches/filters become case-sensitive. | "Hello,", "(beautiful)", "world" |
| field | Index the whole field after trimming whitespace characters. | "Hello, (beautiful) world" |
| trigram | Split the property as rolling trigrams. | "Hel", "ell", "llo", "lo,", ... |
| gse | Use the gse tokenizer to split the property. | See gse docs |
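
The first four methods can be approximated in a few lines of Python to show how the choice affects matching. This is a rough sketch based on the descriptions above, not Weaviate's own tokenizer: the `tokenize` helper is hypothetical, it handles ASCII text only, and it leaves out trigram and gse.

```python
import re


def tokenize(text: str, method: str = "word") -> list[str]:
    """Approximate four of the tokenization methods described above."""
    if method == "word":
        # Lowercase, then keep only runs of alpha-numeric characters
        return re.findall(r"[a-z0-9]+", text.lower())
    if method == "lowercase":
        return text.lower().split()
    if method == "whitespace":
        return text.split()
    if method == "field":
        return [text.strip()]
    raise ValueError(f"unsupported method: {method}")


text = "Hello, (beautiful) world"

for method in ("word", "lowercase", "whitespace", "field"):
    print(method, tokenize(text, method))

# A query token only matches if it is identical to an indexed token
matches_word = "hello" in tokenize(text, "word")
matches_whitespace = "hello" in tokenize(text, "whitespace")
```

Here the query token `hello` matches under word tokenization but not under whitespace tokenization, where the indexed token is `Hello,` with its capital letter and trailing comma.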

You can set tokenization in the collection configuration.

things = client.collections.create(
    name="SomeCollection",
    properties=[
        wvc.config.Property(
            name="name",
            data_type=wvc.config.DataType.TEXT,
            tokenization=wvc.config.Tokenization.WORD  # Default
        ),
        wvc.config.Property(
            name="description",
            data_type=wvc.config.DataType.TEXT,
            tokenization=wvc.config.Tokenization.WHITESPACE  # Will keep case & special characters
        ),
        wvc.config.Property(
            name="email",
            data_type=wvc.config.DataType.TEXT,
            tokenization=wvc.config.Tokenization.FIELD  # Do not tokenize at all
        ),
    ]
)

Select and boost properties

If you observe that matches in some properties are having too much of an impact, you can exclude them from the search, and/or boost the importance of certain properties.

For example, matches in the description property might be more important than matches in the notes property. You can specify this at query time with query_properties, where only the listed properties are searched and a ^ followed by a number boosts a property.

questions = client.collections.get("JeopardyQuestion")

response = questions.query.bm25(
    "animal",
    limit=5,
    query_properties=["question^3", "answer"]  # Boost the impact of "question" property by 3
)

for o in response.objects:
    print(o.properties)
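
The effect of a boost can be illustrated with a simplified weighted sum. This is only a sketch of the idea and an assumption about how to model it: Weaviate applies the weights inside its BM25 scoring rather than as a plain sum, and `parse_query_properties`, `weighted_score`, and the per-property scores below are hypothetical.

```python
def parse_query_properties(specs: list[str]) -> dict[str, float]:
    """Turn entries such as "question^3" into {"question": 3.0}."""
    weights = {}
    for spec in specs:
        name, separator, boost = spec.partition("^")
        weights[name] = float(boost) if separator else 1.0
    return weights


def weighted_score(per_property_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Sum the scores of listed properties, scaled by their weights.

    Properties that are not listed in `weights` contribute nothing.
    """
    return sum(
        weight * per_property_scores.get(name, 0.0)
        for name, weight in weights.items()
    )


weights = parse_query_properties(["question^3", "answer"])
unboosted = parse_query_properties(["question", "answer"])

# Made-up keyword scores for two objects, per property
object_a = {"question": 1.0}  # matches in "question" only
object_b = {"answer": 2.0}  # stronger match, but in "answer" only

print(weighted_score(object_a, unboosted), weighted_score(object_b, unboosted))
print(weighted_score(object_a, weights), weighted_score(object_b, weights))
```

Without the boost, the second object scores higher. With "question" boosted by 3, the first object overtakes it.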

Alpha

The alpha parameter determines the balance between the vector and keyword search results in a hybrid search. It ranges from 0 (pure keyword search) to 1 (pure vector search).

If you want to configure your search to be more vector-based, you can increase the alpha value. Conversely, if you want to configure your search to be more keyword-based, you can decrease the alpha value.

questions = client.collections.get("JeopardyQuestion")

response = questions.query.hybrid(
    "imaging",
    alpha=0.1,  # Mostly a keyword search (try alpha=0.9 for a mostly vector search)
    limit=5
)

for o in response.objects:
    print(o.properties)
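
One way to picture the weighting is as a blend of two scores per object. The sketch below assumes both scores are already normalized to the range 0 to 1. The `hybrid_score` helper and the example scores are hypothetical, and the exact internal computation depends on the fusion algorithm in use.

```python
def hybrid_score(vector_score: float, keyword_score: float, alpha: float) -> float:
    """Blend two normalized scores: alpha weights the vector side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * keyword_score


# Made-up (vector_score, keyword_score) pairs for two objects
scores = {
    "a": (0.9, 0.2),  # semantically close, few keyword matches
    "b": (0.3, 0.8),  # strong keyword matches, semantically further away
}


def rank(alpha: float) -> list[str]:
    """Order the objects by their blended score for a given alpha."""
    return sorted(
        scores,
        key=lambda name: hybrid_score(*scores[name], alpha),
        reverse=True,
    )


print(rank(0.1))  # keyword-heavy weighting
print(rank(0.9))  # vector-heavy weighting
```

With a low alpha the keyword-heavy object ranks first, and with a high alpha the order flips.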

Fusion algorithm

The fusion algorithm determines how the results from the vector and keyword searches are combined.

In the "ranked fusion" algorithm, each object is scored by summing the inverse of its rank in each result set. In the "relative score fusion" algorithm, the scores in each result set are normalized and then summed. Ranked fusion was the default in earlier Weaviate versions, and relative score fusion has been the default since v1.24.

Generally, we have found that the "relative score fusion" algorithm works better, but you should try both to see which works best for your use case.

questions = client.collections.get("JeopardyQuestion")

response = questions.query.hybrid(
    "imaging",
    fusion_type=wvc.query.HybridFusion.RELATIVE_SCORE,  # Use relative score fusion
    limit=5
)

for o in response.objects:
    print(o.properties)
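
The difference between the two algorithms can be sketched as follows. This is an illustration under stated assumptions, not Weaviate's implementation: both result sets are weighted equally, the constant 60 in the rank-based score is a common choice for reciprocal rank fusion and is assumed here, min-max scaling is assumed for normalization, and the scores are made up.

```python
def ranked_fusion(score_sets: list[dict[str, float]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) per result set; score magnitudes are ignored."""
    fused: dict[str, float] = {}
    for scores in score_sets:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, obj_id in enumerate(ordered):
            fused[obj_id] = fused.get(obj_id, 0.0) + 1.0 / (k + rank)
    return fused


def relative_score_fusion(score_sets: list[dict[str, float]]) -> dict[str, float]:
    """Min-max normalize each result set to the range 0 to 1, then sum."""
    fused: dict[str, float] = {}
    for scores in score_sets:
        if not scores:
            continue
        low, high = min(scores.values()), max(scores.values())
        span = high - low
        for obj_id, score in scores.items():
            normalized = (score - low) / span if span else 1.0
            fused[obj_id] = fused.get(obj_id, 0.0) + normalized
    return fused


# "y" is a close second in both searches; "x" and "z" each win one search
# but come last in the other.
vector_scores = {"x": 0.90, "y": 0.89, "z": 0.10}
keyword_scores = {"z": 5.0, "y": 4.9, "x": 1.0}

ranked = ranked_fusion([vector_scores, keyword_scores])
relative = relative_score_fusion([vector_scores, keyword_scores])

print(ranked)
print(relative)
```

In this example, ranked fusion places "y" last because it only sees positions, while relative score fusion places "y" first because it keeps the information that "y" was nearly tied for the top in both searches.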