Strategies to improve search results
Overview
In addition to selecting the right search types, there are also strategies you can employ to improve the quality of your search results.
Let's explore some of these strategies.
Improve vector search
The key to improving vector search is to make sure that the vector representation of each object is fit for purpose, so that it suits your search needs.
Vectorizer selection
Unless you are inserting data with your own vectors, you will be using a Weaviate vectorizer module, and a model within that module, to generate vectors for your data.
The choice of vectorizer module and model is important, as it will determine what aspects of the data are captured in the vector representation, and how well the model is able to "understand" the data.
First and foremost, you should select a vectorizer module that is best suited for your data type. For example, if you are working with text data, you should use the text2vec module, and if you are using image or multi-modal data, you should likely use the multi2vec module.
We will cover vectorizer selection in another unit. But, if you are not sure where to start, try:
- text2vec-cohere or text2vec-openai for text data (API-based). Cohere offers a multi-lingual model that can be used with over 100 languages.
- multi2vec-clip for image, or image and text, data.

If you are working with text and prefer to run a local inference container, try text2vec-transformers, with a popular model such as sentence-transformers/all-MiniLM-L12-v2.
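If you want to see what this looks like in code, here is a minimal sketch of configuring a collection with one of these vectorizers. It assumes wvc is weaviate.classes, client is an already-connected Weaviate client, and the collection name and model shown are placeholders.

- Python

import weaviate.classes as wvc

# Assumes `client` is an already-connected Weaviate client
articles = client.collections.create(
    name="Article",  # Placeholder collection name
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_cohere(
        model="embed-multilingual-v3.0"  # Example Cohere multi-lingual embedding model
    ),
)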
Try a re-ranker
Re-ranker modules are a great way to improve the quality of your search results.
A re-ranker module takes in the results of a vector search and re-ranks them based on additional criteria, or a different model. This allows a higher-quality (but slower) model to be used for re-ranking, while still benefiting from the fast first-stage search.
For example, you can use the text2vec-cohere module to perform a vector search, and then use the reranker-cohere module to re-rank the results using a different model.
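As a rough sketch, configuring and using a re-ranker might look like the example below. It assumes the reranker-cohere module is enabled on your Weaviate instance, wvc is weaviate.classes, client is a connected client, and the collection and property names are placeholders.

- Python

import weaviate.classes as wvc

# Configure both a vectorizer and a re-ranker on the collection
reviews = client.collections.create(
    name="Review",  # Placeholder collection name
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_cohere(),
    reranker_config=wvc.config.Configure.Reranker.cohere(),
)

# Fast vector search first, then re-rank the candidates using the "body" property
response = reviews.query.near_text(
    query="easy to assemble",
    limit=10,
    rerank=wvc.query.Rerank(
        prop="body",               # Property whose text is sent to the re-ranker
        query="easy to assemble",  # Query used by the re-ranker
    ),
)

for o in response.objects:
    print(o.properties)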
Property selection
Vectorization captures the "meaning" of the object. Accordingly, if a property is not relevant to the criteria to be applied for search, it should be excluded from the vectorization process.
As an example, if a product object includes metadata such as its manufacturing process or location, and the vector search is intended to be based on the product's features, then the properties for manufacturing process and location should be excluded from the vectorization process.
You can do this by specifying whether to skip a property during vectorization, as shown below. Note that you can similarly choose whether the collection name and the property names are included in the vectorization.
- Python
import weaviate.classes as wvc

# Assumes `client` is an already-connected Weaviate client
products = client.collections.create(
    name="Product",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        vectorize_collection_name=True
    ),
    properties=[
        wvc.config.Property(
            name="name",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=True,
        ),
        wvc.config.Property(
            name="description",
            data_type=wvc.config.DataType.TEXT,
        ),
        wvc.config.Property(
            name="manufacturing_process",
            data_type=wvc.config.DataType.TEXT,
            skip_vectorization=True  # Skip unwanted property
        ),
    ]
)
Chunking
Chunking refers to the process of splitting a text into smaller chunks, and vectorizing each chunk separately. This is very important, as it defines how much information each vector contains.
As a rule of thumb, the more granular the search needs, the smaller the chunk size should be. For example, if you are searching for specific concepts and ideas, you should chunk data into smaller units such as sentences or small windows of text. Alternatively, if you are searching for broader concepts, such as finding relevant chapters or books, you might chunk text accordingly.
Read more about it in the chunking unit of Weaviate Academy.
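As a minimal illustration of the idea (plain Python, not a Weaviate API; the function and parameters are hypothetical), a naive chunker might split text into overlapping windows of words, with each chunk then imported and vectorized as its own object:

- Python

def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows (illustrative only)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

long_document_text = "..."  # Your source text
chunks = chunk_text(long_document_text, chunk_size=100, overlap=20)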
Improve keyword search
Tokenization
Although we refer to BM25 search as a "keyword" search, in reality the exact matches are for "tokens", rather than words. This tokenization process is separate from the one used to generate vector embeddings; instead, it is used to build the inverted index for BM25 searches and filtering.
Accordingly, the tokenization process is very important, as it determines what tokens are used for matching.
The available options are: word, lowercase, whitespace, and field. The default (word) might be sufficient for prose, but for text where exact matches including case and symbols are important, something like whitespace might be more appropriate.
Available tokenization options (the Indexed Tokens column shows how the example text Hello, (beautiful) world would be indexed):

| Tokenization Method | Explanation | Indexed Tokens |
|---|---|---|
| word (default) | Keep only alpha-numeric characters, lowercase them, and split by whitespace. | hello , beautiful , world |
| lowercase | Lowercase the entire text and split on whitespace. | hello, , (beautiful) , world |
| whitespace | Split the text on whitespace. Searches/filters become case-sensitive. | Hello, , (beautiful) , world |
| field | Index the whole field after trimming whitespace characters. | Hello, (beautiful) world |
| trigram | Split the property as rolling trigrams. | Hel , ell , llo , lo, , ... |
| gse | Use the gse tokenizer to split the property. | See gse docs |
| kagome_ja | Use the Kagome tokenizer with a Japanese (IPA) dictionary to split the property. | See kagome docs and the dictionary. |
| kagome_kr | Use the Kagome tokenizer with a Korean dictionary to split the property. | See kagome docs and the Korean dictionary. |
You can set tokenization in the collection configuration.
- Python
things = client.collections.create(
    name="SomeCollection",
    properties=[
        wvc.config.Property(
            name="name",
            data_type=wvc.config.DataType.TEXT,
            tokenization=wvc.config.Tokenization.WORD  # Default
        ),
        wvc.config.Property(
            name="description",
            data_type=wvc.config.DataType.TEXT,
            tokenization=wvc.config.Tokenization.WHITESPACE  # Will keep case & special characters
        ),
        wvc.config.Property(
            name="email",
            data_type=wvc.config.DataType.TEXT,
            tokenization=wvc.config.Tokenization.FIELD  # Index the whole field as a single token
        ),
    ]
)
Select and boost properties
If you observe that matches in some properties have too much of an impact, you can exclude them from the search, and/or boost the importance of certain properties.
For example, matches in the description property might be more important than matches in the notes property. You can specify this at query time.
- Python
questions = client.collections.get("JeopardyQuestion")

response = questions.query.bm25(
    "animal",
    limit=5,
    query_properties=["question^3", "answer"]  # Boost the impact of the "question" property by 3
)

for o in response.objects:
    print(o.properties)
Improve hybrid search
Alpha
The alpha parameter determines the balance between the vector and keyword search results.
If you want to configure your search to be more vector-based, you can increase the alpha value. Conversely, if you want to configure your search to be more keyword-based, you can decrease the alpha value.
- Python
questions = client.collections.get("JeopardyQuestion")

response = questions.query.hybrid(
    "imaging",
    alpha=0.1,  # Mostly a keyword search (try it with alpha=0.9 for a mostly vector search)
    limit=5
)

for o in response.objects:
    print(o.properties)
Fusion algorithm
The fusion algorithm determines how the results from the vector and keyword searches are combined.
By default, the inverses of the ranks from each result set are summed, in what is called the "ranked fusion" algorithm. However, you can also use the "relative score fusion" algorithm, which sums normalized scores from each result set.
Generally, we have found that the "relative score fusion" algorithm works better, but you should try both to see which works best for your use case.
- Python
questions = client.collections.get("JeopardyQuestion")

response = questions.query.hybrid(
    "imaging",
    fusion_type=wvc.query.HybridFusion.RELATIVE_SCORE,  # Use relative score fusion
    limit=5
)

for o in response.objects:
    print(o.properties)