Keyword Search (BM25)
Keyword search is an exact matching-based search using "tokens", or strings of characters.
It uses the BM25 algorithm, which ranks matching documents according to their relevance to a given search query. At a high level, the BM25 algorithm uses the count of query terms in the document (term frequency) against the overall frequency of the term in the dataset (inverse document frequency) to calculate a relevance score.
More specifically, Weaviate uses the BM25F algorithm, which extends BM25 to support using multiple fields in the search index.
A keyword search determines the best matches based on the matches of exact tokens contained in the query against those of the stored objects. As a result, a keyword search is a good choice when exact matches (e.g. exact domain-specific language, precise categories or tags) are important. For example:
- Searching for documents containing specific technical terms
- Identifying articles by precise keywords or tags
This differs from vector search, which finds semantically similar content even when the exact words don't match. You might use keyword search when precision is more important than finding related concepts.
Keyword search in Weaviate
In Weaviate, a keyword search will return the objects best matching the query, as measured by BM25F "score".
The "F" in BM25F stands for "field", indicating that it is a field-specific version of BM25. This allows for different weights for different fields, or properties, of the objects.
In Weaviate, they are used interchangeably, as the BM25F algorithm is used to calculate the scores for keyword searches. Here we will refer to it generally as BM25.
A BM25 score is calculated based on the frequency of the query tokens in the object properties, as well as the length of the object properties and the query.
When an input string such as "A red Nike shoe"
is provided as the query, Weaviate will:
- Tokenize the input (e.g. to
["a", "red", "nike", "shoe"]
) - Remove any stopwords (e.g. remove
a
, to produce["red", "nike", "shoe"]
) - Determine the BM25 scores against selected properties of the database objects, based on the BM25 parameters and any property boosting.
- Return the objects with the highest BM25 scores as the search results
Tokenization
Tokenization for keyword searches refers to how each source text is split up into individual "tokens" to be compared and matched.
The default tokenization method is word
.
Other tokenization methods such as whitespace
, lowercase
, and field
are available, as well as specialized ones such as GSE
or kagome_kr
for other languages (more details).
Set the tokenization option in the inverted index configuration for a collection.
The term "tokenization" is used in other contexts such as vectorization, or language generation. Note that each of these typically use different tokenizers to meet different requirements. This results in different sets of tokens, even from the same input text.
Stopwords
Stopwords are words that are filtered out before processing text.
Weaviate uses configurable stopwords in calculating the BM25 score. Any tokens that are contained in the stopword list will be ignored from the BM25 score calculation.
See the reference page for more details.
BM25 Parameters
BM25 is a scoring function used to rank documents based on the query terms appearing in them. It has two main parameters that control its behavior:
k1
(default: 1.2): Controls term frequency saturation. Higher values mean that multiple occurrences of a term continue to increase the score moreb
(default: 0.75): Controls document length normalization. Values closer to 1 mean more normalization for document length
Set custom k1
and b
values for a collection.
Selected properties
A BM25 query can optionally specify which object properties are to be included in the score calculations.
By default, all text
properties are included in a BM25 calculation. There are two ways to vary this:
- In the collection configuration, set
indexSearchable
for a property tofalse
. This property will then be ignored in all BM25 searches. - Specify which properties to search at query time. This will only apply for that particular query.
Property Boosting
Property boosting allows a query apply different weights to different properties when calculating the final BM25 score.
This is useful when certain properties are more important for search than others.
For example, when searching an e-commerce catalog, you could boost the title property and its categories over the product description.
Set the property weights at query time.
Combining with Vector Search
Keyword search can be combined with vector search in Weaviate to perform a hybrid search. This allows you to leverage both:
- Exact matching capabilities of keyword search
- Semantic understanding of vector search
See Hybrid Search for more information.
Notes and Best Practices
Here are some key considerations when using keyword search:
Tokenization Choice
- Choose based on your data and search requirements. For example, use
word
tokenization for natural language text, but considerfield
for URLs or email addresses that need exact matching as a whole. - For multilingual content, consider specialized tokenizers like
GSE
for Chinese/Japanese orkagome_kr
for Korean - Consider special characters and case sensitivity needs
- Test your tokenization choice with subsets of your data and queries to ensure it handles special characters and case sensitivity as expected. You could perform these experiments with vectorization disabled to save resources/costs, as the two processes are independent.
- Choose based on your data and search requirements. For example, use
Performance Optimization
- Index only the properties you need for search
- Consider combining keyword search with vector search (i.e. perform a hybrid search) as a starting point, especially where you cannot anticipate users' behavior
Query Optimization
- Consider boosting properties that are more important for search (e.g. title, category) over others (e.g. description)
- Only modify
k1
andb
values if you have a good reason to do so, as the defaults are generally well-suited for most use cases
Further resources
Questions and feedback
If you have any questions or feedback, let us know in the user forum.