BM25 (Keyword) searches
Overview
A BM25 search is one implementation of what is commonly called a 'keyword' search. Broadly speaking, it works by matching the search terms between the query and the data objects in the index.
About bm25 queries
How it works
When a user submits a BM25 query, Weaviate looks for objects whose text properties contain the search terms. It then ranks the results based on how often the search terms appear in those properties.
In this way, a BM25 query is different to keyword-based filtering, which simply includes or excludes objects based on the provided set of conditions.
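The difference between filtering and ranking can be sketched in plain Python. This is a toy illustration only: the documents, the function names, and the use of raw term counts as a score are all assumptions made for this example, and the real BM25F score is more involved than a simple count.

```python
# Toy illustration only: raw term counts stand in for the real BM25F score.
docs = {
    "a": "food trucks serve street food",
    "b": "a nearer food merchant",
    "c": "this opera star began life as Belle Silverman",
}


def term_count(text: str, term: str) -> int:
    """Count case-insensitive occurrences of `term` among whitespace-separated words."""
    return text.lower().split().count(term.lower())


def keyword_filter(term: str) -> set:
    """Filtering: an object is either included or excluded; there is no ordering."""
    return {doc_id for doc_id, text in docs.items() if term_count(text, term) > 0}


def keyword_rank(term: str) -> list:
    """Ranking: matching objects are ordered by how strongly they match."""
    scored = [(doc_id, term_count(text, term)) for doc_id, text in docs.items()]
    matching = [pair for pair in scored if pair[1] > 0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)


print(keyword_filter("food"))  # an unordered set of matching ids
print(keyword_rank("food"))    # matching ids, best match first
```

The filter only says which objects qualify, whereas the ranked list also says which of them matches best.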
bm25 query syntax
A BM25 query is shown below. Each BM25 query:
- Must include a query string, which can be any length,
- Can optionally include a list of properties to search,
- Can optionally include weights for each searched property, and
- Can optionally request a score for each result.
- Python
import json

# Assumes `client` is an already-connected Weaviate client instance
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_bm25(
        query="food",  # Query string
        properties=["question^2", "answer"]  # Searched properties, including boost for `question`
    )
    .with_additional("score")  # Include score in the response
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))
The above query returns the top 3 objects, ranked by BM25F score, for the query string "food". The query searches the question and answer properties of the objects, with the question property boosted by a factor of 2.
See the JSON response
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "4.0038033"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "3.8706005"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
},
{
"_additional": {
"score": "3.2457707"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
}
]
}
}
}
Try varying the boost factor, and the query string. What happens to the results?
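To make experimenting with boost factors easier, the property list can be generated from a dictionary of weights. The helper below is hypothetical (it is not part of the Weaviate client); it only builds strings in the property^weight form used in the query above.

```python
def build_properties(boosts: dict) -> list:
    """Turn {"question": 2, "answer": 1} into ["question^2", "answer"].

    Hypothetical helper for illustration; not part of the Weaviate client.
    """
    properties = []
    for name, weight in boosts.items():
        # A weight of 1 means "no boost", so the plain property name is enough.
        properties.append(name if weight == 1 else f"{name}^{weight}")
    return properties


# Build one property list per boost factor to try
for question_boost in (1, 2, 5):
    print(build_properties({"question": question_boost, "answer": 1}))
```

Each resulting list can be passed as the properties argument of with_bm25 to compare how the ranking and scores change.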
Tokenization and bm25 searches
Why tokenization matters
In an earlier unit, we briefly discussed the inverted index, and that it stores a "tokenized" index of data.
When a BM25 query is submitted, Weaviate will search each property according to its tokenization setting. For example, if a property is tokenized with the word tokenization option, Weaviate will tokenize the query string into its constituent lowercase words and search for each word in the index. On the other hand, if a property uses the field tokenization option, Weaviate will look for the entire query string in the index.
This is different to any tokenization in the context of, for example, language models or vectorization models. Tokenization in the context of the current section only applies to the inverted index.
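As a rough mental model of this behaviour, the following plain-Python sketch tokenizes both the stored value and the query string in the same way and then checks for shared tokens. The function names are made up for this example, and the tokenizers only approximate the word and field options; Weaviate's actual implementation may differ in edge cases.

```python
import re


def tokenize_word(text: str) -> list:
    """Approximate `word` tokenization: lowercase, keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_field(text: str) -> list:
    """Approximate `field` tokenization: trim whitespace, keep the whole value as one token."""
    return [text.strip()]


TOKENIZERS = {"word": tokenize_word, "field": tokenize_field}


def matches(query: str, stored_value: str, tokenization: str) -> bool:
    """True if any query token appears among the tokens indexed for the stored value."""
    tokenize = TOKENIZERS[tokenization]
    indexed = set(tokenize(stored_value))
    return any(token in indexed for token in tokenize(query))


print(matches("Jeopardy", "Double Jeopardy!", "word"))   # shares the token "jeopardy"
print(matches("Jeopardy", "Double Jeopardy!", "field"))  # whole strings differ
```

The same query string can therefore match or miss the same stored value, depending only on how the property is tokenized.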
More concretely, let's take a look at some examples.
word tokenization
In this example, we search through the question property with the query string "Jeopardy". The question property is tokenized with the word tokenization option.
- Python
import json

response = (
    client.query
    .get(
        class_name="JeopardyQuestion",
        properties=["question", "round"]
    )
    .with_bm25(
        "Jeopardy",  # Query string
        properties=["question"]  # Search only the `question` property
    )
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=2))
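The word tokenization keeps only alphanumeric characters, converts them to lowercase, and splits the text by whitespace. A stored value such as "Jeopardy!" is therefore indexed without the exclamation mark, and matches the lowercased query term. Accordingly, the search results include those where the question property contains the string "Jeopardy!", which is the title of the TV show.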
See the JSON response
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"question": "Capistrano swallows, Undeliverable mail, \"Jeopardy!\" champs",
"round": "Jeopardy!"
},
{
"question": "This opera star & \"Celebrity Jeopardy!\" contestant began life as Belle Silverman",
"round": "Double Jeopardy!"
}
]
}
}
}
Now, let's take a look at the same query, but with the field tokenization option.
field tokenization
In this example, the query string remains the same ("Jeopardy"); however, we are now searching the round property, which is tokenized with the field tokenization option.
- Python
import json

response = (
    client.query
    .get(
        class_name="JeopardyQuestion",
        properties=["question", "round"]
    )
    .with_bm25(
        "Jeopardy",  # Query string
        properties=["round"]  # Search only the `round` property
    )
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=2))
The field tokenization trims whitespace characters and then keeps the entire string as is, so a query matches only if it equals the whole stored value. Accordingly, the search does not return any results, even though we know that round values include those such as "Jeopardy!" and "Double Jeopardy!".
See the JSON response
{
"data": {
"Get": {
"JeopardyQuestion": []
}
}
}
Try changing the query to "Jeopardy!". What happens to the results?
Rules of thumb
The full list of tokenization options is word, whitespace, lowercase and field. As a rule of thumb, use word for long text where you want to retrieve partial matches, and field for short text where you only want to retrieve exact matches. The others are somewhere in between, and may be useful in specific situations, for example where you want case to matter (whitespace) or special characters to be respected (lowercase).
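To compare the options side by side, the sketch below applies an approximation of each one to the same sample value. The tokenize function is hypothetical and written for this example only; it follows the descriptions in this section rather than Weaviate's actual implementation, so edge cases may behave differently.

```python
import re


def tokenize(text: str, option: str) -> list:
    """Approximate each tokenization option in plain Python (illustration only)."""
    if option == "word":
        # Lowercase, keep alphanumeric runs, drop everything else
        return re.findall(r"[a-z0-9]+", text.lower())
    if option == "lowercase":
        # Lowercase and split on whitespace; special characters are kept
        return text.lower().split()
    if option == "whitespace":
        # Split on whitespace only; case and special characters are kept
        return text.split()
    if option == "field":
        # Trim surrounding whitespace and keep the whole value as one token
        return [text.strip()]
    raise ValueError(f"Unknown tokenization option: {option}")


sample = "  Double Jeopardy! "
for option in ("word", "lowercase", "whitespace", "field"):
    print(f"{option:>10}: {tokenize(sample, option)}")
```

Reading the output from top to bottom, each option preserves more of the original string, which is why matches become progressively stricter from word to field.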
BM25F scoring
The exact algorithm used for scoring and ranking the results is called the BM25F algorithm. The details are beyond the scope of this course, but the gist is that the BM25F algorithm is a variant of the BM25 algorithm, where the F stands for "field". It is used to score and rank results based on the fields that are searched.
If you would like to delve into the details of the exact algorithm, you can review this Wikipedia page.
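For readers who prefer code to formulas, here is a minimal sketch of the classic single-field BM25 score. It is not Weaviate's implementation: the function name and the tiny tokenized corpus are made up for this example, the parameter values k1=1.2 and b=0.75 are commonly used defaults assumed here, and the per-field weighting that BM25F adds is left out.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against the query using classic single-field BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        # Rarer terms get a higher inverse document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        # Longer-than-average documents are penalized, controlled by b
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        # Term frequency saturates as tf grows, controlled by k1
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Made-up, already tokenized documents
corpus = [
    ["devil", "s", "food", "angel", "food", "are", "types", "of", "this", "dessert"],
    ["a", "nearer", "food", "merchant"],
    ["this", "opera", "star", "began", "life", "as", "belle", "silverman"],
]
scores = [bm25_score(["food"], doc, corpus) for doc in corpus]
print(scores)
```

The score rewards repeated query terms with diminishing returns and discounts long documents, which is why a short object with one match can score close to a longer object with two.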
Review
Key takeaways
- BM25 search matches search terms between the query and data objects in the index and ranks results based on the frequency of those terms.
- A BM25 query must include a query string, and can optionally include a list of properties to search, weights for each searched property, and a request for a score for each result.
- BM25 queries are impacted by the tokenization of the properties being searched; for instance, word tokenization splits the query string into lowercase words and field tokenization searches for the entire query string.
- Consider your search use case for picking tokenization options. For example, use word for long text with partial matches, and field for short text with exact matches.
- BM25F scoring, where 'F' stands for 'field', is used to score and rank the search results based on the fields that are searched.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.