
BM25 (Keyword) searches

Overview

A BM25 search is one implementation of what is commonly called a 'keyword' search. Broadly speaking, it works by matching the search terms between the query and the data objects in the index.

About bm25 queries

How it works

When a user submits a BM25 query, Weaviate will look for objects that contain the search terms in the text properties of the objects. Then, it will rank the results based on how many times the search terms appear in the text properties of the objects.

In this way, a BM25 query is different to keyword-based filtering, which simply includes or excludes objects based on the provided set of conditions.
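The difference between filtering and ranking can be illustrated with a toy sketch. It is not how Weaviate computes scores: it only counts term occurrences, whereas BM25 also accounts for factors such as term rarity and text length. The sample texts and the helper names (`words`, `keyword_filter`, `keyword_rank`) are made up for illustration.

```python
# Toy illustration only: real BM25 scoring is more involved than a raw count.
docs = [
    "Devil's food & angel food are types of this dessert",
    "A nearer food merchant",
    "This opera star began life as Belle Silverman",
]


def words(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def keyword_filter(docs, term):
    """Filtering: each document is either kept or dropped; no order is implied."""
    return [d for d in docs if term in words(d)]


def keyword_rank(docs, term):
    """Ranking: score each matching document, then sort by score, highest first."""
    scored = [(words(d).count(term), d) for d in docs]
    matches = [pair for pair in scored if pair[0] > 0]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)


filtered = keyword_filter(docs, "food")
ranked = keyword_rank(docs, "food")

print(filtered)
print(ranked)
```

Both calls select the same two documents, but only the ranked result carries a score that orders them.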

bm25 query syntax

A BM25 query is shown below. Each BM25 query:

  • Must include a query string, which can be any length,
  • Can optionally include a list of properties to search,
  • Can optionally include weights for each searched property, and
  • Can optionally request a score for each result.

import json

# Assumes `client` is an already-connected Weaviate client instance
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_bm25(
        query="food",  # Query string
        properties=["question^2", "answer"]  # Searched properties, including boost for `question`
    )
    .with_additional("score")  # Include score in the response
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))

The above query returns the top 3 objects ranked by BM25F score for the query string "food". It searches the question and answer properties of the objects, with the question property boosted by a factor of 2.

See the JSON response
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "4.0038033"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.8706005"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "3.2457707"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        }
      ]
    }
  }
}

Exercise

Try varying the boost factor, and the query string. What happens to the results?
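To make it easier to experiment with boost factors, a small helper can build the properties list. This is a sketch: `with_boosts` is a hypothetical helper, not part of the Weaviate client, and it only produces strings in the `property^factor` form used in the query above.

```python
def with_boosts(boosts):
    """Build a `properties` list such as ["question^2", "answer"].

    `boosts` maps a property name to its boost factor. A factor of 1 means
    no boost, so the name is emitted without a `^` suffix.
    """
    return [
        name if factor == 1 else f"{name}^{factor}"
        for name, factor in boosts.items()
    ]


for factor in (1, 2, 5):
    properties = with_boosts({"question": factor, "answer": 1})
    print(properties)
    # Pass `properties` to `.with_bm25(query="food", properties=properties)`
    # in the query shown earlier, then compare the returned scores.
```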

Tokenization and bm25 searches

Why tokenization matters

In an earlier unit, we briefly discussed the inverted index, and that it stores a "tokenized" index of data.

When a BM25 query is submitted, Weaviate will search each property according to its tokenization setting. For example, if a property is tokenized with the word tokenization option, Weaviate will tokenize the query string into its constituent lowercase words and search for each word in the index. On the other hand, if a property uses the field tokenization option, Weaviate will look for the entire query string in the index.
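The idea can be sketched with a toy inverted index. This is an illustration under simplifying assumptions, not Weaviate's implementation: the tokenizers are ASCII-only approximations of the word and field options, and the sample values and function names are hypothetical.

```python
import re
from collections import defaultdict


def word_tokens(text):
    # Approximation of `word`: lowercase, keep runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def field_tokens(text):
    # Approximation of `field`: the whole trimmed string is a single token.
    return [text.strip()]


def build_index(values, tokenize):
    """Map each token to the set of object ids whose value produced it."""
    index = defaultdict(set)
    for obj_id, value in enumerate(values):
        for token in tokenize(value):
            index[token].add(obj_id)
    return index


def lookup(index, query, tokenize):
    """Tokenize the query the same way as the data, then collect matching ids."""
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return hits


# Hypothetical property values for two objects (ids 0 and 1).
values = ["Science Fiction", "Fiction"]

word_index = build_index(values, word_tokens)
field_index = build_index(values, field_tokens)

word_hits = lookup(word_index, "Fiction", word_tokens)
field_hits = lookup(field_index, "Fiction", field_tokens)

print(word_hits)   # ids of values containing the word "fiction"
print(field_hits)  # ids of values equal to the whole string "Fiction"
```

With the word-style index, both values match because each contains the token `fiction`. With the field-style index, only the value that equals the entire query string matches.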

This applies to the inverted index only

This is different to any tokenization in the context of, for example, language models or vectorization models. Tokenization in the context of the current section only applies to the inverted index.

More concretely, let's take a look at some examples.

word tokenization

In this example, we search through the question property with the query string Jeopardy. The question property is tokenized with the word tokenization option.

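import json

# Assumes `client` is an already-connected Weaviate client instance
response = (
    client.query
    .get(
        class_name="JeopardyQuestion",
        properties=["question", "round"]
    )
    .with_bm25(
        "Jeopardy",  # Query string
        properties=["question"]  # Search the `word`-tokenized property
    )
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=2))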

The word tokenization keeps only alphanumeric characters, lowercases them, and splits them by whitespace, so punctuation such as the ! in Jeopardy! is dropped. Accordingly, the search results include those where the question property contains the string Jeopardy!, which is the title of the TV show.

See the JSON response
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "question": "Capistrano swallows, Undeliverable mail, \"Jeopardy!\" champs",
          "round": "Jeopardy!"
        },
        {
          "question": "This opera star & \"Celebrity Jeopardy!\" contestant began life as Belle Silverman",
          "round": "Double Jeopardy!"
        }
      ]
    }
  }
}

Now, let's take a look at the same query, but with the field tokenization option.

field tokenization

In this example, the query string remains the same (Jeopardy), but we are now searching the round property, which is tokenized with the field tokenization option.

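import json

# Assumes `client` is an already-connected Weaviate client instance
response = (
    client.query
    .get(
        class_name="JeopardyQuestion",
        properties=["question", "round"]
    )
    .with_bm25(
        "Jeopardy",  # Query string
        properties=["round"]  # Search the `field`-tokenized property
    )
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=2))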

The field tokenization trims whitespace characters and then keeps the entire string as is. Accordingly, the search does not return any results, even though we know that the round property has values such as Jeopardy! and Double Jeopardy!.

See the JSON response
{
  "data": {
    "Get": {
      "JeopardyQuestion": []
    }
  }
}

Exercise

Try changing the query to Jeopardy!. What happens to the results?

Rules of thumb

The full list of tokenization options is word, whitespace, lowercase and field. As a rule of thumb, use word for long text where you want to retrieve partial matches, and field for short text where you only want to retrieve exact matches. The others fall somewhere in between and may be useful in specific situations, for example where you want case to matter (whitespace) or special characters to be respected (lowercase).
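The four options can be compared side by side with a rough approximation. This sketch assumes ASCII text and only mimics the behavior described in this section. It is not Weaviate's tokenizer, and the `tokenize` function and sample string are hypothetical.

```python
import re


def tokenize(text, method):
    """Approximate the four tokenization options (ASCII-only simplification)."""
    if method == "word":
        # Lowercase, keep alphanumeric runs, drop everything else.
        return re.findall(r"[a-z0-9]+", text.lower())
    if method == "lowercase":
        # Lowercase and split on whitespace; special characters are kept.
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only; case and special characters are kept.
        return text.split()
    if method == "field":
        # Trim surrounding whitespace and keep the whole value as one token.
        return [text.strip()]
    raise ValueError(f"unknown tokenization method: {method}")


sample = "Hello, (beautiful) world"
for method in ("word", "lowercase", "whitespace", "field"):
    print(f"{method:<10} {tokenize(sample, method)}")
```

Only the word option strips the comma and parentheses, only whitespace and field preserve the capital H, and field is the only option that yields a single token.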

BM25F scoring

The exact algorithm used for scoring and ranking the results is called the BM25F algorithm. The details are beyond the scope of this course, but the gist is that BM25F is a variant of the BM25 algorithm, where the F stands for "field". It combines the matches found in each searched field (property) into a single score per object, which is what makes per-property boosts possible.
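For readers who want a feel for the shape of the calculation, the following is a simplified sketch of one common textbook formulation of BM25F. It is not Weaviate's implementation. The function name, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions made for illustration.

```python
import math


def bm25f_score(query_terms, doc, corpus, boosts, k1=1.2, b=0.75):
    """Score one document. `doc` and each corpus entry map field -> token list."""
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        # Number of documents containing the term in any searched field.
        n_t = sum(any(term in d[f] for f in boosts) for d in corpus)
        if n_t == 0:
            continue
        # Rarer terms receive a higher inverse document frequency.
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        # Combine per-field term frequencies, length-normalized and boosted.
        weighted_tf = 0.0
        for field, boost in boosts.items():
            avg_len = sum(len(d[field]) for d in corpus) / n_docs or 1.0
            norm = 1 - b + b * len(doc[field]) / avg_len
            weighted_tf += boost * doc[field].count(term) / norm
        # Saturation: repeated occurrences add progressively less to the score.
        score += idf * weighted_tf / (k1 + weighted_tf)
    return score


# Hypothetical, already-tokenized objects.
corpus = [
    {"question": ["types", "of", "food"], "answer": ["cake"]},
    {"question": ["a", "nearer", "merchant"], "answer": ["food", "store"]},
    {"question": ["opera", "star"], "answer": ["singer"]},
]
boosts = {"question": 2, "answer": 1}

scores = [bm25f_score(["food"], d, corpus, boosts) for d in corpus]
print(scores)
```

In this toy corpus, the first object scores highest because the term appears in the boosted question field, and the third object scores zero because it does not contain the term at all.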

If you would like to delve into the details of the exact algorithm, you can review the Wikipedia page on Okapi BM25.

Review

  • Question: What does a BM25 search do?
  • Question: What does the `word` tokenization option do?

Key takeaways

  • BM25 search matches search terms between the query and data objects in the index and ranks results based on the frequency of those terms.
  • A BM25 query must include a query string, and can optionally include a list of properties to search, weights for each searched property, and a request for a score for each result.
  • BM25 queries are impacted by the tokenization of the properties being searched; for instance, word tokenization splits the query string into lowercase words and field tokenization searches for the entire query string.
  • Consider your search use case for picking tokenization options. For example, use word for long text with partial matches, and field for short text with exact matches.
  • BM25F scoring, where 'F' stands for 'field', is used to score and rank the search results based on the fields that are searched.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.