BM25 (Keyword) searches

Overview

A BM25 search is one implementation of what is commonly called a 'keyword' search. Broadly speaking, it works by matching the search terms between the query and the data objects in the index.

About `bm25` queries

How it works

When a user submits a BM25 query, Weaviate looks for objects whose text properties contain the search terms. It then ranks the matching objects by a score that reflects how often the search terms appear in those properties.

In this way, a BM25 query is different to keyword-based filtering, which simply includes or excludes objects based on the provided set of conditions.
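The difference between filtering and ranking can be illustrated with a toy sketch. It is not Weaviate's implementation: the sample documents and the helper names (`term_count`, `keyword_filter`, `keyword_rank`) are made up, and a raw term count stands in for the real score.

```python
# Toy illustration only; a raw term count stands in for the BM25 score.
docs = [
    "food trucks sell street food",
    "a nearer food merchant",
    "this dessert is a cake",
]

def term_count(text, term):
    """Count how many whitespace-separated tokens equal the term (case-insensitive)."""
    return text.lower().split().count(term.lower())

def keyword_filter(documents, term):
    """Filtering: an object is either included or excluded, with no ordering."""
    return [d for d in documents if term_count(d, term) > 0]

def keyword_rank(documents, term):
    """Ranking: matching objects are scored, then sorted by score (highest first)."""
    scored = [(term_count(d, term), d) for d in documents]
    matches = [pair for pair in scored if pair[0] > 0]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)

filtered = keyword_filter(docs, "food")
ranked = keyword_rank(docs, "food")
print(filtered)
print(ranked)
```

Both functions return the same two documents here, but only the ranked version carries a score that orders them.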

`bm25` query syntax

A BM25 query is shown below. Each BM25 query:

Must include a query string, which can be any length,
Can optionally include a list of properties to search,
Can optionally include weights for each searched property, and
Can optionally request a score for each result.

Python

import json

# Assumes `client` is an already-connected Weaviate client instance
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_bm25(
        query="food",  # Query string
        properties=["question^2", "answer"]  # Searched properties, including boost for `question`
    )
    .with_additional("score")  # Include score in the response
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))

The above query returns the top 3 objects ranked by their BM25F scores for the query string "food". It searches the question and answer properties of the objects, with the question property boosted by a factor of 2.

See the JSON response

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "4.0038033"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.8706005"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "3.2457707"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        }
      ]
    }
  }
}

Exercise

Try varying the boost factor, and the query string. What happens to the results?
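As a starting point for the exercise, the sketch below shows how a property string such as question^2 can be read as a property name plus a boost. The helper names (`parse_property`, `combined`) and the per-property scores are hypothetical, and the weighted sum is a deliberate simplification rather than the way Weaviate combines scores internally.

```python
def parse_property(spec):
    """Split 'name^boost' into (name, boost); the boost defaults to 1.0."""
    name, sep, boost = spec.partition("^")
    return name, (float(boost) if sep else 1.0)

def combined(scores, weights):
    """Simplified combination: a weighted sum of per-property scores."""
    return sum(weights[prop] * score for prop, score in scores.items())

def rank(objects, weights):
    """Return object keys ordered by combined score, highest first."""
    return sorted(objects, key=lambda k: combined(objects[k], weights), reverse=True)

# Made-up per-property match scores for two objects (illustration only)
objects = {
    "A": {"question": 1.0, "answer": 0.0},
    "B": {"question": 0.0, "answer": 1.5},
}

boosted = dict(parse_property(p) for p in ["question^2", "answer"])
unboosted = dict(parse_property(p) for p in ["question", "answer"])

print(boosted)
print(rank(objects, boosted))
print(rank(objects, unboosted))
```

In this toy setup, changing the boost on question is enough to swap the order of the two objects.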

Tokenization and `bm25` searches

Why tokenization matters

In an earlier unit, we briefly discussed the inverted index, and that it stores a "tokenized" index of data.

When a BM25 query is submitted, Weaviate searches each property according to that property's tokenization setting. For example, if a property uses the word tokenization option, Weaviate tokenizes the query string into its constituent lowercase words and searches for each word in the index. On the other hand, if a property uses the field tokenization option, Weaviate looks for the entire query string in the index.

This applies to the inverted index only

This is different to any tokenization in the context of, for example, language models or vectorization models. Tokenization in the context of the current section only applies to the inverted index.

More concretely, let's take a look at some examples.

`word` tokenization

In this example, we search through the question property with the query string Jeopardy. The question property is tokenized with the word tokenization option.

Python

response = (
    client.query
    .get(
        class_name="JeopardyQuestion",
        properties=["question", "round"]
    )
    .with_bm25(
        "Jeopardy",
        properties=["question"]
    )
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=2))

The word tokenization keeps alpha-numeric characters in lowercase, and splits them by whitespace. Accordingly, the search results include those where the question property contains the string Jeopardy!, which is the title of the TV show.

See the JSON response

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "question": "Capistrano swallows, Undeliverable mail, \"Jeopardy!\" champs",
          "round": "Jeopardy!"
        },
        {
          "question": "This opera star & \"Celebrity Jeopardy!\" contestant began life as Belle Silverman",
          "round": "Double Jeopardy!"
        }
      ]
    }
  }
}

Now, let's take a look at the same query, but with the field tokenization option.

`field` tokenization

In this example, the query string remains the same (Jeopardy), but we are now searching the round property, which is tokenized with the field tokenization option.

Python

response = (
    client.query
    .get(
        class_name="JeopardyQuestion",
        properties=["question", "round"]
    )
    .with_bm25(
        "Jeopardy",
        properties=["round"]
    )
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=2))

The field tokenization trims whitespace characters and then keeps the entire string as is. Accordingly, the search does not return any results, even though we know that round values include Jeopardy! and Double Jeopardy!.

See the JSON response

{
  "data": {
    "Get": {
      "JeopardyQuestion": []
    }
  }
}

Exercise

Try changing the query to Jeopardy!. What happens to the results?

Rules of thumb

The full list of tokenization options is word, whitespace, lowercase and field. As a rule of thumb, use word for long text where you want to retrieve partial matches, and field for short text where you only want to retrieve exact matches. The others are somewhere in between, and may be useful in specific situations, for example where you want case to matter (whitespace) or special characters to be respected (lowercase).
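To make the four options more tangible, here is a rough approximation of each one in plain Python. The `tokenize` and `matches` helpers are hypothetical and are not Weaviate's code. In particular, the word branch assumes that any non-alphanumeric character acts as a separator and only handles ASCII letters and digits.

```python
import re
from typing import List

def tokenize(text: str, method: str) -> List[str]:
    """Approximate the four tokenization options (a sketch, not Weaviate's code)."""
    if method == "word":
        # Assumption: lowercase, then keep runs of ASCII letters and digits
        return re.findall(r"[a-z0-9]+", text.lower())
    if method == "lowercase":
        # Lowercase and split on whitespace; special characters are kept
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only; case and special characters are kept
        return text.split()
    if method == "field":
        # Trim surrounding whitespace and keep the whole string as one token
        return [text.strip()]
    raise ValueError(f"Unknown tokenization method: {method}")

def matches(query: str, value: str, method: str) -> bool:
    """True if any query token equals a token of the stored value."""
    value_tokens = set(tokenize(value, method))
    return any(token in value_tokens for token in tokenize(query, method))

for method in ["word", "lowercase", "whitespace", "field"]:
    print(method, tokenize("Double Jeopardy!", method), matches("Jeopardy", "Double Jeopardy!", method))
```

Under these assumptions, only the word option lets the query Jeopardy match the stored value Double Jeopardy!, because it is the only one that drops the exclamation mark.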

BM25F scoring

The exact algorithm used for scoring and ranking the results is called the BM25F algorithm. The details are beyond the scope of this course, but the gist is that the BM25F algorithm is a variant of the BM25 algorithm, where the F stands for "field". It is used to score and rank results based on the fields that are searched.

If you would like to delve into the details of the exact algorithm, you can review this Wikipedia page.
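For a feel of what such a score looks like, below is a minimal sketch of the classic single-field BM25 formula, not the BM25F variant and not Weaviate's implementation. The function name `bm25_scores`, the sample documents, the whitespace tokenization, and the parameter values k1=1.2 and b=0.75 are assumptions chosen for illustration.

```python
import math
from typing import List

def bm25_scores(query: str, documents: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in query.lower().split():
            tf = tokens.count(term)  # term frequency in this document
            if tf == 0:
                continue
            df = sum(1 for other in tokenized if term in other)  # document frequency
            # Rarer terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Longer-than-average documents are penalized via b
            length_norm = 1 - b + b * len(tokens) / avg_len
            # k1 controls how quickly repeated occurrences stop adding to the score
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "devil's food and angel food are types of this dessert",
    "a nearer food merchant",
    "this opera star began life as belle silverman",
]
scores = bm25_scores("food", docs)
print(scores)
```

Documents that do not contain the query term score zero. Among the others, the score grows with term frequency but is dampened by document length. BM25F extends this idea by taking the searched fields, and any weights on them, into account.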

Review

Question

What does a BM25 search do?

Question

What does the `word` tokenization option do?

Key takeaways

BM25 search matches search terms between the query and data objects in the index and ranks results based on the frequency of those terms.
A BM25 query must include a query string, and can optionally include a list of properties to search, weights for each searched property, and a request for a score for each result.
BM25 queries are impacted by the tokenization of the properties being searched; for instance, word tokenization splits the query string into lowercase words and field tokenization searches for the entire query string.
Consider your search use case for picking tokenization options. For example, use word for long text with partial matches, and field for short text with exact matches.
BM25F scoring, where 'F' stands for 'field', is used to score and rank the search results based on the fields that are searched.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.
