
Keyword (BM25) search

Overview

This page shows you how to perform keyword searches.

A keyword search looks for objects whose properties contain the search terms, matched according to the tokenization selected for each property. Results are scored with the BM25F function, so a keyword search is also called a bm25 search in Weaviate. It is also known as a sparse vector search.
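To give a feel for what the ranking function does, here is a minimal sketch of a plain, single-field BM25 scorer in pure Python. It is not Weaviate's implementation: BM25F additionally combines several weighted properties, and the function name `bm25_scores` and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with plain BM25.

    docs is a list of token lists. k1 and b are illustrative values,
    not settings read from any Weaviate configuration.
    """
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each token appears.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue  # a token that is absent contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["a", "nearer", "food", "merchant"],
    ["wine", "a", "ship"],
    ["food", "food", "cake"],
]
scores = bm25_scores(["food"], docs)
```

In this sketch a document that does not contain the query token scores zero, and a shorter document that repeats the token scores higher than a longer one that mentions it once.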

Search Accuracy

Starting in v1.19.0, BM25 search has improved accuracy for schemas with a large number of properties and for zero-length properties. If you are using an earlier version, please upgrade.

At a minimum, a BM25 search requires a search string.

This example uses default settings to look for objects containing the keyword food anywhere in the object.

It ranks the results using BM25F, and returns the top 3.

import json

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    limit=3
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))
Example response

It should produce a response like the one below:

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        },
        {
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        }
      ]
    }
  }
}

For additional details, see the BM25 API Reference.

Score

The score sub-property is the BM25F score used to rank the outputs.

To retrieve sub-properties with one of the legacy clients, use the _additional property to specify score. The new Python client returns the information as metadata.

import json
import weaviate.classes as wvc

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    return_metadata=wvc.MetadataQuery(score=True),  # request the BM25F score
    limit=3
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))
    print(o.metadata.score)
Example response

It should produce a response like the one below:

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.0140665"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        },
        {
          "_additional": {
            "score": "2.8725255"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "2.7672548"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        }
      ]
    }
  }
}

Limit the results

You can limit the number of result objects that a bm25 search returns.

  • For a fixed number of objects, use the limit: <N> operator.
  • For groups based on discontinuities in score, use the autocut operator.

You can combine autocut with limit: <N> to cap the size of the autocut groups at the first N objects.

Limiting the number of results

Use the limit argument to specify the maximum number of results that should be returned:

import json

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    limit=3  # return at most three objects
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))

Autocut

Weaviate can also limit results based on discontinuities in the result set. In the legacy client, this filter is called autocut. The filter is called auto_limit in the new client.

The filter looks for discontinuities, or jumps, in the result score. In your query, you specify how many jumps there should be. The query stops returning results after the specified number of jumps.

For example, consider a bm25 search that returns these scores [2.676, 2.021, 2.022, 1.854, 1.856, 1.713]. There is a significant break between 2.676 and 2.021. There is another significant break between 2.022 and 1.854. Autocut uses the number of breaks to return data groups.

  • autocut: 1 returns one object, [2.676].
  • autocut: 2 returns three objects, [2.676, 2.021, 2.022].
  • autocut: 3 returns all objects, [2.676, 2.021, 2.022, 1.854, 1.856, 1.713].
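The grouping above can be reproduced with a small pure-Python sketch. It is only an illustration: the helper name `autocut` and the fixed `min_gap` threshold of 0.15 are assumptions made here so that the two breaks named above count as jumps. Weaviate's own jump detection is not described on this page and may work differently.

```python
def autocut(scores, max_jumps, min_gap=0.15):
    """Return the scores that come before the max_jumps-th jump.

    A "jump" here is a drop between neighbouring scores larger than
    min_gap. The threshold is a hypothetical stand-in for whatever
    discontinuity test the server really applies.
    """
    jumps = 0
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] > min_gap:
            jumps += 1
            if jumps == max_jumps:
                return scores[:i]  # stop at the requested jump
    return list(scores)  # fewer jumps than requested: keep everything


scores = [2.676, 2.021, 2.022, 1.854, 1.856, 1.713]
first_group = autocut(scores, 1)
two_groups = autocut(scores, 2)
all_groups = autocut(scores, 3)
```

Under these assumptions, `first_group` holds one score, `two_groups` holds three, and `all_groups` holds all six, which matches the list above.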

Autocut can be used as follows:

import json

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    auto_limit=1  # stop at the first jump in score
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))
Example response

It should produce a response like the one below:

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "2.6768136"
          },
          "answer": "OSHA (Occupational Safety and Health Administration)",
          "question": "The government admin. was created in 1971 to ensure occupational health & safety standards"
        }
      ]
    }
  }
}

Selected properties only

Starting in v1.19.0, you can specify the object properties to search.

This example searches for objects that have the keyword food, but only when food is in the question property. The query uses the BM25F scores of the searched property to rank the objects it finds. It returns the top three objects.

import json
import weaviate.classes as wvc

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question"],  # search only the question property
    return_metadata=wvc.MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))
    print(o.metadata.score)
Example response

It should produce a response like the one below:

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.7079012"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.4311616"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "2.8312314"
          },
          "answer": "honey",
          "question": "The primary source of this food is the Apis mellifera"
        }
      ]
    }
  }
}

Use weights to boost properties

You can specify weights for properties to change how much the property affects the overall BM25F score.

This example searches for objects that contain the keyword food in the question property and the answer property. The weight of the question property is boosted (^2). The query returns the top three objects.

import json

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question^2", "answer"],  # boost question by a factor of 2
    limit=3
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))
Example response

It should produce a response like the one below:

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "4.0038033"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.8706005"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "3.2457707"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        }
      ]
    }
  }
}
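To make the `property^weight` notation used in this section easier to read, here is a short sketch that splits such a string into a property name and a numeric boost. The helper `parse_weighted_property` is hypothetical and is not part of the Weaviate client. Treating a property without a `^` suffix as having weight 1.0 is an assumption made for this illustration.

```python
def parse_weighted_property(spec):
    """Split a spec such as "question^2" into (name, weight).

    Hypothetical helper for illustration only. A spec without "^" is
    assumed to carry a neutral weight of 1.0.
    """
    name, sep, boost = spec.partition("^")
    return name, float(boost) if sep else 1.0


weights = dict(parse_weighted_property(p) for p in ["question^2", "answer"])
```

Here `weights` maps each property name to its boost, so the effect of the query properties list can be inspected before it is sent.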

Tokenization

The BM25 query string is tokenized before it is used to search for objects using the inverted index. Due to the nature of BM25 scoring, Weaviate will return any object that matched at least one of the tokens.

This example returns objects that contain either food or wine in the question property, and ranks them using BM25F scores.

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food wine",  # search for food or wine
    query_properties=["question"],
    return_properties=["question"],  # only return question property
    limit=5
)

for o in response.objects:
    print(o.properties["question"])
Example response

The query should produce a response like the one below:

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "4.4707017"
          },
          "question": "Wine, a ship, Croce's time"
        },
        {
          "_additional": {
            "score": "3.7450757"
          },
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.647569"
          },
          "question": "Type of event in Cana at which Jesus turned water into wine"
        },
        {
          "_additional": {
            "score": "3.4594069"
          },
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "3.3400855"
          },
          "question": "Sparkling wine sold under the name Champagne must come from this region in Northeast France"
        }
      ]
    }
  }
}
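The "match at least one token" behavior can be approximated with a few lines of pure Python. This is a sketch under stated assumptions, not Weaviate's tokenizer: the helper names are hypothetical, and the rule used here (lowercase the text, then split on anything that is not a letter or digit) only approximates a word-style tokenization.

```python
import re


def tokenize_word(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Approximation for illustration; the tokenization configured on a
    real property may differ.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_any(query, text):
    """Return True if the text shares at least one token with the query."""
    query_tokens = set(tokenize_word(query))
    return not query_tokens.isdisjoint(tokenize_word(text))


questions = [
    "Wine, a ship, Croce's time",
    "A nearer food merchant",
    "This type of retail store sells more shampoo & makeup than any other",
]
matched = [q for q in questions if matches_any("food wine", q)]
```

In this sketch only the first two questions end up in `matched`. Because whole tokens are compared, a word such as "seafood" would not count as a match for "food".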

Add a conditional (where) filter

You can add a conditional filter to any BM25 search query. The filter restricts which objects appear in the output but does not impact the ranking.

This example performs a BM25 search for food in any field. The filter keeps only objects that have the round property set to Double Jeopardy!.

To filter with one of the legacy clients, use with_where. The new Python client uses the Filter class from weaviate.classes.

import json
import weaviate.classes as wvc

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    filters=wvc.Filter("round").equal("Double Jeopardy!"),
    return_properties=["answer", "question", "round"],  # return these properties
    limit=3
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))
Example response

It should produce a response like the one below:

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.0140665"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other",
          "round": "Double Jeopardy!"
        },
        {
          "_additional": {
            "score": "1.9633813"
          },
          "answer": "honey",
          "question": "The primary source of this food is the Apis mellifera",
          "round": "Double Jeopardy!"
        },
        {
          "_additional": {
            "score": "1.6719631"
          },
          "answer": "pseudopods",
          "question": "Amoebas use temporary extensions called these to move or to surround & engulf food",
          "round": "Double Jeopardy!"
        }
      ]
    }
  }
}
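The statement that a filter narrows the output without changing the ranking can be illustrated with a toy example in pure Python. Nothing here calls Weaviate: the records, their scores, and the helper `filter_then_limit` are all made up for illustration, and the sketch says nothing about where in the query pipeline Weaviate actually applies the filter.

```python
def filter_then_limit(scored, round_value, limit):
    """Keep objects whose round matches, ordered by their existing score.

    Hypothetical helper: scores are passed through untouched, only
    membership in the result changes.
    """
    kept = [o for o in scored if o["round"] == round_value]
    kept.sort(key=lambda o: o["score"], reverse=True)
    return kept[:limit]


# Made-up records standing in for scored search hits.
scored = [
    {"answer": "A", "round": "Double Jeopardy!", "score": 3.0},
    {"answer": "B", "round": "Jeopardy!", "score": 2.5},
    {"answer": "C", "round": "Double Jeopardy!", "score": 2.0},
]
filtered = filter_then_limit(scored, "Double Jeopardy!", limit=3)
```

In this sketch `filtered` contains A and C in their original relative order and with their original scores; B is simply absent.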