Keyword (BM25) search
Overview
This page shows you how to perform keyword searches.
A keyword search looks for objects that contain the search terms in their properties according to the selected tokenization. The results are scored according to the BM25F function, so a keyword search is also called a bm25 search in Weaviate. It is also called a sparse vector search.
Starting in v1.19.0, BM25 search has improved accuracy for schemas with a large number of properties and for zero-length properties. If you are using an earlier version, please upgrade.
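To make the scoring idea concrete, the sketch below implements plain single-field BM25 in pure Python. It is an illustration only, not Weaviate's implementation: BM25F additionally combines several properties, the whitespace tokenizer is a simplification, and the function name bm25_scores and the k1 = 1.2 and b = 0.75 values are assumptions chosen because they are common defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one plain BM25 score per document (hypothetical helper)."""
    # Simplified tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    query_terms = query.lower().split()

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_counts[term]
            if tf == 0:
                continue
            # Number of documents that contain the term at least once.
            df = sum(1 for other in tokenized if term in other)
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates and is normalized by document length.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


questions = [
    "A nearer food merchant",
    "Devil's food & angel food are types of this dessert",
    "This type of retail store sells more shampoo & makeup than any other",
]
for question, score in zip(questions, bm25_scores("food", questions)):
    print(f"{score:.3f}  {question}")
```

The third question scores zero here because the term only appears in that object's answer, which this single-field sketch does not look at.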
Basic BM25 search
To use BM25 search, you must provide at least a search string.
This example uses default settings to look for objects containing the keyword food anywhere in the object.
It ranks the results using BM25F and returns the top 3.
- Python (v4)
- Python (v3)
- JavaScript/TypeScript
- GraphQL
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    limit=3
)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))
response = (
client.query
.get("JeopardyQuestion", ["question", "answer"])
.with_bm25(
query="food"
)
.with_limit(3)
.do()
)
print(json.dumps(response, indent=2))
result = await client.graphql
.get()
.withClassName('JeopardyQuestion')
.withBm25({
query: 'food',
})
.withLimit(3)
.withFields('question answer')
.do();
console.log(JSON.stringify(result, null, 2));
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
}
) {
question
answer
}
}
}
Example response
It should produce a response like the one below:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
},
{
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"answer": "a closer grocer",
"question": "A nearer food merchant"
}
]
}
}
}
For additional details, see the BM25 API Reference.
Score
The score sub-property is the BM25F score used to rank the outputs.
To retrieve sub-properties with one of the legacy clients, use the _additional property to specify score. The new Python client returns the information as metadata.
- Python (v4)
- Python (v3)
- JavaScript/TypeScript
- GraphQL
import weaviate.classes as wvc
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    return_metadata=wvc.MetadataQuery(score=True),
    limit=3
)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))
    print(o.metadata.score)
response = (
client.query
.get("JeopardyQuestion", ["question", "answer"])
.with_bm25(
query="food"
)
.with_additional("score")
.with_limit(3)
.do()
)
print(json.dumps(response, indent=2))
result = await client.graphql
.get()
.withClassName('JeopardyQuestion')
.withBm25({
query: 'food',
})
.withFields('question answer _additional { score }')
.withLimit(3)
.do();
console.log(JSON.stringify(result, null, 2));
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
}
) {
question
answer
_additional {
score
}
}
}
}
Example response
It should produce a response like the one below:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.0140665"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
},
{
"_additional": {
"score": "2.8725255"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "2.7672548"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
}
]
}
}
}
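In the example response above, the legacy clients return score as a string. If you want to compare or sort scores numerically, you can convert them first. The sketch below works on a trimmed copy of that response; the helper name scores_as_floats is hypothetical.

```python
import json

# Trimmed copy of the example response shown above.
raw_response = """
{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {"_additional": {"score": "3.0140665"}, "answer": "food stores (supermarkets)"},
        {"_additional": {"score": "2.8725255"}, "answer": "cake"},
        {"_additional": {"score": "2.7672548"}, "answer": "a closer grocer"}
      ]
    }
  }
}
"""


def scores_as_floats(response, collection="JeopardyQuestion"):
    """Return (answer, score) pairs with the score converted to float."""
    hits = response["data"]["Get"][collection]
    return [(hit["answer"], float(hit["_additional"]["score"])) for hit in hits]


pairs = scores_as_floats(json.loads(raw_response))
for answer, score in pairs:
    print(f"{score:.4f}  {answer}")
```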
Limit the results
You can limit the number of result objects that a bm25 search returns.
- For a fixed number of objects, use the limit: <N> operator.
- For groups based on discontinuities in score, use the autocut operator.
You can combine autocut with limit: N to limit the size of the autocut groups to the first N objects.
Limiting the number of results
Use the limit argument to specify the maximum number of results that should be returned:
- Python (v4)
- Python (v3)
- JavaScript/TypeScript
- GraphQL
import weaviate.classes as wvc
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    limit=3
)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))
response = (
client.query
.get('JeopardyQuestion', ['question', 'answer'])
.with_bm25(
query='safety'
)
.with_additional('score')
.with_limit(3)
.do()
)
print(json.dumps(response, indent=2))
result = await client.graphql
.get()
.withClassName('JeopardyQuestion')
.withBm25({
query: 'safety',
})
.withFields('question answer _additional { score }')
.withLimit(3)
.do();
console.log(JSON.stringify(result, null, 2));
{
Get {
JeopardyQuestion(
bm25: {
query: "safety"
}
limit: 3
) {
question
answer
_additional {
score
}
}
}
}
Autocut
Weaviate can also limit results based on discontinuities in the result set. In the legacy clients, this filter is called autocut. In the new Python client, it is called auto_limit.
The filter looks for discontinuities, or jumps, in the result score. In your query, you specify how many jumps there should be. The query stops returning results after the specified number of jumps.
For example, consider a bm25 search that returns these scores: [2.676, 2.021, 2.022, 1.854, 1.856, 1.713]. There is a significant break between 2.676 and 2.021. There is another significant break between 2.022 and 1.854. Autocut uses the number of breaks to return data groups.
- autocut: 1 returns one object, [2.676].
- autocut: 2 returns three objects, [2.676, 2.021, 2.022].
- autocut: 3 returns all objects, [2.676, 2.021, 2.022, 1.854, 1.856, 1.713].
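The grouping described above can be sketched in a few lines of Python. This is not Weaviate's actual jump-detection algorithm: the function name autocut, the fixed min_jump threshold of 0.15, and the way limit is applied are all assumptions, picked only so that the sketch reproduces the three cases listed above.

```python
def autocut(scores, max_jumps, min_jump=0.15, limit=None):
    """Keep results up to, but not including, the max_jumps-th jump.

    min_jump is a hypothetical threshold for what counts as a jump.
    """
    kept = list(scores)
    jumps = 0
    for i in range(1, len(scores)):
        # A jump is a drop between neighbouring scores larger than min_jump.
        if scores[i - 1] - scores[i] > min_jump:
            jumps += 1
            if jumps == max_jumps:
                kept = list(scores[:i])
                break
    # An optional limit caps the size of the returned group.
    return kept if limit is None else kept[:limit]


scores = [2.676, 2.021, 2.022, 1.854, 1.856, 1.713]
for n in (1, 2, 3):
    print(n, autocut(scores, n))
```

With this threshold the drop from 1.856 to 1.713 does not count as a jump, so asking for three jumps never reaches a cut point and every result is kept.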
Autocut can be used as follows:
- Python (v4)
- Python (v3)
- JavaScript/TypeScript
- GraphQL
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    auto_limit=1
)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))
response = (
client.query
.get('JeopardyQuestion', ['question', 'answer'])
.with_bm25(
query='safety'
)
.with_additional('score')
.with_autocut(1)
.do()
)
print(json.dumps(response, indent=2))
result = await client.graphql
.get()
.withClassName('JeopardyQuestion')
.withBm25({
query: 'safety',
})
.withFields('question answer _additional { score }')
.withAutocut(1)
.do();
console.log(JSON.stringify(result, null, 2));
{
Get {
JeopardyQuestion(
bm25: {
query: "safety"
}
autocut: 1
) {
question
answer
_additional {
score
}
}
}
}
Example response
It should produce a response like the one below:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "2.6768136"
},
"answer": "OSHA (Occupational Safety and Health Administration)",
"question": "The government admin. was created in 1971 to ensure occupational health & safety standards"
}
]
}
}
}
Selected properties only
Starting in v1.19.0, you can specify the object properties to search.
This example searches for objects that have the keyword food, but only when food is in the question property. The query uses the BM25F scores of the searched property to rank the objects it finds. It returns the top three objects.
- Python (v4)
- Python (v3)
- JavaScript/TypeScript
- GraphQL
import weaviate.classes as wvc
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question"],
    return_metadata=wvc.MetadataQuery(score=True),
    limit=3
)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))
    print(o.metadata.score)
response = (
client.query
.get("JeopardyQuestion", ["question", "answer"])
.with_bm25(
query="food",
properties=["question"]
)
.with_additional("score")
.with_limit(3)
.do()
)
print(json.dumps(response, indent=2))
result = await client.graphql
.get()
.withClassName('JeopardyQuestion')
.withBm25({
query: 'food',
properties: ['question'],
})
.withLimit(3)
.withFields('question answer _additional { score }')
.do();
console.log(JSON.stringify(result, null, 2));
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
properties: ["question"]
}
) {
question
answer
_additional {
score
}
}
}
}
Example response
It should produce a response like the one below:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.7079012"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "3.4311616"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
},
{
"_additional": {
"score": "2.8312314"
},
"answer": "honey",
"question": "The primary source of this food is the Apis mellifera"
}
]
}
}
}
Use weights to boost properties
You can specify weights for properties to change how much the property affects the overall BM25F score.
This example searches for objects that contain the keyword food in the question property and the answer property. The weight of the question property is boosted (^2). The query returns the top three objects.
- Python (v4)
- Python (v3)
- JavaScript/TypeScript
- GraphQL
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question^2", "answer"],
    limit=3
)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))
response = (
client.query
.get("JeopardyQuestion", ["question", "answer"])
.with_bm25(
query="food",
properties=["question^2", "answer"]
)
.with_additional("score")
.with_limit(3)
.do()
)
print(json.dumps(response, indent=2))
result = await client.graphql
.get()
.withClassName('JeopardyQuestion')
.withBm25({
query: 'food',
properties: ['question^2', 'answer'],
})
.withLimit(3)
.withFields('question answer _additional { score }')
.do();
console.log(JSON.stringify(result, null, 2));
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
properties: ["question^2", "answer"]
}
) {
question
answer
_additional {
score
}
}
}
}
Example response
It should produce a response like the one below:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "4.0038033"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "3.8706005"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
},
{
"_additional": {
"score": "3.2457707"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
}
]
}
}
}
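To see why boosting question changes the order, the sketch below parses the property^weight syntax and uses the weight to scale per-property term counts, which is the general idea behind BM25F. It is a simplified illustration under that assumption, not Weaviate's scoring code: parse_boost and weighted_term_frequency are hypothetical helpers, and a real scorer would still pass the weighted counts through BM25 saturation and length normalization.

```python
def parse_boost(spec):
    """Split 'question^2' into ('question', 2.0); the default weight is 1.0."""
    name, sep, weight = spec.partition("^")
    return name, float(weight) if sep else 1.0


def weighted_term_frequency(obj, term, query_properties):
    """Sum the term counts of each searched property, scaled by its weight."""
    total = 0.0
    for spec in query_properties:
        name, weight = parse_boost(spec)
        # Simplified tokenization: lowercase and split on whitespace.
        tokens = obj.get(name, "").lower().split()
        total += weight * tokens.count(term)
    return total


grocer = {"question": "A nearer food merchant", "answer": "a closer grocer"}
stores = {
    "question": "This type of retail store sells more shampoo & makeup than any other",
    "answer": "food stores (supermarkets)",
}
props = ["question^2", "answer"]
print(weighted_term_frequency(grocer, "food", props))
print(weighted_term_frequency(stores, "food", props))
```

A match in the boosted question property counts twice as much as a match in answer, which is consistent with the object that only matches in answer dropping to third place in the example response above.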
Tokenization
The BM25 query string is tokenized before it is used to search for objects using the inverted index. Due to the nature of BM25 scoring, Weaviate returns any object that matches at least one of the tokens.
This example returns objects that contain either food or wine in the question property, and ranks them using BM25F scores.
- Python (v4)
- Python (v3)
- JavaScript/TypeScript
- GraphQL
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food wine",  # search for food or wine
    query_properties=["question"],
    return_properties=["question"],  # only return question property
    limit=5
)
for o in response.objects:
    print(o.properties["question"])
response = (
client.query
.get('JeopardyQuestion', ['question'])
.with_bm25(
query='food wine',
properties=['question']
)
.with_additional('score')
.with_limit(5)
.do()
)
print(json.dumps(response, indent=2))
result = await client.graphql
.get()
.withClassName('JeopardyQuestion')
.withBm25({
query: 'food wine',
properties: ['question'],
})
.withLimit(5)
.withFields('question _additional { score }')
.do();
console.log(JSON.stringify(result, null, 2));
{
Get {
JeopardyQuestion(
limit: 5
bm25: {
query: "food wine"
properties: ["question"]
}
) {
question
_additional {
score
}
}
}
}
Example response
The query should produce a response like the one below:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "4.4707017"
},
"question": "Wine, a ship, Croce's time"
},
{
"_additional": {
"score": "3.7450757"
},
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "3.647569"
},
"question": "Type of event in Cana at which Jesus turned water into wine"
},
{
"_additional": {
"score": "3.4594069"
},
"question": "A nearer food merchant"
},
{
"_additional": {
"score": "3.3400855"
},
"question": "Sparkling wine sold under the name Champagne must come from this region in Northeast France"
}
]
}
}
}
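The "at least one token" behavior can be illustrated with a small pure-Python sketch. It assumes a word-style tokenization that lowercases the text and splits on non-alphanumeric characters; the tokenization actually applied depends on how each property is configured, and word_tokenize and matches_any are hypothetical helpers.

```python
import re


def word_tokenize(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    return [token for token in re.split(r"[^0-9a-zA-Z]+", text.lower()) if token]


def matches_any(query, text):
    """True if the text shares at least one token with the query."""
    return bool(set(word_tokenize(query)) & set(word_tokenize(text)))


candidates = [
    "Wine, a ship, Croce's time",
    "A nearer food merchant",
    "The government admin. was created in 1971 to ensure occupational health & safety standards",
]
for text in candidates:
    print(matches_any("food wine", text), text)
```

The first two questions each contain one of the two query tokens, so both qualify as matches; ranking them is then left to the BM25F score.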
Add a conditional (where) filter
You can add a conditional filter to any BM25 search query. The filter restricts which objects are returned, but it does not change how they are ranked.
These examples perform a BM25 search for food in any field. The filter keeps only objects that have the round property set to Double Jeopardy!.
To filter with one of the legacy clients, use with_where. The new Python client uses the Filter class from weaviate.classes.
- Python (v4)
- Python (v3)
- JavaScript/TypeScript
- GraphQL
import weaviate.classes as wvc
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    filters=wvc.Filter("round").equal("Double Jeopardy!"),
    return_properties=["answer", "question", "round"],  # return these properties
    limit=3
)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))
response = (
client.query
.get("JeopardyQuestion", ["question", "answer", "round"])
.with_bm25(
query="food"
)
.with_where({
"path": ["round"],
"operator": "Equal",
"valueText": "Double Jeopardy!"
})
.with_additional("score")
.with_limit(3)
.do()
)
print(json.dumps(response, indent=2))
result = await client.graphql
.get()
.withClassName('JeopardyQuestion')
.withBm25({
query: 'food',
})
.withWhere({
path: ['round'],
operator: 'Equal',
valueText: 'Double Jeopardy!',
})
.withLimit(3)
.withFields('question answer round _additional { score }')
.do();
console.log(JSON.stringify(result, null, 2));
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
}
where: {
path: ["round"]
operator: Equal
valueText: "Double Jeopardy!"
}
) {
question
answer
round
_additional {
score
}
}
}
}
Example response
It should produce a response like the one below:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.0140665"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other",
"round": "Double Jeopardy!"
},
{
"_additional": {
"score": "1.9633813"
},
"answer": "honey",
"question": "The primary source of this food is the Apis mellifera",
"round": "Double Jeopardy!"
},
{
"_additional": {
"score": "1.6719631"
},
"answer": "pseudopods",
"question": "Amoebas use temporary extensions called these to move or to surround & engulf food",
"round": "Double Jeopardy!"
}
]
}
}
}
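The example responses on this page show the effect of the filter on ranking: "food stores (supermarkets)" has the same score, 3.0140665, with and without the filter. The sketch below mimics that behavior on already-scored results. The scores come from the example responses, the round value for the two objects that the filter removes is a hypothetical placeholder, and filtered_top_k is a hypothetical helper rather than a client function.

```python
def filtered_top_k(scored_objects, round_name, k):
    """Drop objects from other rounds, then return the k best by score."""
    kept = [obj for obj in scored_objects if obj["round"] == round_name]
    # Scores are left untouched; the filter only decides which objects remain.
    return sorted(kept, key=lambda obj: obj["score"], reverse=True)[:k]


scored = [
    {"answer": "food stores (supermarkets)", "score": 3.0140665, "round": "Double Jeopardy!"},
    {"answer": "cake", "score": 2.8725255, "round": "Jeopardy!"},  # hypothetical round
    {"answer": "a closer grocer", "score": 2.7672548, "round": "Jeopardy!"},  # hypothetical round
    {"answer": "honey", "score": 1.9633813, "round": "Double Jeopardy!"},
    {"answer": "pseudopods", "score": 1.6719631, "round": "Double Jeopardy!"},
]
top = filtered_top_k(scored, "Double Jeopardy!", 3)
for obj in top:
    print(obj["score"], obj["answer"])
```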