Tokenization and searches
You saw how tokenization affects filters. It impacts keyword searches in a similar, but not identical, way. In this section, we'll see how different tokenization methods impact search results.
A hybrid search combines results from a keyword search and a vector search. Accordingly, tokenization impacts the keyword search part of a hybrid search, while the vector search part is not impacted by tokenization.
We will not separately discuss hybrid searches in this course. However, the impact on keyword searches discussed here will apply to the keyword search part of a hybrid search.
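To make this concrete, here is a minimal sketch of a hybrid query with the Python client, assuming the collection has a vectorizer configured. The query string, alpha value, and property list are illustrative placeholders; the query_properties argument restricts the keyword (BM25) part of the search, which is where tokenization matters.

import weaviate

client = weaviate.connect_to_local()  # or weaviate.connect_to_weaviate_cloud(...)

collection = client.collections.get("TokenizationDemo")

# Hybrid search: the keyword (BM25) part is tokenized; the vector part is not
response = collection.query.hybrid(
    query="lois clark",
    alpha=0.5,                       # 0 = keyword search only, 1 = vector search only
    query_properties=["text_word"],  # properties used for the keyword part
    limit=3,
)

for obj in response.objects:
    print(obj.properties)

client.close()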
Impact on keyword searches
How tokenization impacts keyword searches
We will use a similar method as in the previous section, with a difference being that we will now perform a keyword search instead of a filter.
A keyword search ranks results using the BM25f algorithm. As a result, the impact of tokenization on keyword searches is twofold.
Firstly, tokenization will determine whether a result is included in the search results at all. If none of the tokens in the search query match any tokens in the object, the object will not be included in the search results.
Secondly, tokenization will impact the ranking of the search results. The BM25f algorithm takes into account the number of matching tokens, and the tokenization method will determine which tokens are considered matching.
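As a rough illustration only (this is not Weaviate's actual tokenizer, just a hypothetical approximation of the word, lowercase, whitespace, and field behaviors), the sketch below shows how the same query can produce different token sets, and therefore match or miss the same object:

import re

def tokenize(text: str, method: str) -> list[str]:
    # Approximation only: Weaviate's real tokenizers differ in the details
    if method == "field":
        return [text.strip()]            # the whole value is a single token
    if method == "whitespace":
        return text.split()              # split on whitespace, keep case
    if method == "lowercase":
        return text.lower().split()      # split on whitespace, lowercase
    return re.findall(r"[a-z0-9]+", text.lower())  # "word": alphanumeric runs, lowercased

title = "Lois & Clark: The New Adventures of Superman"
for method in ["word", "lowercase", "whitespace", "field"]:
    query_tokens = set(tokenize("clark:", method))
    title_tokens = set(tokenize(title, method))
    print(f"{method:<10} match: {bool(query_tokens & title_tokens)}")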
Search setup
We'll set up a reusable function to perform keyword searches and display the top results along with their scores. Each keyword query will look something like the `collection.query.bm25(...)` call inside the function below.
import weaviate
from weaviate.classes.query import MetadataQuery
from weaviate.collections import Collection
from typing import List

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

collection = client.collections.get("TokenizationDemo")

# Get property names
property_names = list()
for p in collection.config.get().properties:
    property_names.append(p.name)

query_strings = ["<YOUR_QUERY_STRING>"]

def search_demo(collection: Collection, property_names: List[str], query_strings: List[str]):
    for query_string in query_strings:
        print("\n" + "=" * 40 + f"\nBM25 search results for: '{query_string}'" + "\n" + "=" * 40)
        for property_name in property_names:
            # Run a BM25 keyword search against one property at a time
            response = collection.query.bm25(
                query=query_string,
                return_metadata=MetadataQuery(score=True),
                query_properties=[property_name],
            )
            if len(response.objects) > 0:
                print(f">> '{property_name}' search results")
                for obj in response.objects:
                    print(obj.properties[property_name], round(obj.metadata.score, 3))

search_demo(collection, property_names, query_strings)
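Once you have finished querying, close the client connection to release its resources:

client.close()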
Examples
"Clark: "vs "clark" - messy text
Keyword searches are impacted by tokenization in a similar way to filters. However, there are subtle differences.
Take a look at this example, where we search for various combinations of substrings from the TV show title "Lois & Clark: The New Adventures of Superman".
The table shows whether each query matched the title, and the resulting score:
Query | word | lowercase | whitespace | field
---|---|---|---|---
"clark" | 0.613 | ❌ | ❌ | ❌
"Clark" | 0.613 | ❌ | ❌ | ❌
"clark:" | 0.613 | 0.48 | ❌ | ❌
"Clark:" | 0.613 | 0.48 | 0.48 | ❌
"lois clark" | 1.226 | 0.48 | ❌ | ❌
"clark lois" | 1.226 | 0.48 | ❌ | ❌
Python query & output
search_demo(collection, property_names, ["clark", "Clark", "clark:", "Clark:", "lois clark", "clark lois"])
========================================
BM25 search results for: 'clark'
========================================
>> 'text_word' search results
Lois & Clark: The New Adventures of Superman 0.613
========================================
BM25 search results for: 'Clark'
========================================
>> 'text_word' search results
Lois & Clark: The New Adventures of Superman 0.613
========================================
BM25 search results for: 'clark:'
========================================
>> 'text_word' search results
Lois & Clark: The New Adventures of Superman 0.613
>> 'text_lowercase' search results
Lois & Clark: The New Adventures of Superman 0.48
========================================
BM25 search results for: 'Clark:'
========================================
>> 'text_word' search results
Lois & Clark: The New Adventures of Superman 0.613
>> 'text_lowercase' search results
Lois & Clark: The New Adventures of Superman 0.48
>> 'text_whitespace' search results
Lois & Clark: The New Adventures of Superman 0.48
========================================
BM25 search results for: 'lois clark'
========================================
>> 'text_word' search results
Lois & Clark: The New Adventures of Superman 1.226
>> 'text_lowercase' search results
Lois & Clark: The New Adventures of Superman 0.48
========================================
BM25 search results for: 'clark lois'
========================================
>> 'text_word' search results
Lois & Clark: The New Adventures of Superman 1.226
>> 'text_lowercase' search results
Lois & Clark: The New Adventures of Superman 0.48
Here, the same results are returned as in the filter example. However, note the differences in the scores.
For example, a search for "lois clark" returns a higher score than a search for "clark". This is because the BM25f algorithm takes the number of matching tokens into account, so it is beneficial to include as many matching tokens as possible in the search query.
Another difference is that a keyword search will return objects that match any of the tokens in the query. A filter, by contrast, is sensitive to the filtering operator: depending on the desired result, you could use an "Equal", "ContainsAny", or "ContainsAll" operator, for example (see the sketch below).
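For comparison, here is a sketch of how those filter operators might look with the Python client, next to the equivalent bm25 query. The property name and query tokens are illustrative, reusing the demo collection from above.

from weaviate.classes.query import Filter

# Keyword search: objects matching *any* query token are returned, ranked by BM25 score
response = collection.query.bm25(query="lois clark", query_properties=["text_word"])

# Filters: the operator decides how tokens must match, and there is no ranking
any_match = collection.query.fetch_objects(
    filters=Filter.by_property("text_word").contains_any(["lois", "clark"])
)
all_match = collection.query.fetch_objects(
    filters=Filter.by_property("text_word").contains_all(["lois", "clark"])
)
exact = collection.query.fetch_objects(
    filters=Filter.by_property("text_word").equal("lois clark")
)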
The next section will demonstrate this, as well as how stop words are treated.
"A mouse" vs "mouse" - stop words
Here, we search for variants of the phrase "computer mouse", where some queries include additional words.
Now, take a look at the results.
Matches for "computer mouse"
Object | word | lowercase | whitespace | field
---|---|---|---|---
"computer mouse" | 0.889 | 0.819 | 1.01 | 0.982
"Computer Mouse" | 0.889 | 0.819 | ❌ | ❌
"a computer mouse" | 0.764 | 0.688 | 0.849 | ❌
"computer mouse pad" | 0.764 | 0.688 | 0.849 | ❌
Matches for "a computer mouse"
Object | word | lowercase | whitespace | field
---|---|---|---|---
"computer mouse" | 0.889 | 0.819 | 1.01 | ❌
"Computer Mouse" | 0.889 | 0.819 | ❌ | ❌
"a computer mouse" | 0.764 | 1.552 | 1.712 | 0.982
"computer mouse pad" | 0.764 | 0.688 | 0.849 | ❌
Python query & output
search_demo(collection, property_names, ["computer mouse", "a computer mouse", "the computer mouse", "blue computer mouse"])
========================================
BM25 search results for: 'computer mouse'
========================================
>> 'text_word' search results
mouse computer 0.889
Computer Mouse 0.889
computer mouse 0.889
a computer mouse 0.764
computer mouse pad 0.764
>> 'text_lowercase' search results
mouse computer 0.819
Computer Mouse 0.819
computer mouse 0.819
a computer mouse 0.688
computer mouse pad 0.688
>> 'text_whitespace' search results
mouse computer 1.01
computer mouse 1.01
a computer mouse 0.849
computer mouse pad 0.849
>> 'text_field' search results
computer mouse 0.982
========================================
BM25 search results for: 'a computer mouse'
========================================
>> 'text_word' search results
mouse computer 0.889
Computer Mouse 0.889
computer mouse 0.889
a computer mouse 0.764
computer mouse pad 0.764
>> 'text_lowercase' search results
a computer mouse 1.552
mouse computer 0.819
Computer Mouse 0.819
computer mouse 0.819
computer mouse pad 0.688
>> 'text_whitespace' search results
a computer mouse 1.712
mouse computer 1.01
computer mouse 1.01
computer mouse pad 0.849
>> 'text_field' search results
a computer mouse 0.982
========================================
BM25 search results for: 'the computer mouse'
========================================
>> 'text_word' search results
mouse computer 0.889
Computer Mouse 0.889
computer mouse 0.889
a computer mouse 0.764
computer mouse pad 0.764
>> 'text_lowercase' search results
mouse computer 0.819
Computer Mouse 0.819
computer mouse 0.819
a computer mouse 0.688
computer mouse pad 0.688
Lois & Clark: The New Adventures of Superman 0.48
>> 'text_whitespace' search results
mouse computer 1.01
computer mouse 1.01
a computer mouse 0.849
computer mouse pad 0.849
========================================
BM25 search results for: 'blue computer mouse'
========================================
>> 'text_word' search results
mouse computer 0.889
Computer Mouse 0.889
computer mouse 0.889
a computer mouse 0.764
computer mouse pad 0.764
>> 'text_lowercase' search results
mouse computer 0.819
Computer Mouse 0.819
computer mouse 0.819
a computer mouse 0.688
computer mouse pad 0.688
>> 'text_whitespace' search results
mouse computer 1.01
computer mouse 1.01
a computer mouse 0.849
computer mouse pad 0.849
The results here are similar to the filter example, but more nuanced and quite interesting!
Under `word` tokenization, the search for "computer mouse" produces identical results to the search for "a computer mouse". This is because the stop word "a" is not considered in the search.
But note that the scores differ between returned objects whose only difference is a stop word, such as "computer mouse" and "a computer mouse". This is because the BM25f algorithm does index stop words, and they do impact the score.
As a user, you should keep this in mind. You can configure the stop words in the collection definition to suit your desired behavior, as shown in the sketch below.
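For example, stop word behavior can be adjusted through the inverted index configuration when the collection is created. This is a minimal sketch; the added and removed words are illustrative placeholders.

from weaviate.classes.config import Configure, StopwordsPreset

client.collections.create(
    "TokenizationDemo",
    inverted_index_config=Configure.inverted_index(
        stopwords_preset=StopwordsPreset.EN,  # start from the built-in English stop word list
        stopwords_additions=["pad"],          # also treat these words as stop words
        stopwords_removals=["a"],             # no longer treat these preset words as stop words
    ),
    # ... property definitions omitted
)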
Another interesting note is that the `lowercase` and `whitespace` tokenization methods do not remove stop words from the query. This behavior allows users who want to include stop words in their search queries to do so.
"variable_name" vs "variable name" - symbols
The table below shows keyword search results using the string "variable_name" and the resulting scores.
Object | word | lowercase | whitespace | field
---|---|---|---|---
"variable_name" | 0.716 | 0.97 | 1.27 | 0.982
"Variable_Name" | 0.716 | 0.97 | ❌ | ❌
"Variable Name" | 0.716 | ❌ | ❌ | ❌
"a_variable_name" | 0.615 | ❌ | ❌ | ❌
"the_variable_name" | 0.615 | ❌ | ❌ | ❌
"variable_new_name" | 0.615 | ❌ | ❌ | ❌
Python query & output
search_demo(collection, property_names, ["variable_name"])
========================================
BM25 search results for: 'variable_name'
========================================
>> 'text_word' search results
Variable Name 0.716
Variable_Name 0.716
variable_name 0.716
variable_new_name 0.615
the_variable_name 0.615
a_variable_name 0.615
>> 'text_lowercase' search results
Variable_Name 0.97
variable_name 0.97
>> 'text_whitespace' search results
variable_name 1.27
>> 'text_field' search results
variable_name 0.982
These results are once again similar to the filter example. If your data contains symbols that are important to your search, you should consider using a tokenization method that preserves symbols, such as `lowercase` or `whitespace`.
Discussions
That's it for keyword searches and tokenization. Similarly to filters, the choice of tokenization method is a big part of your overall search strategy.
Our general advice for tokenization in keyword searches is similar to our advice for filtering. Start with `word`, and consider others such as `lowercase` or `whitespace` if symbols or letter case encode important information in your data.
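As a reminder, tokenization is configured per property in the collection definition. A minimal sketch, using the property names from the demo collection above:

from weaviate.classes.config import DataType, Property, Tokenization

client.collections.create(
    "TokenizationDemo",
    properties=[
        Property(name="text_word", data_type=DataType.TEXT, tokenization=Tokenization.WORD),
        Property(name="text_lowercase", data_type=DataType.TEXT, tokenization=Tokenization.LOWERCASE),
        Property(name="text_whitespace", data_type=DataType.TEXT, tokenization=Tokenization.WHITESPACE),
        Property(name="text_field", data_type=DataType.TEXT, tokenization=Tokenization.FIELD),
    ],
)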
Using `field` tokenization may be too strict for keyword searches, as it will not match any objects that do not contain the exact string in the exact order.
Lastly, keep in mind that keyword searches produce ranked results. Therefore, tokenization will not only affect the results set but also their ranking within the set.
With these considerations in mind, you can configure your tokenization strategy to best suit your data and your users' needs.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.