
Tokenization and filters

Now that you've learned about different tokenization methods, let's put them into practice. In this section, you'll see how tokenization impacts filters.

Preparation

For this section, we'll work with an actual Weaviate instance to see how different tokenization methods impact filtering results.

We are going to use a very small, custom dataset for demonstration purposes.

collection = client.collections.get("TokenizationDemo")

phrases = [
    # string with special characters
    "Lois & Clark: The New Adventures of Superman",

    # strings with stopwords & varying orders
    "computer mouse",
    "Computer Mouse",
    "mouse computer",
    "computer mouse pad",
    "a computer mouse",

    # strings without spaces
    "variable_name",
    "Variable_Name",
    "Variable Name",
    "a_variable_name",
    "the_variable_name",
    "variable_new_name",
]

To follow along, you can use the following Python code to add this data to your Weaviate instance.

Steps to create a collection

We will create a simple object collection, with each object containing multiple properties. Each property will contain the same text, but with a different tokenization method applied.

import weaviate
from weaviate.classes.config import Property, DataType, Tokenization, Configure

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_wcs(...) or
# client = weaviate.connect_to_local()

tkn_options = [
    Tokenization.WORD,
    Tokenization.LOWERCASE,
    Tokenization.WHITESPACE,
    Tokenization.FIELD,
]

# Create a property for each tokenization option
properties = [
    Property(
        name=f"text_{tokenization}",
        data_type=DataType.TEXT,
        tokenization=tokenization,
    )
    for tokenization in tkn_options  # Note the list comprehension
]

client.collections.create(
    name="TokenizationDemo",
    properties=properties,
    vectorizer_config=Configure.Vectorizer.none(),
)

client.close()

Note that we do not add object vectors in this case, as we are only interested in the impact of tokenization on filters (and keyword searches).
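If you would like to double-check the result, an optional sketch like the one below prints each property's tokenization method. It assumes an open client connection, as in the snippets above.

import weaviate

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_local()

collection = client.collections.get("TokenizationDemo")

# Print each property's name and its tokenization method
for p in collection.config.get().properties:
    print(p.name, p.tokenization)

client.close()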

Steps to add objects

Now, we add objects to the collection, writing the same text to each of the properties.

import weaviate

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_wcs(...) or
# client = weaviate.connect_to_local()

collection = client.collections.get("TokenizationDemo")

# Get property names
property_names = [p.name for p in collection.config.get().properties]

phrases = [
    # string with special characters
    "Lois & Clark: The New Adventures of Superman",

    # strings with stopwords & varying orders
    "computer mouse",
    "Computer Mouse",
    "mouse computer",
    "computer mouse pad",
    "a computer mouse",

    # strings without spaces
    "variable_name",
    "Variable_Name",
    "Variable Name",
    "a_variable_name",
    "the_variable_name",
    "variable_new_name",
]

for phrase in phrases:
    # Use the same text for every property of the object
    obj_properties = {name: phrase for name in property_names}
    print(obj_properties)
    collection.data.insert(properties=obj_properties)

client.close()

Impact on filters

Now that we have added a set of objects to Weaviate, let's see how different tokenization methods impact filtered retrieval.

We'll set up a reusable function that filters the objects on each property, given a set of query strings. Remember that a filter is binary: it either matches an object or it doesn't.

The function will print the matched objects for each property, so we can see which tokenization methods produce a match.

import weaviate
from weaviate.classes.query import Filter

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_wcs(...) or
# client = weaviate.connect_to_local()

collection = client.collections.get("TokenizationDemo")

# Get property names
property_names = [p.name for p in collection.config.get().properties]

query_strings = ["<YOUR_QUERY_STRING>"]


def filter_demo(query_strings: list[str]):
    for query_string in query_strings:
        print("\n" + "=" * 40 + f"\nHits for: '{query_string}'" + "\n" + "=" * 40)
        for property_name in property_names:
            # Filter on this property for an exact (tokenized) match
            response = collection.query.fetch_objects(
                filters=Filter.by_property(property_name).equal(query_string),
            )
            if len(response.objects) > 0:
                print(f">> '{property_name}' matches")
                for obj in response.objects:
                    print(obj.properties[property_name])


filter_demo(query_strings)

"Clark: "vs "clark" - messy text

Typical text is often messy, with punctuation, mixed case, and other irregularities. Take a look at this example, where we filter for various combinations of substrings from the TV show title "Lois & Clark: The New Adventures of Superman".

The table shows whether the query matched the title:

Query          | word | lowercase | whitespace | field
"clark"        | ✅   | ❌        | ❌         | ❌
"Clark"        | ✅   | ❌        | ❌         | ❌
"clark:"       | ✅   | ✅        | ❌         | ❌
"Clark:"       | ✅   | ✅        | ✅         | ❌
"lois clark"   | ✅   | ❌        | ❌         | ❌
"clark lois"   | ✅   | ❌        | ❌         | ❌
Python query & output
filter_demo(["clark", "Clark", "clark:", "Clark:", "lois clark", "clark lois"])
========================================
Hits for: 'clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'Clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'clark:'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman
>> 'text_lowercase' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'Clark:'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman
>> 'text_lowercase' matches
Lois & Clark: The New Adventures of Superman
>> 'text_whitespace' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'lois clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'clark lois'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

Note how word tokenization was the only method that consistently matched the title. The other methods only matched when the colon (:) was included in the query (and, for whitespace, with the exact capitalization). This is because the word tokenization method treats the colon as a separator and lowercases the text, whereas the other methods keep the colon attached to the token "Clark:".

Users cannot be expected to include punctuation, or the exact capitalization, in their queries. As a result, the word tokenization method is a good starting point for typical text filtering.
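To build an intuition for why, here is a rough sketch in plain Python that approximates how each method might split the title. This is illustrative only - it is not Weaviate's actual tokenizer implementation.

import re

title = "Lois & Clark: The New Adventures of Superman"

# Rough approximations of each tokenization method (illustrative only):
approx_tokens = {
    # word: split on anything that is not a letter or digit, then lowercase
    "word": [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", title) if t],
    # lowercase: split on whitespace only, then lowercase
    "lowercase": [t.lower() for t in title.split()],
    # whitespace: split on whitespace only, keeping the original case
    "whitespace": title.split(),
    # field: the whole trimmed string as a single token
    "field": [title.strip()],
}

for method, tokens in approx_tokens.items():
    print(f"{method:>10}: {tokens}")

# The filter string "clark" is tokenized the same way, so it can only
# match where "clark" appears as a token - i.e. under `word`, which both
# strips the colon and lowercases the text.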

"A mouse" vs "mouse" - stop words

Here, we filter for variants of the phrase "computer mouse", where some queries include additional words.

Now, take a look at the results.

Matches for "computer mouse"

wordlowercasewhitespacefield
"computer mouse"
"a computer mouse"
"the computer mouse:"
"blue computer mouse"

Matches for "a computer mouse"

wordlowercasewhitespacefield
"computer mouse"
"a computer mouse"
"the computer mouse:"
"blue computer mouse"
Python query & output
filter_demo(["computer mouse", "a computer mouse", "the computer mouse", "blue computer mouse"])
========================================
Hits for: 'computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_field' matches
computer mouse

========================================
Hits for: 'a computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_field' matches
a computer mouse

========================================
Hits for: 'the computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse

========================================
Hits for: 'blue computer mouse'
========================================

The results indicate that adding the word "a" or "the" to the query does not change the filter results for any method except field. This is because, under the other tokenization methods, "a" and "the" are treated as stop words and ignored.

With the field method, stop word tokens like "a" or "the" are never produced in the first place: the input "a computer mouse" is tokenized to ["a computer mouse"], a single token.

Adding a word that is not a stop word, such as "blue", causes the query to not match any objects.
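As a rough illustration of the stop word behavior (assuming a minimal stop word list of just "a" and "the"; Weaviate's actual stop word list is configurable on the collection), a sketch like this shows why the first three queries behave identically:

STOPWORDS = {"a", "the"}  # assumption: minimal stop word list for illustration

def word_like_tokens(text: str) -> set[str]:
    # Approximate word-style tokenization followed by stop word removal
    return {t.lower() for t in text.split() if t.lower() not in STOPWORDS}

print(word_like_tokens("computer mouse"))       # {'computer', 'mouse'}
print(word_like_tokens("a computer mouse"))     # {'computer', 'mouse'} - same token set
print(word_like_tokens("the computer mouse"))   # {'computer', 'mouse'} - same token set
print(word_like_tokens("blue computer mouse"))  # {'blue', 'computer', 'mouse'} - extra token, no match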

"variable_name" vs "variable name" - symbols

The word tokenization method is a good default, but it may not always be the best choice. Take a look at this example, where we filter for the string "variable_name" and see which of the similar stored strings it matches. The table shows which stored strings matched the filter:

Matched string        | word | lowercase | whitespace | field
"variable_name"       | ✅   | ✅        | ✅         | ✅
"Variable_Name"       | ✅   | ✅        | ❌         | ❌
"Variable Name"       | ✅   | ❌        | ❌         | ❌
"a_variable_name"     | ✅   | ❌        | ❌         | ❌
"the_variable_name"   | ✅   | ❌        | ❌         | ❌
"variable_new_name"   | ✅   | ❌        | ❌         | ❌
Python query & output
filter_demo(["variable_name"])
========================================
Hits for: 'variable_name'
========================================
>> 'text_word' matches
variable_name
Variable_Name
Variable Name
a_variable_name
the_variable_name
variable_new_name
>> 'text_lowercase' matches
variable_name
Variable_Name
>> 'text_whitespace' matches
variable_name
>> 'text_field' matches
variable_name

What is the desired behavior here? Should a filter for "variable name" match an object containing the string "variable_name"?

What about a filter for "variable_new_name"? If the goal is to look through, say, a code base, the user might not expect a filter for "variable_new_name" to match "variable_name".

In cases such as these, where symbols are important to your data, you should consider using a tokenization method that preserves symbols, such as lowercase or whitespace.
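As a rough illustration (approximated in plain Python, and assuming that an equal filter on text matches when every query token appears in the stored text, which is consistent with the results above), this is why "variable_name" matches "variable_new_name" under word but not under whitespace:

import re

def word_tokens(text: str) -> set[str]:
    # Approximate `word`: split on non-alphanumeric characters (including "_"), lowercase
    return {t.lower() for t in re.split(r"[^a-zA-Z0-9]+", text) if t}

def whitespace_tokens(text: str) -> set[str]:
    # Approximate `whitespace`: split on whitespace only, keep case and symbols
    return set(text.split())

query = "variable_name"
stored = "variable_new_name"

# word: query tokens {'variable', 'name'} are all present in
# {'variable', 'new', 'name'}, so the filter matches
print(word_tokens(query) <= word_tokens(stored))              # True

# whitespace: 'variable_name' stays a single token that does not appear
# in the stored string, so the filter does not match
print(whitespace_tokens(query) <= whitespace_tokens(stored))  # False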

Discussions

We've discussed how different tokenization methods impact filters.

For most filtering use cases, the word tokenization method is a good starting point. It is case-insensitive and treats most symbols as separators.

However, if symbols are important to your data, or if you need to distinguish between different cases, you may want to consider using a different tokenization method.

And what about field tokenization? This method is most useful when the whole text should be treated as a single token, such as email addresses, URLs, or identifiers.

A typical filtering strategy with a field tokenization method might involve exact matches, or partial matches with wildcards. Do note, however, that wildcard-based filters can be computationally expensive (slow) - so use them judiciously.
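For example, here is a minimal sketch of both approaches, assuming a hypothetical `url` property configured with field tokenization (the property name and values are illustrative, not part of the demo collection above, and `collection` is a collection handle as in the earlier snippets):

from weaviate.classes.query import Filter

# Exact match on the whole field value
response = collection.query.fetch_objects(
    filters=Filter.by_property("url").equal("https://example.com/docs/page"),
)

# Partial match with a wildcard (LIKE) filter - flexible, but can be slow
response = collection.query.fetch_objects(
    filters=Filter.by_property("url").like("*example.com*"),
)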

Next, we'll discuss how tokenization impacts keyword searches.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.