Available tokenization options
Weaviate offers a variety of tokenization options to choose from. These options allow you to configure how keyword searches and filters are performed in Weaviate for each property.
The main options are:
- `word`: alphanumeric, lowercased tokens
- `lowercase`: lowercased tokens
- `whitespace`: whitespace-separated, case-sensitive tokens
- `field`: the entire value of the property is treated as a single token
Let's explore each of these options in more detail, including how they work and when you might want to use them.
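As a quick preview, here is a minimal sketch of how a tokenization method might be set per property using the Weaviate Python client (v4). The collection name ("Article"), the property names, and the local connection are illustrative assumptions, not a fixed recipe:

```python
import weaviate
import weaviate.classes.config as wc

# Assumes a locally running Weaviate instance
client = weaviate.connect_to_local()

client.collections.create(
    "Article",  # hypothetical collection
    properties=[
        # No tokenization specified: the default ("word") is used
        wc.Property(name="title", data_type=wc.DataType.TEXT),
        # Explicitly choose a tokenization method for this property
        wc.Property(
            name="url",
            data_type=wc.DataType.TEXT,
            tokenization=wc.Tokenization.FIELD,
        ),
    ],
)

client.close()
```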
Tokenization methods
word
The `word` tokenization method splits the text by any non-alphanumeric characters, and then lowercases each token.
Here are some examples of how the `word` tokenization method works:

| Text | Tokens |
|---|---|
| "Why, hello there!" | ["why", "hello", "there"] |
| "Lois & Clark: The New Adventures of Superman" | ["lois", "clark", "the", "new", "adventures", "of", "superman"] |
| "variable_name" | ["variable", "name"] |
| "Email: john.doe@example.com" | ["email", "john", "doe", "example", "com"] |
When to use `word` tokenization
The `word` tokenization is the default tokenization method in Weaviate.
Generally, if you are searching or filtering "typical" text data, `word` tokenization is a good starting point.
But if symbols (such as `&`, `@` or `_`) are important to your data and search, or if distinguishing between different cases is important, you may want to consider a different tokenization method such as `lowercase` or `whitespace`.
lowercase
The `lowercase` tokenization method splits the text by whitespace, and then lowercases each token.
Here are some examples of how the `lowercase` tokenization method works:

| Text | Tokens |
|---|---|
| "Why, hello there!" | ["why,", "hello", "there!"] |
| "Lois & Clark: The New Adventures of Superman" | ["lois", "&", "clark:", "the", "new", "adventures", "of", "superman"] |
| "variable_name" | ["variable_name"] |
| "Email: john.doe@example.com" | ["email:", "john.doe@example.com"] |
When to use `lowercase` tokenization
The `lowercase` tokenization can be thought of as `word`, but with symbols preserved. A key use case for `lowercase` is when symbols such as `&`, `@` or `_` are significant for your data.
This might include cases where your database contains code snippets, email addresses, or other symbolic notation with meaning.
As an example, consider filtering for objects containing "database_address":

| Text | Tokenization | Matched by "database_address" |
|---|---|---|
| "database_address" | word | ✅ |
| "database_address" | lowercase | ✅ |
| "database_company_address" | word | ✅ |
| "database_company_address" | lowercase | ❌ |
Note how the filtering behavior changes. A careful choice of tokenization method can ensure that the search results meet your expectations and those of your users.
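For reference, a filter like the one in the table above might look as follows with the Python client (v4), assuming a connected `client` as in the earlier sketch; the `Article` collection and `title` property are hypothetical. The filter value is tokenized with the property's configured tokenization method before matching:

```python
from weaviate.classes.query import Filter

articles = client.collections.get("Article")  # hypothetical collection

response = articles.query.fetch_objects(
    filters=Filter.by_property("title").equal("database_address"),
)

for obj in response.objects:
    print(obj.properties["title"])
```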
whitespace
The `whitespace` tokenization method splits the text by whitespace only, preserving the original case.
Here are some examples of how the `whitespace` tokenization method works:

| Text | Tokens |
|---|---|
| "Why, hello there!" | ["Why,", "hello", "there!"] |
| "Lois & Clark: The New Adventures of Superman" | ["Lois", "&", "Clark:", "The", "New", "Adventures", "of", "Superman"] |
| "variable_name" | ["variable_name"] |
| "Email: john.doe@example.com" | ["Email:", "john.doe@example.com"] |
When to use `whitespace` tokenization
The `whitespace` tokenization method adds case-sensitivity to `lowercase`. This is useful when your data distinguishes between cases, such as for names of entities or acronyms.
A risk of using `whitespace` tokenization is that it can be too strict. For example, a search for "superman" will not match "Superman", as the tokens are case-sensitive.
But this can be managed on a case-by-case basis. It is possible to construct case-insensitive queries, for example by having the query include multiple casings of the search term, as sketched below.
On the other hand, it is not possible to construct case-sensitive queries using `word` or `lowercase` tokenization.
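One hedged way to make a filter effectively case-insensitive over a `whitespace`-tokenized property is to cover multiple casings of the term explicitly. A sketch with the Python client (v4), again assuming a connected `client` and a hypothetical `Article` collection:

```python
from weaviate.classes.query import Filter

articles = client.collections.get("Article")  # hypothetical collection

term = "Superman"
# Match objects whose title contains any of the listed casings of the term
response = articles.query.fetch_objects(
    filters=Filter.by_property("title").contains_any([term, term.lower()]),
)
```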
field
The `field` tokenization method simply treats the entire value of the property as a single token.
Here are some examples of how the `field` tokenization method works:

| Text | Tokens |
|---|---|
| "Why, hello there!" | ["Why, hello there!"] |
| "Lois & Clark: The New Adventures of Superman" | ["Lois & Clark: The New Adventures of Superman"] |
| "variable_name" | ["variable_name"] |
| "Email: john.doe@example.com" | ["Email: john.doe@example.com"] |
When to use `field` tokenization
The `field` tokenization is useful when exact, whole-string matches are important. Typically, this suits properties that contain unique identifiers, such as email addresses, URLs, or other unique strings.
Generally, `field` tokenization should be used judiciously due to its strictness.
For keyword searches, `field` tokenization has limited use. A keyword search for "computer mouse" will not match "mouse for a computer", nor will it match "computer mouse pad" or even "a computer mouse".
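A typical use of `field` tokenization is therefore exact matching on identifier-like properties. Here is a sketch with the Python client (v4), assuming a connected `client` and a hypothetical `User` collection whose `email` property uses `field` tokenization:

```python
from weaviate.classes.query import Filter

users = client.collections.get("User")  # hypothetical collection

# With field tokenization, only the exact, whole value matches
response = users.query.fetch_objects(
    filters=Filter.by_property("email").equal("john.doe@example.com"),
)
```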
Stop words
Weaviate supports stop words. Stop words are common words which are often filtered out from search queries because they occur frequently and do not carry much meaning.
By default, Weaviate uses a list of English stop words. You can configure your own list of stop words in the schema definition.
This means that after tokenization, any stop words in the text behave as if they were not present. For example, a filter for "a computer mouse" will behave identically to a filter for "computer mouse".
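Stop words are configured on the collection's inverted index. The exact options vary by client version, but a sketch with the Python client (v4) might look like this; the preset and the added/removed words are purely illustrative, and a connected `client` is assumed as before:

```python
import weaviate.classes.config as wc

client.collections.create(
    "Article",  # hypothetical collection
    inverted_index_config=wc.Configure.inverted_index(
        stopwords_preset=wc.StopwordsPreset.EN,  # default English stop word list
        stopwords_additions=["hello"],           # also treat "hello" as a stop word
        stopwords_removals=["a"],                # keep "a" as a searchable token
    ),
    properties=[
        wc.Property(name="title", data_type=wc.DataType.TEXT),
    ],
)
```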
Language-specific tokenization
The above tokenization methods work well for English, or other languages that use spaces to separate words.
However, not all languages rely on spaces to define natural semantic boundaries. For languages like Japanese, Chinese or Korean, where words are not separated by spaces, you may need to use a different tokenization method.
Weaviate provides `gse` and `trigram` (from `v1.24`), and `kagome_kr` (from `v1.25.7`) tokenization methods for this reason.
`gse` implements the "Jieba" algorithm, which is a popular Chinese text segmentation algorithm. `trigram` splits text into all possible trigrams, which can be useful for languages like Japanese.
`kagome_kr` uses the Kagome tokenizer with a Korean MeCab (mecab-ko-dic) dictionary to split the property text. This is useful for Korean text.
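These methods are assigned per property in the same way as the earlier options. Below is a sketch with the Python client (v4); the collection and property names are hypothetical, and some of these tokenizers may need to be explicitly enabled in your Weaviate server configuration:

```python
import weaviate.classes.config as wc

client.collections.create(
    "MultilingualArticle",  # hypothetical collection
    properties=[
        wc.Property(
            name="title_zh",
            data_type=wc.DataType.TEXT,
            tokenization=wc.Tokenization.GSE,        # Chinese ("Jieba"-style segmentation)
        ),
        wc.Property(
            name="title_ja",
            data_type=wc.DataType.TEXT,
            tokenization=wc.Tokenization.TRIGRAM,    # e.g. Japanese
        ),
        wc.Property(
            name="title_ko",
            data_type=wc.DataType.TEXT,
            tokenization=wc.Tokenization.KAGOME_KR,  # Korean (Weaviate v1.25.7+)
        ),
    ],
)
```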
Questions and feedback
If you have any questions or feedback, let us know in the user forum.