275 (Keyword) Tokenization

Course overview

Pre-requisites

This course is self-contained. However, we recommend that you first complete one of the 101-level courses, such as those on working with text, your own vectors, or multimodal data.

This course will introduce you to tokenization and how it relates to Weaviate. Specifically, it will discuss what tokenization is, how it relates to search, and how to configure it.

Note that tokenization applies to keyword search and filtering, and also to language models.

This course focuses on the keyword aspect, but will briefly discuss how tokenization impacts language models.
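
To make the keyword aspect concrete, here is a minimal sketch in plain Python of how the choice of tokenization can change whether a query matches a piece of text. The functions `whitespace_tokens`, `word_tokens`, and `matches` are hypothetical names introduced for illustration only. They do not use the Weaviate client and are not Weaviate's implementation; they only assume two simple splitting rules.

```python
import re
from typing import Callable

Tokenizer = Callable[[str], list[str]]


def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()


def word_tokens(text: str) -> list[str]:
    """Lowercase the text and keep only runs of alphanumeric characters."""
    # [^\W_]+ matches word characters except the underscore.
    return re.findall(r"[^\W_]+", text.lower())


def matches(query: str, document: str, tokenizer: Tokenizer) -> bool:
    """Return True if any query token equals a document token exactly."""
    document_tokens = set(tokenizer(document))
    return any(token in document_tokens for token in tokenizer(query))


raw = "Hello, (beautiful) world"

print(whitespace_tokens(raw))  # ['Hello,', '(beautiful)', 'world']
print(word_tokens(raw))        # ['hello', 'beautiful', 'world']

# The same query matches under one rule and not under the other.
print(matches("hello", raw, word_tokens))        # True
print(matches("hello", raw, whitespace_tokens))  # False
```

Under the whitespace rule, the token `Hello,` keeps its capital letter and trailing comma, so the query `hello` does not match it. This is the kind of effect the practical units on filters and searches explore.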

Learning objectives

Here, we will cover:

Learning Goals
  • What tokenization is, and why it is required.

By the time you are finished, you will be able to:

Learning Outcomes
  • Distinguish tokenized text from raw text.
  • Name different tokenization options in Weaviate.
  • Select an appropriate tokenization option for a given use case.
  • Name languages for which specific tokenization options are available.

Units

1. Overview of tokenization

Theory

What is tokenization, and why is it important?

2. Available tokenization options

Theory

What tokenization options are available in Weaviate?

3. Tokenization and filters

Practical

See how tokenization impacts filters.

4. Tokenization and searches

Practical

See how tokenization impacts searches.