This is a preview version of this unit, so some sections, such as videos and quiz questions, are not yet complete. Please check back later for the full version, and in the meantime, feel free to provide any feedback through the comments below.
Chunking is an important concept in the world of vector databases and language models. Although we've looked at relatively small pieces of text in previous units, real-world text data can be much longer.
Think about the lengths of articles, transcripts, or even books. Instead of a few words, these texts can be thousands or tens of thousands of words long, if not longer. The Lord of the Rings, for example, is over 500,000 words long!
Chunking splits texts like these into smaller pieces of text, i.e. "chunks", before storing them in a vector database or passing them to a language model.
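As a rough illustration of the idea, here is a minimal sketch of one simple way to chunk text: splitting it into pieces of a fixed number of words. The function name `chunk_by_words` and the `chunk_size` parameter are made up for this example and are not part of any library. It uses only the Python standard library.

```python
def chunk_by_words(text: str, chunk_size: int = 100) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words.

    Hypothetical helper for illustration only; not a Weaviate API.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    # Split on any whitespace, then regroup the words into fixed-size windows.
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


sample = (
    "Chunking splits long texts into smaller pieces "
    "before they are stored or passed to a model"
)
chunks = chunk_by_words(sample, chunk_size=5)
for n, chunk in enumerate(chunks):
    print(n, chunk)
```

With a 16-word sample and a chunk size of 5, this produces four chunks, the last of which holds only the leftover word. Where the boundaries fall (here, simply every five words) is exactly the kind of decision a chunking strategy makes.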
This can seem relatively innocuous at first, like deciding where to split a sentence or a paragraph in two. But the choice of chunking strategy can significantly affect the search performance and behavior of vector databases, as well as the outputs of language models.
This unit covers this seemingly simple but nuanced topic from the perspective of a user. We will begin by covering what chunking is, and why it is used.
We will then move on to cover various chunking methods before discussing key considerations when deciding on a chunking strategy, as well as some suggested starting points.
By the end of this unit, you will have a good understanding of chunking in general, and be able to implement some solid chunking strategies based on your actual needs.
- (Required) A Python (3) environment with
- (Required) Complete 101A Weaviate Academy Preparation
- (Recommended) Complete Hello, Weaviate
- (Recommended) Complete Queries 1
- (Recommended) Complete Schema and Imports
What are these?
- Learning Goals describe the unit's key topics and ideas.
- Learning Outcomes, on the other hand, describe tangible skills that the learner should be able to demonstrate.
Learning Goals
Here, we will cover:
- What chunking is
- Its role in vector search and generative search
- Various chunking methods
- Key considerations and suggested starting points
Learning Outcomes
By the time you are finished, you will be able to:
- Describe chunking at a high level
- Explain the impact of chunking in vector search and retrieval augmented generation
- Implement various chunking methods and know where to explore others
- Evaluate chunking strategies based on your needs