Chunking long texts

Unit overview

Preview unit

This is a preview version of this unit, so some sections, such as videos and quiz questions, are not yet complete. Please check back later for the full version. In the meantime, feel free to provide any feedback through the comments below.


Chunking is an important concept in the world of vector databases and language models. Although we've looked at relatively small pieces of text in previous units, real-world text data can be much longer.

Think about the lengths of articles, transcripts, or even books. Instead of a few words, these texts can be thousands or tens of thousands of words long, if not longer. The Lord of the Rings, for example, is over 500,000 words long!

Chunking splits texts like these into smaller pieces of text, i.e. "chunks", before storing them in a vector database or passing them to a language model.
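To make the idea concrete, here is a minimal sketch of one very simple way to chunk text: splitting it into pieces of a fixed number of words. The function name `chunk_by_words` and the `chunk_size` parameter are hypothetical and chosen only for illustration. This is not necessarily the method used later in this unit, and practical strategies may differ.

```python
def chunk_by_words(text: str, chunk_size: int = 100) -> list[str]:
    """Split text into chunks of at most `chunk_size` words each.

    Illustrative only: words are separated on whitespace, and the last
    chunk may be shorter than `chunk_size`.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    words = text.split()
    # Step through the word list in strides of chunk_size and re-join
    # each slice into a single string.
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


sample = "one two three four five six seven"
print(chunk_by_words(sample, chunk_size=3))
# prints ['one two three', 'four five six', 'seven']
```

Each resulting string would then be stored or passed along as a separate chunk. Note that a fixed word count ignores sentence and paragraph boundaries, which is one reason the choice of chunking strategy matters.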

This can seem relatively innocuous at first, like deciding where to split a sentence or a paragraph in two. But the choice of chunking strategy can have a significant impact on the final results.

Although this may not be obvious at first glance, chunking decisions can significantly impact the search performance and behavior of vector databases, as well as the outputs of language models. This unit covers this seemingly simple but nuanced topic from the perspective of a user. We will begin by covering what chunking is and why it is used.

We will then move on to cover various chunking methods before discussing key considerations when deciding on a chunking strategy, as well as some suggested starting points.

By the end of this unit, you will have a good understanding of chunking in general, and be able to implement some solid chunking strategies based on your actual needs.

Prerequisites

Learning objectives

  What are these?
  Each unit includes a set of Learning Goals and Learning Outcomes which form the unit's guiding principles.
  • Learning Goals describe the unit's key topics and ideas.
  • Learning Outcomes, on the other hand, describe tangible skills that the learner should be able to demonstrate.

  Here, we will cover:

Learning Goals
  • What chunking is
  • Its role in vector search and generative search
  • Various chunking methods
  • Key considerations and suggested starting points

  By the time you are finished, you will be able to:

Learning Outcomes
  • Describe chunking at a high level
  • Explain the impact of chunking in vector search and retrieval-augmented generation
  • Implement various chunking methods and know where to explore others
  • Evaluate chunking strategies based on your needs

Questions and feedback

If you have any questions or feedback, please let us know on our forum. For example, you can: