Skip to main content

A brief introduction to chunking

Preview unit

This is a preview version of this unit. So some sections are not yet complete - such as videos and quiz questions. Please check back later for the full version, and in the meantime, feel free to provide any feedback through the comments below.

What is chunking?

Chunking is the pre-processing step of splitting texts into smaller pieces of texts, i.e. "chunks".

You know by now that a vector database stores objects by corresponding vectors to capture their meaning. But just how much text does each vector capture the meaning of? Chunking defines this. Each chunk is the unit of information that is vectorized and stored in the database.

Consider a case where the source text comprises a set of books. A chunking method could conceivably split the text into a set of chapters, paragraphs, sentences, or even words, into chunks.

While this is a simple concept, it can have a significant impact on the performance of vector databases, and outputs of language models.

Why chunk data?

For information retrieval

Let's go back to the above example, building a vector database from a set of books.

At one extreme, you could catalog each book as one vector. This would build a database that's similar to a library catalog.

It would be good for finding books that are most similar to the query string. But because each vector captures a book, it is unlikely to be so useful for more granular tasks - like finding specific information within a book.

At the other extreme, you could catalog each sentence as a vector. This would build a database that's similar to a thesaurus - albeit, at a sentence level. This would be good for finding specific concepts or information conveyed by the writer. But it might not work so well for finding broader information, such as the idea conveyed by a book, or even a chapter.

Whether to choose one or the other approach, or a third approach in-between, depends on your use case. And we'll talk more about some rules of thumb and key considerations later on.

But the key point for now is that chunking defines the unit of information that is stored in the database, and therefore each unit of information to be retrieved.

As we will see later on, this has implications not only search, but also retrieval augmented generation (RAG) use case downstream.

To meet model requirements

Another reason to chunk data is to help meet the requirements of the language model used.

These models typically have a finite "window" of text input lengths, and source texts often exceed this length. Remember that the Lord of the Rings, for example, is over 500,000 words long!

A typical "context window" of these models are in the order of thousands of "tokens" (words, parts of words, punctuation, etc.). This means that the text input to the model must be split into chunks of this size or smaller.

For optimal retrieval augmented generation (RAG)

Chunking is also important for optimizing retrieval augmented generation, or RAG. (If you need a refresher on what RAG is, you can review this module.)

In short, RAG allows you to ground large language models (LLMs) by providing retrieved data from a database along with a prompt. This in turn can prevent the model from generating factually incorrect information due to outdated or missing data.

So why does chunking affect RAG? This is because LLMs currently have a finite maximum input size, called the context window. As a result, the chunk size defines how many chunks can be included in the context window. This in turn defines how many different places the LLM can retrieve information from, and how much information is in each object.

Let's consider what happens when you use a chunk size that is too small or too large.

(Too) Small chunks

Using short chunks, you can add information from more chunks to the LLM. However, it may lead to insufficient contextual information being passed on in each result to the LLM.

Imagine passing the following as a chunk to the LLM: In the dense areas, most of the concentration is via medium- and high-rise buildings.. This tells a lot about the nature of this area, but without further context it's not useful to the LLM. Where is this sentence about? Why are we talking about the density anyway? As a human reader, it's not clear at all to us, and it would be the same for the LLM, which would have to guess at these answers.

Contrast that with instead passing: Wikipedia: London: Architecture: In the dense areas, most of the concentration is via medium- and high-rise buildings. London's skyscrapers, such as 30 St Mary Axe (dubbed "The Gherkin"), Tower 42, the Broadgate Tower and One Canada Square, are mostly in the two financial districts, the City of London and Canary Wharf. as a chunk.

This includes more contextual information to the LLM, such as the source, article title, section title, and additional sentences. It is far clearer to us, and would be to the LLM as well.

(Too) Large chunks

Of course, on the other hand, using larger chunks would mean fewer chunks would fit into the context window of the LLM, or incur additional costs. And it may increase the amount of irrelevant information in the data.

Taking this to the logical conclusion, imagine you could only pass one contiguous passage of text to the LLM. It would be like being asked to write an essay based only on one section of a book.

Either extremes are not ideal, and the trick is to find a balance that works for you.

Chunk size selection

As you can start to see, there are multiple factors at play to help you choose the right chunk size.

Unfortunately, there isn't a chunk size or chunking technique that works for everybody. The trick here is to find a size that works for you - one that isn't too small or too large, and also chunked with a method that suits you.

In the next unit, we'll begin to review these ideas, starting with some common chunking techniques.

Questions and feedback

If you have any questions or feedback, please let us know on our forum. For example, you can: