
Chunking techniques - 1

Preview unit

This is a preview version of this unit, so some sections - such as videos and quiz questions - are not yet complete. Please check back later for the full version, and in the meantime, feel free to provide any feedback through the comments below.

Overview

Now that you've learned what chunking is and why it is important, you are ready to explore practical chunking techniques. Here, we start with fixed-size chunking, including some example implementations.

Fixed-size chunking

As the name suggests, fixed-size chunking refers to the process of splitting texts into chunks of a fixed size, or at least based on size. Using fixed-size chunking, you might split an article into a set of chunks of 100 words each, or chunks of 200 characters each.

This may be the most common chunking technique due to its simplicity and effectiveness.

Implementations

Fixed-size chunking is implemented by splitting texts into chunks of a fixed number of units. The units may be words, characters, or even tokens, and the number of units per chunk is fixed (up to a maximum), with an optional overlap.

What is a token?

A "token" in this context is a unit of text that a model processes by substituting it with a number. In modern transformer models, a token is commonly a "subword" unit composed of a few characters.
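To make this more concrete, here is a minimal sketch of what tokenization produces, using the tiktoken library and its cl100k_base encoding as an example (these are one choice made purely for illustration; any subword tokenizer shows the same idea):

import tiktoken

# Load a common subword encoding (assumes tiktoken is installed)
encoding = tiktoken.get_encoding("cl100k_base")

token_ids = encoding.encode("Chunking is important.")
token_strings = [encoding.decode([t]) for t in token_ids]

print(token_ids)      # The numbers the model actually sees
print(token_strings)  # The subword pieces they correspond to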

One pseudocode implementation of fixed-size chunking is:

# Given a text of length L
# Split the text into chunks of size N units (e.g. tokens, characters, words)
# Optionally, add an overlap of M units at the beginning or end of each chunk (from the previous or next chunk)
# This should typically result in a list of about L // N + 1 chunks

Implemented in Python, it may look like this:

from typing import List

# Split the text into units (words, in this case)
def word_splitter(source_text: str) -> List[str]:
    import re
    source_text = re.sub(r"\s+", " ", source_text)  # Replace multiple whitespaces with a single space
    return re.split(r"\s", source_text)  # Split by single whitespace

def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    text_words = word_splitter(text)
    chunks = []
    for i in range(0, len(text_words), chunk_size):
        chunk_words = text_words[i: i + chunk_size]
        chunk = " ".join(chunk_words)
        chunks.append(chunk)
    return chunks
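As a quick, hypothetical sanity check, running this on a short string might look like this:

sample_text = "Fixed-size chunking splits a text into chunks of a set number of units such as words"
for chunk in get_chunks_fixed_size(sample_text, chunk_size=5):
    print(chunk)
# -> "Fixed-size chunking splits a text"
# -> "into chunks of a set"
# -> "number of units such as"
# -> "words"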

The get_chunks_fixed_size function can be modified to include an overlap (in this case, at the beginning of each chunk):

from typing import List

# Split the text into units (words, in this case)
def word_splitter(source_text: str) -> List[str]:
    import re
    source_text = re.sub(r"\s+", " ", source_text)  # Replace multiple whitespaces with a single space
    return re.split(r"\s", source_text)  # Split by single whitespace

def get_chunks_fixed_size_with_overlap(text: str, chunk_size: int, overlap_fraction: float) -> List[str]:
    text_words = word_splitter(text)
    overlap_int = int(chunk_size * overlap_fraction)
    chunks = []
    for i in range(0, len(text_words), chunk_size):
        chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]
        chunk = " ".join(chunk_words)
        chunks.append(chunk)
    return chunks
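To see the effect of the overlap, here is a small, hypothetical example with a chunk size of 5 words and a 20% overlap, so each chunk after the first begins with the last word of the previous chunk:

sample_text = " ".join(f"word{i}" for i in range(1, 13))  # "word1 word2 ... word12"
for chunk in get_chunks_fixed_size_with_overlap(sample_text, chunk_size=5, overlap_fraction=0.2):
    print(chunk)
# -> "word1 word2 word3 word4 word5"
# -> "word5 word6 word7 word8 word9 word10"
# -> "word10 word11 word12"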

This is far from the only way to implement fixed-size chunking, but it is one possible, relatively simple, implementation.

Exercise

Consider how you might implement fixed-size chunking. What would your pseudocode (or code) look like?

Examples

We are now ready to look at some concrete examples of fixed-size chunking. Let's review three examples, with chunk sizes of 5 words, 25 words, and 100 words, respectively.

We'll use an excerpt from the Pro Git book*. More specifically, we'll use the text of the What is Git? chapter.

Here is one example using our chunking function from above:

from typing import List

# Get source data
import requests

url = "https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/what-is-git.asc"
source_text = requests.get(url).text

# Chunk text by number of words
for chosen_size in [5, 25, 100]:
    chunks = get_chunks_fixed_size_with_overlap(source_text, chosen_size, overlap_fraction=0.2)
    # Print outputs to screen
    print(f"\nSize {chosen_size} - {len(chunks)} chunks returned.")
    for i in range(3):
        print(f"Chunk {i+1}: {chunks[i]}")

This will result in outputs like these. Take a look at the first few chunks at each size - what do you notice?

Exercise

Consider which of these chunk sizes would be most appropriate for search. Why do you think so? What are the tradeoffs?

Size 5 - 281 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git?
Chunk 2: Git? So, what is Git in
Chunk 3: in a nutshell? This is an

Hopefully, these concrete examples start to illustrate some of the ideas that we discussed above.

Immediately, it strikes me that the smaller chunks are very granular, to the point where they may not contain enough information to be useful for search. On the other hand, the larger chunks begin to retain more information as they get to lengths that are similar to a typical paragraph.

Now imagine these chunks becoming even longer. As chunks become longer, the corresponding vector embeddings would start to become more general. This would eventually reach a point where they cease to be useful in terms of searching for information.

What about character or sub-word tokenization?

At these sizes, you typically will not need to employ character-based or sub-word token-based chunking, as splitting a group of words at those finer boundaries will not usually be meaningful.
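That said, if you did want to chunk by sub-word tokens rather than words, the same loop applies to token IDs. Here is a minimal sketch, assuming the tiktoken library (the encoding name is one choice among many, and this function is not used elsewhere in this unit):

from typing import List

import tiktoken

def get_chunks_by_tokens(text: str, chunk_size: int) -> List[str]:
    # Encode the text into sub-word token IDs, chunk the IDs, then decode each chunk back to text
    encoding = tiktoken.get_encoding("cl100k_base")
    token_ids = encoding.encode(text)
    chunks = []
    for i in range(0, len(token_ids), chunk_size):
        chunks.append(encoding.decode(token_ids[i: i + chunk_size]))
    return chunks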

Where to start?

For search with fixed-size chunks, if you don't have any other factors, try a size of around 100-200 words, and a 20% overlap.
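Using the function defined earlier, that starting point might look like this (150 words is simply one value within the suggested range):

chunks = get_chunks_fixed_size_with_overlap(source_text, chunk_size=150, overlap_fraction=0.2)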

Notes

* Pro Git by Scott Chacon and Ben Straub - Book License

Questions and feedback

If you have any questions or feedback, let us know in the user forum.