Chunking techniques - 1
This is a preview version of this unit, so some sections, such as videos and quiz questions, are not yet complete. Please check back later for the full version, and in the meantime, feel free to provide any feedback through the comments below.
Overview
Now that you've learned about what chunking is, and why it is important, you are ready to start looking at practical chunking techniques. Here, we start by looking at fixed-size chunking techniques, including some example implementations.
Fixed-size chunking
As the name suggests, fixed-size chunking refers to splitting texts into chunks of a fixed size, or at least into chunks whose boundaries are determined by size. Using fixed-size chunking, you might split an article into chunks of 100 words each, or chunks of 200 characters each.
This may be the most common chunking technique due to its simplicity and effectiveness.
Implementations
Fixed-size chunking is implemented by splitting texts into chunks of a fixed number of units. The units may be composed of words, characters, or even tokens, and the number of units per chunk is fixed (to a maximum), with an optional overlap.
A "token" in this context is a unit of text that will be processed by a model by being substituted with a number. In modern transformer models, a token is commonly a "subword" unit composed of a few characters.
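As a rough illustration of tokens being substituted with numbers, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are hypothetical and made up for this example; real models use much larger, learned vocabularies and their own tokenization algorithms.

```python
from typing import Dict, List

# Hypothetical toy vocabulary: not taken from any real model
TOY_VOCAB: Dict[str, int] = {"chunk": 0, "ing": 1, "token": 2, "s": 3, " ": 4}

def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece first, then progressively shorter ones
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"No token covers the text at position {pos}")
    return ids

# "chunking tokens" -> "chunk" + "ing" + " " + "token" + "s"
print(toy_tokenize("chunking tokens", TOY_VOCAB))  # [0, 1, 4, 2, 3]
```

Note how each word is broken into subword pieces, and each piece is replaced by its number in the vocabulary.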
One pseudocode implementation of fixed-size chunking is:
# Given a text of length L
# Split the text into chunks of size N units (e.g. tokens, characters, words)
# Optionally, add an overlap of M units at the beginning or end of each chunk (from the previous or next chunk)
# This should typically result in a list of ceil(L / N) chunks (that is, L // N + 1 when N does not divide L evenly)
And implementing in Python, it may look like:
- Python
from typing import List
import re

# Split the text into units (words, in this case)
def word_splitter(source_text: str) -> List[str]:
    source_text = re.sub(r"\s+", " ", source_text)  # Collapse runs of whitespace into a single space
    return re.split(r"\s", source_text)  # Split by single whitespace

def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    text_words = word_splitter(text)
    chunks = []
    for i in range(0, len(text_words), chunk_size):
        chunk_words = text_words[i: i + chunk_size]  # Take the next chunk_size words
        chunk = " ".join(chunk_words)
        chunks.append(chunk)
    return chunks
The chunking function can be modified to include an overlap (in this case, at the beginning of each chunk):
- Python
from typing import List
import re

# Split the text into units (words, in this case)
def word_splitter(source_text: str) -> List[str]:
    source_text = re.sub(r"\s+", " ", source_text)  # Collapse runs of whitespace into a single space
    return re.split(r"\s", source_text)  # Split by single whitespace

def get_chunks_fixed_size_with_overlap(text: str, chunk_size: int, overlap_fraction: float) -> List[str]:
    text_words = word_splitter(text)
    overlap_int = int(chunk_size * overlap_fraction)  # Number of words shared with the previous chunk
    chunks = []
    for i in range(0, len(text_words), chunk_size):
        # Start overlap_int words early, but never before the start of the text
        chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]
        chunk = " ".join(chunk_words)
        chunks.append(chunk)
    return chunks
This is far from the only way to implement fixed-size chunking, but it is one possible, relatively simple, implementation.
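As one example of an alternative, here is a minimal sketch of a sliding-window variant. This is not the implementation used in this unit: the function name `get_chunks_sliding_window` and its `overlap` parameter (a word count rather than a fraction) are assumptions made for illustration. Here, the window advances by `chunk_size - overlap` words at a time, so no chunk exceeds `chunk_size` words.

```python
from typing import List

def get_chunks_sliding_window(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Hypothetical variant: each window holds at most chunk_size words,
    and consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()  # Split on any run of whitespace
    stride = chunk_size - overlap  # How far the window advances at each step
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # This window already reached the end of the text
    return chunks

sample = "one two three four five six seven eight nine ten"
print(get_chunks_sliding_window(sample, chunk_size=4, overlap=1))
# ['one two three four', 'four five six seven', 'seven eight nine ten']
```

The tradeoff is that, for the same `chunk_size`, this variant produces more chunks than the version above, whose chunks after the first grow to `chunk_size` plus the overlap.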
Consider how you might implement fixed-size chunking. What would your pseudocode (or code) look like?
Examples
We are ready to look at some concrete examples of fixed-size chunking. Let's take a look at three examples, with a chunk size of 5 words, 25 words and 100 words, respectively.
We'll use an excerpt from the Pro Git book*. More specifically, we'll use the text of the "What is Git?" chapter.
Here is one example using our chunking function from above:
- Python
from typing import List

# Get source data
import requests

url = "https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/what-is-git.asc"
source_text = requests.get(url).text

# Chunk text by number of words
for chosen_size in [5, 25, 100]:
    chunks = get_chunks_fixed_size_with_overlap(source_text, chosen_size, overlap_fraction=0.2)
    # Print outputs to screen
    print(f"\nSize {chosen_size} - {len(chunks)} chunks returned.")
    for i in range(3):
        print(f"Chunk {i+1}: {chunks[i]}")
This will result in outputs like these. Take a look at the first few chunks at each size - what do you notice?
Consider which of these chunk sizes would be most appropriate for search. Why do you think so? What are the tradeoffs?
Size 5 - 281 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git?
Chunk 2: Git? So, what is Git in
Chunk 3: in a nutshell? This is an
Size 25 - 57 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git? So, what is Git in a nutshell? This is an important section to absorb, because if you understand what Git
Chunk 2: if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you. As you learn Git, try to
Chunk 3: you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid
Size 100 - 15 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git? So, what is Git in a nutshell? This is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you. As you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid subtle confusion when using the tool. Even though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in
Chunk 2: tool. Even though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in a very different way, and understanding these differences will help you avoid becoming confused while using it.(((Subversion)))(((Perforce))) ==== Snapshots, Not Differences The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These other systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they store as a set of files and the changes made to each file over time (this is commonly described as _delta-based_ version control). .Storing data as changes to a base version
Chunk 3: each file over time (this is commonly described as _delta-based_ version control). .Storing data as changes to a base version of each file image::images/deltas.png[Storing data as changes to a base version of each file] Git doesn't think of or store its data this way. Instead, Git thinks of its data more like a series of snapshots of a miniature filesystem. With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn't store the file again, just a link to the previous identical file it has already
Hopefully, these concrete examples start to illustrate some of the ideas that we discussed above.
Immediately, it strikes me that the smaller chunks are very granular, to the point where they may not contain enough information to be useful for search. On the other hand, the larger chunks begin to retain more information as they get to lengths that are similar to a typical paragraph.
Now imagine these chunks becoming even longer. As chunks become longer, the corresponding vector embeddings would start to become more general. This would eventually reach a point where they cease to be useful in terms of searching for information.
At these sizes, you typically will not need character-based or subword token-based chunking, because a chunk already spans many words and cutting it in the middle of a word is rarely meaningful.
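To see what splitting inside a word looks like, here is a minimal sketch of character-based chunking applied to a fragment of the excerpt. The function name `get_chunks_by_characters` is an assumption made for this illustration and does not appear elsewhere in this unit.

```python
from typing import List

def get_chunks_by_characters(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "Git thinks of its data more like a series of snapshots"
char_chunks = get_chunks_by_characters(sample, chunk_size=20)
for chunk in char_chunks:
    print(repr(chunk))
# 'Git thinks of its da'
# 'ta more like a serie'
# 's of snapshots'
```

The words "data" and "series" are each cut across two chunks, so neither chunk contains the complete word.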
For search with fixed-size chunks, if you have no other constraints to consider, try a size of around 100-200 words with a 20% overlap as a starting point.
Notes
Questions and feedback
If you have any questions or feedback, let us know in the user forum.