We have covered a lot of ground on chunking in this unit already.
You saw what chunking is, learned about different chunking methods, and dove into example implementations to see their impact.
In this section, we will step back from the detailed, micro view to the high-level, macro view, putting everything we've learned into context. More specifically, we will look at what to consider when chunking data, and what it means for your Weaviate implementation.
As you have seen, there are many different ways to chunk data. But which one is right for you?
The answer is, as always, "it depends". But here are some things to consider when choosing a chunking method:
Text per search result
How much text should each "hit" in your search results contain? Is it a sentence, or a paragraph, or something else?
A natural fit would be to chunk the data into the same size as the desired search result.
Input query length
Consider what a typical input query might look like. Will it be short search strings, or longer texts, such as those extracted from a document?
Keep in mind that the vector of the query will be compared to the vector of the chunks. So, it may be helpful to have shorter chunks for shorter queries, and longer chunks for longer queries.
In cases where shorter chunks are used but further context would be beneficial, you could structure your app so that you return the chunk that contains the search term, and the surrounding chunks.
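As a minimal sketch of that idea, assuming your chunks are stored in document order with a sequential index, returning a hit together with its neighbors could look like this (the function name and `window` parameter are illustrative, not a Weaviate API):

```python
def with_surrounding_chunks(chunks, hit_index, window=1):
    """Return the matching chunk plus up to `window` chunks on each side.

    `chunks` is assumed to be a list of chunk texts in document order,
    and `hit_index` the position of the chunk that matched the search.
    """
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return chunks[start:end]

# Example: the third chunk matched, so we return it with its neighbors.
chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
print(with_surrounding_chunks(chunks, 2))  # ['chunk B', 'chunk C', 'chunk D']
```

In Weaviate, you could implement this by storing the chunk index as a property and filtering on neighboring index values in a follow-up query.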
Chunk size also affects database size: the larger the chunks, the fewer chunks there will be, and the smaller the database. This may be important if you are working with a large dataset.
You will also need to ensure that the chunk size is within the model's maximum input size (context window). This applies when generating embeddings, as well as for retrieval augmented generation (RAG).
As discussed earlier, shorter chunks will make it easier to include many chunks from a variety of sources, but may not provide enough context. Longer chunks will provide more context, but may not be able to include as many chunks.
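A rough guard against oversized chunks might look like the sketch below. Note that the limit and the words-to-tokens ratio here are assumptions for illustration; real models count tokens, not words, so for a precise check you should use the tokenizer that matches your embedding model.

```python
MAX_TOKENS = 8191  # Illustrative limit only; check your embedding model's docs.

def is_within_limit(chunk: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Roughly check that a chunk fits in the model's context window.

    Approximates the token count as ~1.3 tokens per word. This is a
    heuristic only; a model-specific tokenizer gives the exact count.
    """
    approx_tokens = int(len(chunk.split()) * 1.3)
    return approx_tokens <= max_tokens

print(is_within_limit("A short paragraph-sized chunk."))  # True
```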
Rule of thumb
Having said all that, it may be helpful to have a rule of thumb to start with. We suggest starting with a chunk size of 100-150 words and going from there.
Then, you can modify the chunk size based on the considerations above, and your observations on your app's performance.
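To make the rule of thumb concrete, here is a minimal fixed-size chunker along the lines of the implementations seen earlier in this unit. The default of 120 words sits within the suggested 100-150 word starting range:

```python
def chunk_by_words(text: str, chunk_size: int = 120) -> list[str]:
    """Split `text` into chunks of roughly `chunk_size` words each.

    The final chunk may be shorter. Tune `chunk_size` based on the
    considerations above and your app's observed performance.
    """
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

sample = "lorem " * 300  # 300 words
print(len(chunk_by_words(sample)))  # 3 chunks (120 + 120 + 60 words)
```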
By definition, chunking your source data will mean creating multiple objects out of one source.
Accordingly, you should consider how to model your data to capture the relationships between the chunks and the source data, as well as between chunks. This may help you to efficiently retrieve what you need, such as the metadata relating to the source, or surrounding chunks.
Collection definition examples
Consider a Weaviate database designed to store data from a library of reference books.
Storing each book as a vector may still be too large, so you may want to chunk the books into paragraphs. Having done so, you may want to create a `Book` collection and a `Paragraph` collection, with the `Paragraph` collection having the cross-reference property `fromBook`. This will allow you to retrieve the book metadata from the `Book` collection, and the surrounding paragraphs from the `Paragraph` collection.
So, for example, you may build a `Book` collection like this (the `title` property shown is illustrative):

```json
{
    "class": "Book",
    "properties": [
        {
            "name": "title",
            "dataType": ["text"]
        },
        // ... other class properties
    ],
    "vectorizer": "none",
    // ... other class attributes
}
```
And add a `Paragraph` collection like this, that references the `Book` collection (again, the property names other than `fromBook` are illustrative):

```json
{
    "class": "Paragraph",
    "properties": [
        {
            "name": "body",
            "dataType": ["text"]
        },
        {
            "name": "fromBook",
            "dataType": ["Book"]
        },
        // ... other class properties
    ],
    // ... other class attributes (e.g. vectorizer)
}
```
Note that in this configuration, the `Book` collection is not vectorized, but the `Paragraph` collection is. This allows the `Book` collection to be used for storage and retrieval of metadata, while the `Paragraph` collection is used for search.
This is just one example of how you could model your data. You may want to experiment with different configurations to see what works best for your use case.