
Considerations & suggestions

Preview unit

This is a preview version of this unit, so some sections - such as videos and quiz questions - are not yet complete. Please check back later for the full version, and in the meantime, feel free to provide any feedback through the comments below.

Overview

We have covered a lot of ground on chunking in this unit already.

You saw what chunking is, learned about different chunking methods, and dove into example implementations to see their impact.

In this section, we will step back from the detailed, micro view to the high-level, macro view, applying what we've learned in context. More specifically, we will look at some considerations to keep in mind when chunking data, and what they mean for your Weaviate implementation.

Considerations

As you have seen, there are many different ways to chunk data. But which one is right for you?

The answer is, as always, "it depends". But here are some things to consider when choosing a chunking method:

Text per search result

How much text should each "hit" in your search results contain? Is it a sentence, or a paragraph, or something else?

A natural fit would be to chunk the data into the same size as the desired search result.

Input query length

Consider what a typical input query might look like. Will it be short search strings, or longer texts, such as those extracted from a document?

Keep in mind that the vector of the query will be compared to the vector of the chunks. So, it may be helpful to have shorter chunks for shorter queries, and longer chunks for longer queries.

In cases where shorter chunks are used but further context would be beneficial, you could structure your app so that you return the chunk that contains the search term, and the surrounding chunks.
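As a minimal sketch of this idea, the helper below (a hypothetical function, not part of any Weaviate API) takes an ordered list of chunks and the index of the search hit, and returns the hit together with its neighbors:

```python
def with_neighbors(chunks, hit_index, window=1):
    """Return the matched chunk together with its surrounding chunks.

    `chunks` is assumed to be an ordered list of chunk texts, and
    `hit_index` the position of the search hit within that list.
    """
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return chunks[start:end]

chunks = ["chunk 0", "chunk 1", "chunk 2", "chunk 3"]
print(with_neighbors(chunks, 2))  # → ['chunk 1', 'chunk 2', 'chunk 3']
```

In a real app, the ordering would come from something like a `chunk_number` property stored with each chunk, so that neighboring chunks can be fetched with a filtered query.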

Database size

The larger the chunks, the fewer chunks there will be, and the smaller the database will be. This may be important if you are working with a large dataset.
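To make the trade-off concrete, here is a quick back-of-the-envelope calculation for a hypothetical corpus (the corpus size and chunk sizes are illustrative assumptions):

```python
import math

# Hypothetical corpus of 1,000,000 words, chunked at different sizes.
total_words = 1_000_000

for chunk_size in (50, 100, 200, 400):
    # Each chunk becomes one object (and one vector) in the database.
    n_chunks = math.ceil(total_words / chunk_size)
    print(f"{chunk_size:>4} words/chunk -> {n_chunks:>6} objects to store")
```

Doubling the chunk size halves the number of objects, and with it the number of vectors the database must index and search.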

Model requirements

You will need to ensure that the chunk size is within the model's maximum allowed size (context window). This goes for generating embeddings, as well as for RAG.
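One way to guard against oversized chunks is a simple length check before embedding. Note that model limits are defined in tokens, not words, so a word count is only a rough proxy; a production check should use the model's own tokenizer. The limit below is an illustrative assumption:

```python
# MAX_WORDS is an assumed, illustrative limit; real models define
# their context window in tokens, which a subword tokenizer would
# count more accurately than a whitespace split.
MAX_WORDS = 256

def fits_model(chunk: str, max_words: int = MAX_WORDS) -> bool:
    """Rough check that a chunk is within the model's input limit."""
    return len(chunk.split()) <= max_words

assert fits_model("a short chunk")
assert not fits_model("word " * 300)  # 300 words exceeds the limit
```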

RAG workflows

As discussed earlier, shorter chunks make it easier to include many chunks from a variety of sources, but may not provide enough context. Longer chunks provide more context, but fewer of them will fit into the model's context window.

Rule of thumb

Having said all that, it may be helpful to have a rule of thumb to start with. We suggest starting with a chunk size of 100-150 words and going from there.

Then, you can modify the chunk size based on the considerations above, and your observations on your app's performance.
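This rule of thumb can be sketched as a simple word-based chunker. The default of 120 words sits inside the suggested 100-150 word range; the function itself is a minimal illustration, not a library implementation:

```python
def chunk_by_words(text: str, chunk_size: int = 120) -> list[str]:
    """Split `text` into chunks of roughly `chunk_size` words.

    120 words is an assumed starting point within the suggested
    100-150 word range; tune it based on the considerations above.
    """
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

sample = ("lorem " * 250).strip()  # a 250-word dummy text
chunks = chunk_by_words(sample)
print([len(c.split()) for c in chunks])  # → [120, 120, 10]
```

From here, you can swap in more sophisticated strategies (sentence- or paragraph-aware splitting, overlapping windows) as your observations dictate.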

Data modelling

By definition, chunking your source data will mean creating multiple objects out of one source.

Accordingly, you should consider how to model your data to capture the relationships between the chunks and the source data, as well as between chunks. This may help you to efficiently retrieve what you need, such as the metadata relating to the source, or surrounding chunks.

Collection definition examples

Consider a Weaviate database designed to store data from a library of reference books.

A whole book is likely too long to embed as a single vector, so you may want to chunk the books into paragraphs. Having done so, you may want to create a Book collection and a Paragraph collection, with the Paragraph collection having the cross-reference property fromBook. This will allow you to retrieve the book metadata from the Book collection, and the surrounding paragraphs from the Paragraph collection.

So, for example, you may build a Book collection like this:

{
    "class": "Book",
    "properties": [
        ... // other class properties
        {
            "name": "title",
            "dataType": ["text"]
        },
        {
            "name": "text",
            "dataType": ["text"]
        }
    ],
    "vectorIndexConfig": {
        "skip": true
    }
    ... // other class attributes
}

And add a Paragraph collection like this, that references the Book collection:

{
    "class": "Paragraph",
    "properties": [
        ... // other class properties
        {
            "name": "body",
            "dataType": ["text"]
        },
        {
            "name": "chunk_number",
            "dataType": ["int"]
        },
        {
            "name": "fromBook",
            "dataType": ["Book"]
        }
    ],
    ... // other class attributes (e.g. vectorizer)
}

Note that in this configuration, the Book collection is not vectorized, but the Paragraph collection is. This will allow the Book collection to be used for storage and retrieval of metadata, while the Paragraph collection is used for search.

This is just one example of how you could model your data. You may want to experiment with different configurations to see what works best for your use case.

Questions and feedback

If you have any questions or feedback, please let us know on our forum.