
Example part 2 - Search

Preview unit

This is a preview version of this unit, so some sections, such as videos and quiz questions, are not yet complete. Please check back later for the full version. In the meantime, feel free to provide feedback through the comments below.

Overview

In the preceding section, we imported multiple chapters of a book into Weaviate using different chunking techniques. They were:

  • Fixed-length chunks (with 20% overlap)
    • With 25 words per chunk, and
    • With 100 words per chunk
  • Variable-length chunks, using paragraph markers, and
  • Mixed-strategy chunks, using paragraph markers and a minimum chunk length of 25 words.
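As a reminder of what these strategies do, here is a minimal sketch of all three. It is not the code from the preceding section: the function names, the whitespace-based word splitting, and the use of a blank line as the paragraph marker are assumptions made for illustration.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap_fraction: float = 0.2) -> List[str]:
    """Split text into chunks of `chunk_size` words, each also carrying the
    last few words of the previous chunk as overlap."""
    words = text.split()
    overlap = int(chunk_size * overlap_fraction)
    chunks = []
    for start in range(0, len(words), chunk_size):
        # Reach back `overlap` words into the previous chunk (never below index 0)
        chunk_words = words[max(start - overlap, 0): start + chunk_size]
        chunks.append(" ".join(chunk_words))
    return chunks


def paragraph_chunks(text: str, min_words: int = 0, marker: str = "\n\n") -> List[str]:
    """Split text at paragraph markers. Paragraphs shorter than `min_words`
    are merged into the following paragraph (the mixed strategy)."""
    chunks = []
    buffer = ""
    for paragraph in text.split(marker):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        buffer = f"{buffer} {paragraph}".strip()
        if len(buffer.split()) >= min_words:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        # Keep any trailing text that never reached the minimum length
        chunks.append(buffer)
    return chunks
```

With `min_words=0`, `paragraph_chunks` behaves like the variable-length strategy; with `min_words=25`, it behaves like the mixed strategy described above.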

Now, we will use Weaviate to search through the book and evaluate the impact of the chunking techniques.

Since the data comes from the first two chapters of a book about Git, let's search for various git-related concepts and see how the different chunking strategies perform.

Search / recall

First, we'll retrieve information from our Weaviate instance using various search terms. We'll use semantic search (nearText) to retrieve the most relevant chunks.

Search syntax

The search is carried out as follows, looping through each chunking strategy by filtering our dataset. For each search term, we retrieve the top two results per chunking strategy.

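# `client` and `chunk_obj_sets` are assumed to be defined as in the preceding section
search_string = "history of git"  # Or "how to add the url of a remote repository"

for chunking_strategy in chunk_obj_sets.keys():
    # Restrict the search to chunks created with this strategy
    where_filter = {
        "path": ["chunking_strategy"],
        "operator": "Equal",
        "valueText": chunking_strategy
    }
    response = (
        client.query.get("Chunk", ["chunk"])
        .with_near_text({"concepts": [search_string]})
        .with_where(where_filter)
        .with_limit(2)  # Top two results per strategy
        .do()
    )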
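To inspect the results, the retrieved chunks need to be pulled out of `response`. The sketch below assumes the response is a nested dictionary of the form `{"data": {"Get": {"Chunk": [...]}}}`, which is how the query builder used above typically returns results. The helper name `format_results` and the example response are hypothetical.

```python
def format_results(response: dict, chunking_strategy: str) -> str:
    """Format the chunks retrieved for one chunking strategy as plain text."""
    # Fall back to an empty list if the response has no results
    objects = response.get("data", {}).get("Get", {}).get("Chunk") or []
    lines = ["=" * 20, f"Retrieved objects for {chunking_strategy}"]
    for i, obj in enumerate(objects):
        lines.append(f"===== Object {i} =====")
        lines.append(obj["chunk"])
    return "\n".join(lines)


# Hypothetical response, shaped like the assumed query result
example_response = {
    "data": {"Get": {"Chunk": [{"chunk": "first chunk text"}, {"chunk": "second chunk text"}]}}
}
print(format_results(example_response, "fixed_size_25"))
```

Inside the loop, a call such as `print(format_results(response, chunking_strategy))` would print one block per strategy.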

We used these search terms:

  • "history of git"
  • "how to add the url of a remote repository"

Results & discussions

We get the following results:

Example 1

Results for a search for "history of git".
====================
Retrieved objects for fixed_size_25
===== Object 0 =====
=== A Short History of Git As with many great things in life, Git began with a bit of creative destruction and fiery controversy. The
===== Object 1 =====
kernel efficiently (speed and data size) Since its birth in 2005, Git has evolved and matured to be easy to use and yet retain these initial qualities. It's amazingly fast,

The query in this example is a broad one about the history of git. Here, the longer chunks seem to perform better.

Inspecting the results, we see that while the 25-word chunks may be semantically similar to the query history of git, they do not contain enough contextual information to enhance the reader's understanding of the topic.

On the other hand, the paragraph chunks retrieved - especially those with a minimum length of 25 words - contain a good amount of holistic information that will teach the reader about the history of git.

Example 2

Results for a search for "how to add the url of a remote repository".
====================
Retrieved objects for fixed_size_25
===== Object 0 =====
remote))) To add a new remote Git repository as a shortname you can reference easily, run `git remote add <shortname> <url>`: [source,console] ---- $ git remote origin $ git remote
===== Object 1 =====
to and from them when you need to share work. Managing remote repositories includes knowing how to add remote repositories, remove remotes that are no longer valid, manage various remote

The query in this example is more specific, such as one a user might run to find out how to add the URL of a remote repository.

In contrast to the first scenario, the 25-word chunks are more useful here. Because the question was very specific, Weaviate was able to identify the chunk containing the most suitable passage - how to add a remote repository (git remote add <shortname> <url>).

While the other result sets also contain some of this information, it may be worth considering how the result may be used and displayed. The longer the result, the more cognitive effort it may take the user to identify the relevant information.

Retrieval augmented generation (RAG)

Next, let's take a look at the impact of chunking on RAG.

We discussed the relationship between chunk size and RAG earlier. Using shorter chunks will allow you to include information from a wider range of source objects than longer chunks, but each object will not include as much contextual information. On the other hand, using longer chunks means each chunk will include more contextual information, but you will be limited to fewer source objects.
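As a rough illustration of this trade-off, suppose the text passed to the LLM is capped at a fixed number of words. The 200-word budget and the helper name below are assumptions chosen for illustration; they are not values taken from Weaviate or from any particular LLM.

```python
def chunks_within_budget(word_budget: int, words_per_chunk: int) -> int:
    """Number of whole chunks that fit in a fixed word budget (at least one)."""
    return max(word_budget // words_per_chunk, 1)


word_budget = 200  # Assumed budget, for illustration only
for words_per_chunk in (25, 100):
    n_chunks = chunks_within_budget(word_budget, words_per_chunk)
    print(f"{words_per_chunk}-word chunks: {n_chunks} source objects")
```

Under this assumption, the same budget covers eight 25-word chunks but only two 100-word chunks, so shorter chunks draw on four times as many source objects.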

Let's try a few RAG examples to see how this manifests itself.

Query syntax

The query syntax is shown below. The syntax is largely the same as above, except for two aspects.

One is that to account for varying chunk sizes, we will retrieve more chunks where the chunk size is smaller.

The other is that the query has been modified to perform RAG, rather than a simple retrieval. The query asks the target LLM to summarize the results into point form.

# Set number of chunks to retrieve to compensate for different chunk sizes
n_chunks_by_strat = dict()
# Grab more of shorter chunks
n_chunks_by_strat['fixed_size_25'] = 8
n_chunks_by_strat['para_chunks'] = 8
# Grab fewer of longer chunks
n_chunks_by_strat['fixed_size_100'] = 2
n_chunks_by_strat['para_chunks_min_25'] = 2

# Perform retrieval augmented generation
search_string = "history of git"  # Or "available git remote commands"

for chunking_strategy in chunk_obj_sets.keys():
    # Restrict the search to chunks created with this strategy
    where_filter = {
        "path": ["chunking_strategy"],
        "operator": "Equal",
        "valueText": chunking_strategy
    }
    response = (
        client.query.get("Chunk", ["chunk"])
        .with_near_text({"concepts": [search_string]})
        .with_generate(
            grouped_task=f"Using this information, please explain {search_string} in a few short points"
        )
        .with_where(where_filter)
        .with_limit(n_chunks_by_strat[chunking_strategy])  # Variable number of chunks retrieved
        .do()
    )
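The generated text then needs to be read out of `response`. The sketch below assumes that the result of a grouped task is attached to the first returned object, under `_additional` → `generate` → `groupedResult`. The helper name `get_generated_text` and the example response are hypothetical, so check the response structure against your client version.

```python
from typing import Optional


def get_generated_text(response: dict, collection: str = "Chunk") -> Optional[str]:
    """Return the grouped-task output from a generative query response, if any."""
    objects = response.get("data", {}).get("Get", {}).get(collection) or []
    if not objects:
        return None
    # Assumption: the grouped result is stored on the first object only
    generate = (objects[0].get("_additional") or {}).get("generate") or {}
    return generate.get("groupedResult")


# Hypothetical response, shaped like the assumed generative query result
example_response = {
    "data": {
        "Get": {
            "Chunk": [
                {"chunk": "some text", "_additional": {"generate": {"groupedResult": "- A short point"}}},
                {"chunk": "more text", "_additional": {"generate": None}},
            ]
        }
    }
}
print(get_generated_text(example_response))
```

Returning `None` when nothing was generated lets the calling loop skip a strategy without raising an error.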

Results & discussions

Example 1

Results for a search for "history of git".
Generated text for fixed_size_25
- Git was created in 2005 as a result of creative destruction and controversy.
- It was designed to handle the Linux kernel efficiently in terms of speed and data size.
- Over time, Git has evolved to be easy to use while retaining its initial qualities.
- Git reconsiders many aspects of version control, making it more like a mini filesystem with powerful tools.
- Git stores the entire history of a project locally, allowing for fast and instantaneous operations.

The findings here are similar to the semantic search results. The longer chunks contain more information, and are more useful for a broad topic like the history of git.

Example 2

Results for a search for "available git remote commands".
Generated text for fixed_size_25
- `git fetch <remote>`: This command retrieves data from the remote repository specified by `<remote>`.
- `git remote show <remote>`: Running this command with a specific shortname, such as `origin`, displays information about the remote repository, including its branches and configuration.
- `git remote`: This command lists all the remote servers that have been configured for the repository.
- `git remote -v`: Similar to `git remote`, this command lists all the remote servers along with their URLs for fetching and pushing.
- `git clone`: This command is used to create a local copy of a remote repository. By default, it sets up the local `master` branch to track the remote repository's `master` branch.
- `git remote add <name> <url>`: This command adds a new remote repository with the specified `<name>` and `<url>`. This allows you to easily fetch and push changes to and from the remote repository.
- `git remote remove <name>`: This command removes the remote repository with the specified `<name>` from the local repository.

The results of the generative search here for available git remote commands are perhaps even more illustrative than before.

Here, the shortest chunks were able to retrieve the highest number of git remote commands from the book. This is because we were able to retrieve more chunks from various locations throughout the corpus (book).

Contrast this result with the one where longer chunks are used. There, we were able to retrieve only one git remote command, because we retrieved fewer chunks than before.

Discussions

You see here the trade-off between using shorter and longer chunks.

Using shorter chunks allows you to retrieve information from more objects, but each object will contain less contextual information. On the other hand, using longer chunks limits you to fewer objects, but each object will contain more contextual information.

Even when using LLMs with very large context windows, this is something to keep in mind. Longer input texts mean higher API fees or longer inference times. In other words, there are costs associated with using longer chunks.

Often, this is the trade-off that you will need to consider when deciding on the chunking strategy for a RAG use-case.
