Example part 2 - Search
This is a preview version of this unit, so some sections - such as videos and quiz questions - are not yet complete. Please check back later for the full version, and in the meantime, feel free to provide any feedback through the comments below.
Overview
In the preceding section, we imported multiple chapters of a book into Weaviate using different chunking techniques. They were:
- Fixed-length chunks (with 20% overlap):
  - With 25 words per chunk, and
  - With 100 words per chunk
- Variable-length chunks, using paragraph markers, and
- Mixed-strategy chunks, using paragraph markers and a minimum chunk length of 25 words.
Now, we will use Weaviate to search through the book and evaluate the impact of the chunking techniques.
Since the data comes from the first two chapters of a book about Git, let's search for various git-related concepts and see how the different chunking strategies perform.
Search / recall
First of all, we'll retrieve information from our Weaviate instance using various search terms. We'll use a semantic search (`nearText`) to retrieve the most relevant chunks.
Search syntax
The search is carried out as follows, looping through each chunking strategy and filtering our dataset accordingly. We'll retrieve the top two results for each search term.
- Python
search_string = "history of git"  # Or "how to add the url of a remote repository"

for chunking_strategy in chunk_obj_sets.keys():
    where_filter = {
        "path": ["chunking_strategy"],
        "operator": "Equal",
        "valueText": chunking_strategy
    }
    response = (
        client.query.get("Chunk", ["chunk"])
        .with_near_text({"concepts": [search_string]})
        .with_where(where_filter)
        .with_limit(2)
        .do()
    )
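The `response` object is a plain Python dictionary. As a minimal sketch (placed inside the loop above, and assuming the v3-style client response used here), the retrieved chunks could be printed out to produce listings like the ones shown below:

    # Print the retrieved chunks for this strategy (continuing inside the loop above)
    print("=" * 20)
    print(f"Retrieved objects for {chunking_strategy}")
    for i, chunk_obj in enumerate(response["data"]["Get"]["Chunk"]):
        print(f"===== Object {i} =====")
        print(chunk_obj["chunk"])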
Using these search terms:
- "history of git"
- "how to add the url of a remote repository"
Results & discussions
We get the following results:
Example 1
"history of git"
- 25-word chunks
- 100-word chunks
- Paragraph chunks
- Paragraph chunks with minimum length
====================
Retrieved objects for fixed_size_25
===== Object 0 =====
=== A Short History of Git As with many great things in life, Git began with a bit of creative destruction and fiery controversy. The
===== Object 1 =====
kernel efficiently (speed and data size) Since its birth in 2005, Git has evolved and matured to be easy to use and yet retain these initial qualities. It's amazingly fast,
====================
Retrieved objects for fixed_size_100
===== Object 0 =====
=== A Short History of Git As with many great things in life, Git began with a bit of creative destruction and fiery controversy. The Linux kernel is an open source software project of fairly large scope.(((Linux))) During the early years of the Linux kernel maintenance (1991–2002), changes to the software were passed around as patches and archived files. In 2002, the Linux kernel project began using a proprietary DVCS called BitKeeper.(((BitKeeper))) In 2005, the relationship between the community that developed the Linux kernel and the commercial company that developed BitKeeper broke down, and the tool's free-of-charge status was revoked.
===== Object 1 =====
2005, Git has evolved and matured to be easy to use and yet retain these initial qualities. It's amazingly fast, it's very efficient with large projects, and it has an incredible branching system for non-linear development (see <<ch03-git-branching#ch03-git-branching>>).
====================
Retrieved objects for para_chunks
===== Object 0 =====
Since its birth in 2005, Git has evolved and matured to be easy to use and yet retain these initial qualities.
It's amazingly fast, it's very efficient with large projects, and it has an incredible branching system for non-linear development (see <<ch03-git-branching#ch03-git-branching>>).
===== Object 1 =====
As with many great things in life, Git began with a bit of creative destruction and fiery controversy.
====================
Retrieved objects for para_chunks_min_25
===== Object 0 =====
=== A Short History of Git
As with many great things in life, Git began with a bit of creative destruction and fiery controversy.
The Linux kernel is an open source software project of fairly large scope.(((Linux)))
During the early years of the Linux kernel maintenance (1991–2002), changes to the software were passed around as patches and archived files.
In 2002, the Linux kernel project began using a proprietary DVCS called BitKeeper.(((BitKeeper)))
In 2005, the relationship between the community that developed the Linux kernel and the commercial company that developed BitKeeper broke down, and the tool's free-of-charge status was revoked.
This prompted the Linux development community (and in particular Linus Torvalds, the creator of Linux) to develop their own tool based on some of the lessons they learned while using BitKeeper.(((Linus Torvalds)))
Some of the goals of the new system were as follows:
* Speed
* Simple design
* Strong support for non-linear development (thousands of parallel branches)
* Fully distributed
* Able to handle large projects like the Linux kernel efficiently (speed and data size)
Since its birth in 2005, Git has evolved and matured to be easy to use and yet retain these initial qualities.
It's amazingly fast, it's very efficient with large projects, and it has an incredible branching system for non-linear development (see <<ch03-git-branching#ch03-git-branching>>).
===== Object 1 =====
== Nearly Every Operation Is Local
Most operations in Git need only local files and resources to operate -- generally no information is needed from another computer on your network.
If you're used to a CVCS where most operations have that network latency overhead, this aspect of Git will make you think that the gods of speed have blessed Git with unworldly powers.
Because you have the entire history of the project right there on your local disk, most operations seem almost instantaneous.
For example, to browse the history of the project, Git doesn't need to go out to the server to get the history and display it for you -- it simply reads it directly from your local database.
This means you see the project history almost instantly.
If you want to see the changes introduced between the current version of a file and the file a month ago, Git can look up the file a month ago and do a local difference calculation, instead of having to either ask a remote server to do it or pull an older version of the file from the remote server to do it locally.
This also means that there is very little you can't do if you're offline or off VPN.
If you get on an airplane or a train and want to do a little work, you can commit happily (to your _local_ copy, remember?) until you get to a network connection to upload.
If you go home and can't get your VPN client working properly, you can still work.
In many other systems, doing so is either impossible or painful.
In Perforce, for example, you can't do much when you aren't connected to the server; in Subversion and CVS, you can edit files, but you can't commit changes to your database (because your database is offline).
This may not seem like a huge deal, but you may be surprised what a big difference it can make.
The query in this example is a broad one, on the history of Git. Here, the longer chunks seem to perform better.
Inspecting the results, we see that while the 25-word chunks may be semantically similar to the query "history of git", they do not contain enough contextual information to enhance the reader's understanding of the topic.
On the other hand, the paragraph chunks retrieved - especially those with a minimum length of 25 words - contain a good amount of holistic information that will teach the reader about the history of Git.
Example 2
"how to add the url of a remote repository"
- 25-word chunks
- 100-word chunks
- Paragraph chunks
- Paragraph chunks with minimum length
====================
Retrieved objects for fixed_size_25
===== Object 0 =====
remote))) To add a new remote Git repository as a shortname you can reference easily, run `git remote add <shortname> <url>`: [source,console] ---- $ git remote origin $ git remote
===== Object 1 =====
to and from them when you need to share work. Managing remote repositories includes knowing how to add remote repositories, remove remotes that are no longer valid, manage various remote
====================
Retrieved objects for fixed_size_100
===== Object 0 =====
adds the `origin` remote for you. Here's how to add a new remote explicitly.(((git commands, remote))) To add a new remote Git repository as a shortname you can reference easily, run `git remote add <shortname> <url>`: [source,console] ---- $ git remote origin $ git remote add pb https://github.com/paulboone/ticgit $ git remote -v origin https://github.com/schacon/ticgit (fetch) origin https://github.com/schacon/ticgit (push) pb https://github.com/paulboone/ticgit (fetch) pb https://github.com/paulboone/ticgit (push) ---- Now you can use the string `pb` on the command line in lieu of the whole URL. For example, if you want to fetch all the information that Paul has but that you don't yet have in your repository, you can run `git fetch pb`: [source,console] ---- $ git fetch pb remote: Counting objects: 43,
===== Object 1 =====
Managing remote repositories includes knowing how to add remote repositories, remove remotes that are no longer valid, manage various remote branches and define them as being tracked or not, and more. In this section, we'll cover some of these remote-management skills. [NOTE] .Remote repositories can be on your local machine. ==== It is entirely possible that you can be working with a "`remote`" repository that is, in fact, on the same host you are. The word "`remote`" does not necessarily imply that the repository is somewhere else on the network or Internet, only that it is elsewhere. Working with such a remote repository would still involve all the standard pushing, pulling and fetching operations as with any other remote. ====
====================
Retrieved objects for para_chunks
===== Object 0 =====
We've mentioned and given some demonstrations of how the `git clone` command implicitly adds the `origin` remote for you.
Here's how to add a new remote explicitly.(((git commands, remote)))
To add a new remote Git repository as a shortname you can reference easily, run `git remote add <shortname> <url>`:
===== Object 1 =====
To be able to collaborate on any Git project, you need to know how to manage your remote repositories.
Remote repositories are versions of your project that are hosted on the Internet or network somewhere.
You can have several of them, each of which generally is either read-only or read/write for you.
Collaborating with others involves managing these remote repositories and pushing and pulling data to and from them when you need to share work.
Managing remote repositories includes knowing how to add remote repositories, remove remotes that are no longer valid, manage various remote branches and define them as being tracked or not, and more.
In this section, we'll cover some of these remote-management skills.
====================
Retrieved objects for para_chunks_min_25
===== Object 0 =====
== Adding Remote Repositories
We've mentioned and given some demonstrations of how the `git clone` command implicitly adds the `origin` remote for you.
Here's how to add a new remote explicitly.(((git commands, remote)))
To add a new remote Git repository as a shortname you can reference easily, run `git remote add <shortname> <url>`:
[source,console]
----
$ git remote
origin
$ git remote add pb https://github.com/paulboone/ticgit
$ git remote -v
origin https://github.com/schacon/ticgit (fetch)
origin https://github.com/schacon/ticgit (push)
pb https://github.com/paulboone/ticgit (fetch)
pb https://github.com/paulboone/ticgit (push)
----
Now you can use the string `pb` on the command line in lieu of the whole URL.
For example, if you want to fetch all the information that Paul has but that you don't yet have in your repository, you can run `git fetch pb`:
[source,console]
----
$ git fetch pb
remote: Counting objects: 43, done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 43 (delta 10), reused 31 (delta 5)
Unpacking objects: 100% (43/43), done.
From https://github.com/paulboone/ticgit
* [new branch] master -> pb/master
* [new branch] ticgit -> pb/ticgit
----
Paul's `master` branch is now accessible locally as `pb/master` -- you can merge it into one of your branches, or you can check out a local branch at that point if you want to inspect it.
We'll go over what branches are and how to use them in much more detail in <<ch03-git-branching#ch03-git-branching>>.
[[_fetching_and_pulling]]
===== Object 1 =====
[[_remote_repos]]= Working with Remotes
To be able to collaborate on any Git project, you need to know how to manage your remote repositories.
Remote repositories are versions of your project that are hosted on the Internet or network somewhere.
You can have several of them, each of which generally is either read-only or read/write for you.
Collaborating with others involves managing these remote repositories and pushing and pulling data to and from them when you need to share work.
Managing remote repositories includes knowing how to add remote repositories, remove remotes that are no longer valid, manage various remote branches and define them as being tracked or not, and more.
In this section, we'll cover some of these remote-management skills.
[NOTE]
.Remote repositories can be on your local machine.
The query in this example was a more specific one - for example, one that might be run by a user looking to find out how to add the URL of a remote repository.
In contrast to the first scenario, the 25-word chunks are more useful here. Because the question was very specific, Weaviate was able to identify the chunk containing the most suitable passage - how to add a remote repository (`git remote add <shortname> <url>`).
While the other result sets also contain some of this information, it may be worth considering how the result may be used and displayed. The longer the result, the more cognitive effort it may take the user to identify the relevant information.
Retrieval augmented generation (RAG)
Next, let's take a look at the impact of chunking on RAG.
We discussed the relationship between chunk size and RAG earlier. Using shorter chunks will allow you to include information from a wider range of source objects than longer chunks, but each object will not include as much contextual information. On the other hand, using longer chunks means each chunk will include more contextual information, but you will be limited to fewer source objects.
Let's try a few RAG examples to see how this manifests itself.
Query syntax
The query syntax is shown below. The syntax is largely the same as above, except for two aspects.
One is that to account for varying chunk sizes, we will retrieve more chunks where the chunk size is smaller.
The other is that the query has been modified to perform RAG, rather than a simple retrieval. The query asks the target LLM to summarize the results into point form.
- Python
# Set the number of chunks to retrieve to compensate for different chunk sizes
n_chunks_by_strat = dict()

# Grab more of the shorter chunks
n_chunks_by_strat['fixed_size_25'] = 8
n_chunks_by_strat['para_chunks'] = 8

# Grab fewer of the longer chunks
n_chunks_by_strat['fixed_size_100'] = 2
n_chunks_by_strat['para_chunks_min_25'] = 2

# Perform retrieval augmented generation
search_string = "history of git"  # Or "available git remote commands"

for chunking_strategy in chunk_obj_sets.keys():
    where_filter = {
        "path": ["chunking_strategy"],
        "operator": "Equal",
        "valueText": chunking_strategy
    }
    response = (
        client.query.get("Chunk", ["chunk"])
        .with_near_text({"concepts": [search_string]})
        .with_generate(
            grouped_task=f"Using this information, please explain {search_string} in a few short points"
        )
        .with_where(where_filter)
        .with_limit(n_chunks_by_strat[chunking_strategy])  # Variable number of chunks retrieved
        .do()
    )
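The generated output is returned alongside the retrieved objects. As a minimal sketch (again inside the loop, and assuming the v3-style response structure used above), the grouped result can be read from the `_additional` field of the first returned object:

    # Extract the grouped generation attached to the first returned object
    chunk_objects = response["data"]["Get"]["Chunk"]
    generated_text = chunk_objects[0]["_additional"]["generate"]["groupedResult"]
    print(f"Generated text for {chunking_strategy}")
    print(generated_text)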
Results & discussions
Example 1
"history of git"
- 25-word chunks
- 100-word chunks
- Paragraph chunks
- Paragraph chunks with minimum length
Generated text for fixed_size_25
- Git was created in 2005 as a result of creative destruction and controversy.
- It was designed to handle the Linux kernel efficiently in terms of speed and data size.
- Over time, Git has evolved to be easy to use while retaining its initial qualities.
- Git reconsiders many aspects of version control, making it more like a mini filesystem with powerful tools.
- Git stores the entire history of a project locally, allowing for fast and instantaneous operations.
Generated text for fixed_size_100
- In the early years of the Linux kernel maintenance (1991-2002), changes to the software were passed around as patches and archived files.
- In 2002, the Linux kernel project started using a proprietary DVCS called BitKeeper.
- In 2005, the relationship between the Linux kernel community and the company behind BitKeeper broke down, leading to the revocation of the tool's free-of-charge status.
- Since then, Git has evolved and matured, becoming easy to use while retaining its initial qualities. It is known for its speed, efficiency with large projects, and its powerful branching system for non-linear development.
Generated text for para_chunks
- Git was created in 2005 and has since evolved and matured to be easy to use and efficient with large projects.
- Git has an incredibly fast performance and a powerful branching system for non-linear development.
- Git began with controversy and creative destruction.
- Git is fundamentally different from other version control systems (VCS) in the way it thinks about and stores data.
- Git operates mostly on local files and resources, making operations fast and efficient.
- Git has integrity and ensures the integrity of its data.
- Git is more like a mini filesystem with powerful tools built on top of it, rather than just a VCS.
Generated text for para_chunks_min_25
- Git was created in 2005 by the Linux development community, led by Linus Torvalds, after the breakdown of their relationship with the proprietary DVCS called BitKeeper.
- The goals of Git were to be fast, have a simple design, support non-linear development with thousands of parallel branches, be fully distributed, and handle large projects efficiently.
- Git has evolved and matured since its creation, becoming easy to use while retaining its initial qualities.
- One of the key advantages of Git is that nearly every operation is local, meaning that most operations can be performed without needing information from another computer on the network.
- This local nature of Git allows for fast and instantaneous operations, such as browsing the project history or comparing file versions.
- Being able to work offline or off VPN is also a significant advantage of Git, as it allows users to continue working and committing changes to their local copy until they have a network connection to upload.
The findings here are similar to those from the semantic search. The longer chunks contain more information and are more useful for a broad topic like the history of Git.
Example 2
"available git remote commands"
- 25-word chunks
- 100-word chunks
- Paragraph chunks
- Paragraph chunks with minimum length
Generated text for fixed_size_25
- `git fetch <remote>`: This command retrieves data from the remote repository specified by `<remote>`.
- `git remote show <remote>`: Running this command with a specific shortname, such as `origin`, displays information about the remote repository, including its branches and configuration.
- `git remote`: This command lists all the remote servers that have been configured for the repository.
- `git remote -v`: Similar to `git remote`, this command lists all the remote servers along with their URLs for fetching and pushing.
- `git clone`: This command is used to create a local copy of a remote repository. By default, it sets up the local `master` branch to track the remote repository's `master` branch.
- `git remote add <name> <url>`: This command adds a new remote repository with the specified `<name>` and `<url>`. This allows you to easily fetch and push changes to and from the remote repository.
- `git remote remove <name>`: This command removes the remote repository with the specified `<name>` from the local repository.
Generated text for fixed_size_100
- The `git remote` command is used to see which remote servers are configured for the repository. It lists the shortnames of each remote handle that has been specified.
- The `git remote -v` command can be used to display more detailed information about the remote repositories, including the URLs for fetching and pushing.
- The `git clone` command automatically adds the `origin` remote when cloning a repository.
- To add a new remote explicitly, the `git remote add <name> <url>` command can be used. This allows for pulling and pushing to the specified remote repository.
Generated text for para_chunks
- The `git remote` command lists the shortnames of each remote handle that you have configured.
- The `git remote show <remote>` command provides more information about a particular remote.
- The `git remote -v` command shows the URLs associated with each remote.
- The `git remote add <shortname> <url>` command adds a new remote Git repository with a specified shortname and URL.
- The `git remote` command can be used to show all the remotes associated with a repository.
Generated text for para_chunks_min_25
- The `git remote` command is used to see which remote servers you have configured. It lists the shortnames of each remote handle you've specified.
- The `git remote -v` command shows the URLs that Git has stored for the shortname to be used when reading and writing to that remote.
- The `git remote show <remote>` command provides more information about a particular remote, including the URL for the remote repository, tracking branch information, and details about branches that can be automatically merged or pushed to.
The results of the generative search here, for the query "available git remote commands", are perhaps even more illustrative than before.
Here, the shortest chunks were able to surface the highest number of `git remote` commands from the book. This is because we were able to retrieve more chunks, from various locations throughout the corpus (the book).
Contrast this with the results for the longer chunks, where we were able to retrieve fewer `git remote` commands because we retrieved fewer chunks overall.
Discussions
You see here the trade-off between using shorter and longer chunks.
Using shorter chunks allows you to retrieve information from more objects, but each object will contain less contextual information. On the other hand, using longer chunks means retrieving from fewer objects, but each one will contain more contextual information.
Even when using LLMs with very large context windows, this is something to keep in mind. Longer input texts mean higher API fees and longer inference times. In other words, there are costs associated with using longer chunks.
Often, this is the trade-off that you will need to consider when deciding on the chunking strategy for a RAG use-case.
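To get a rough feel for the cost side of this trade-off, you could estimate how much text each strategy actually sends to the LLM. The sketch below is illustrative only: `estimate_prompt_tokens` is a hypothetical helper, and the words-to-tokens ratio is a rough rule of thumb rather than an exact figure.

    # Roughly estimate the prompt size a strategy sends to the LLM,
    # based on the chunks retrieved in the RAG query above.
    # (Run inside the loop above to compare strategies.)
    def estimate_prompt_tokens(chunks):
        total_words = sum(len(chunk.split()) for chunk in chunks)
        return int(total_words * 1.3)  # rough words-to-tokens heuristic

    retrieved_chunks = [obj["chunk"] for obj in response["data"]["Get"]["Chunk"]]
    print(f"{chunking_strategy}: ~{estimate_prompt_tokens(retrieved_chunks)} tokens sent to the LLM")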
Questions and feedback
If you have any questions or feedback, let us know in the user forum.