Pulling back the curtains on text2vec

January 10, 2023 · 10 min read

Educator

You probably know that Weaviate converts a text corpus into a set of vectors - each object is given a vector that captures its 'meaning'. But you might not know exactly how it does that, or how to adjust that behavior. Here, we will pull back the curtains to examine those questions, by revealing some of the mechanics behind text2vec's magic.

First, we will reproduce Weaviate's output vector using only an external API. Then we will see how the text vectorization process can be tweaked, before wrapping up by discussing a few considerations also.

Background

I often find myself saying that Weaviate makes it fast and easy to produce a vector database from text. But it can be easy to forget just how fast and how easy it can make things.

It is true that even in the “old days” of say, five to ten years ago, producing a database with vector capabilities was technically possible. You simply had to (inhales deeply) develop a vectorization algorithm, vectorize the data, build a vector index, build a database with the underlying data, integrate the vector index with the database, then forward results from a vector index query to the database and combine the outputs from both (exhales).

The past for vector searching definitely was not a “simpler time”, and the appeal of modern vector databases like Weaviate is pretty clear given this context.

But while the future is here, it isn't yet perfect. Tools like Weaviate can seem like a magician's mystery box. Our users in turn ask us exactly how Weaviate does its magic; how it turns all of that data into vectors, and how to control the process.

So let's take a look inside the magic box together in this post.

If you would like to follow along, the Jupyter notebook and data are available here. You can use our free Weaviate Cloud Services (WCS) sandbox, or set up your own Weaviate instance also.

Note: The vectorization is done by Weaviate “core” and not at the client level. So even though we use Python examples, the principles are universally applicable. Let's get started!

Text2vec: behind the scenes

Weaviate's text2vec-* modules transform text data into dense vectors for populating a Weaviate database. Each text2vec-* module uses an external API (like text2vec-openai or text2vec-huggingface) or a local instance like text2vec-transformers to produce a vector for each object.

Let's try vectorizing data with the text2vec-cohere module. We will be using data from tiny_jeopardy.csv available here containing questions from the game show Jeopardy. We'll just use a few (20) questions here, but the full dataset on Kaggle includes 200k+ questions.

Load the data into a Pandas dataframe, then populate Weaviate like this:

client.batch.configure(batch_size=100)  # Configure batch
with client.batch as batch:
    for i, row in df.iterrows():
        properties = {
            "question": row.Question,
            "answer": row.Answer
        }
        batch.add_data_object(properties, "Question")

This should add a series of Question objects with text properties like this:

{'question': 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger',
 'answer': "McDonald's"}

Since we use the text2vec-cohere module to vectorize our data, we can query Weaviate to find data objects most similar to any input text. Look for entries closest to fast food chains like so:

near_text = {"concepts": ["fast food chains"]}
wv_resp = client.query.get(
    "Question",
    ["question", "answer"]
).with_limit(2).with_near_text(
    near_text
).with_additional(['distance', 'vector']).do()

This yields the McDonald's Question object above, including the object vector and the distance. The result is a 768-dimensional vector that is about 0.1 away from the query vector.

This all makes intuitive sense - the entry related to the largest fast food chain (McDonald's) is returned from our “fast food chains” query.

But wait, how was that vector derived?

This is the 'magic' part. So let's look behind the curtain, and see if we can reproduce the magic. More specifically, let's try to reproduce Weaviate's output vector for each object by using an external API.

pulling back the curtains

Matching Weaviate's vectorization

We know that the vector for each object corresponds to its text. What we don't know is how, exactly.

As each object only contains the two properties, question and answer, let's try concatenating our text and comparing it to Weaviate's. And instead of the text2vec-cohere module, we will go straight to the Cohere API.

Concatenating the text from the object:

str_in = ' '.join([i for i in properties.values()])

We see:

'In 1963, live on "The Art Linkletter Show", this company served its billionth burger McDonald\'s'

Then, use the Cohere API to generate the vector like so, where cohere_key is the API key (keep it secret!), and model is the vectorizer.

import cohere
co = cohere.Client(cohere_key)
co_resp = co.embed([str_in], model="embed-multilingual-v2.0")

Then we run a nearVector based query to find the best matching object to this vector:

client.query.get(
    "Question",
    ["question", "answer"]
).with_limit(2).with_near_vector(
    {'vector': co_resp.embeddings[0]}
).with_additional(['distance']).do()

Interestingly, we get a distance of 0.0181 - small, but not zero. In other words - we did something differently to Weaviate!

Let's go through the differences one by one, which hopefully will help you to see Weaviate's default behavior.

First, Weaviate sorts properties alphabetically (a-z) before concatenation. Our initial concatenation had the question text come first, so let's reverse it to:

'McDonald\'s In 1963, live on "The Art Linkletter Show", this company served its billionth burger '

This lowers the distance to 0.0147.

Weaviate adds the class name to the text. So we will prepend the word question producing:

'question McDonald\'s In 1963, live on "The Art Linkletter Show", this company served its billionth burger'

Further lowering the distance to 0.0079.

Then the remaining distance can be eliminated by converting the text to lowercase like so:

str_in = ''
for k in sorted(properties.keys()):
    v = properties[k]
    if type(v) == str:
        str_in += v + ' '
str_in = str_in.lower().strip()  # remove trailing whitespace
str_in = 'question ' + str_in

Producing:

'question mcdonald\'s in 1963, live on "the art linkletter show", this company served its billionth burger'

Performing the nearVector search again results in zero distance (1.788e-07 - effectively zero)!

In other words - we have manually reproduced Weaviate's default vectorization process. It's not overly complex, but knowing it can certainly be helpful. Let's recap exactly what Weaviate does.

Text vectorization in Weaviate

Weaviate generates vector embeddings at the object level (rather than for individual properties). For instance text2vec-* modules can generate vectors from text objects. To produce the string to be vectorized from each object, Weaviate follows the schema configuration for the relevant class.

Unless specified otherwise in the schema, the default behavior is to:

Only vectorize properties that use the text data type (unless skipped)
Sort properties in alphabetical (a-z) order before concatenating values
If vectorizePropertyName is true (false by default) prepend the property name to each property value
Join the (prepended) property values with spaces
Prepend the class name (unless vectorizeClassName is false)
Convert the produced string to lowercase

Now that we understand this, you might be asking - is it possible to customize the vectorization process? The answer is, yes, of course.

Tweaking text2vec vectorization in Weaviate

Some of you might have noticed that we have not done anything at all with the schema so far. This meant that the schema used is one generated by the auto-schema feature and thus the vectorizations were carried out using default options.

The schema is the place to define, among other things, the data type and vectorizer to be used, as well as cross-references between classes. As a corollary, the vectorization process can be modified for each class by setting the relevant schema options. In fact, you can define the data schema for each class individually.

All this means that you can also use the schema to tweak Weaviate's vectorization behavior. The relevant variables for vectorization are dataType and those listed under moduleConfig at both the class level and property level.

At the class level, vectorizeClassName will determine whether the class name is used for vectorization.
At the property level:
- skip will determine whether the property should be skipped (i.e. ignored) in vectorization, and
- vectorizePropertyName will determine whether the property name will be used.
The property dataType determines whether Weaviate will ignore the property, as it will ignore everything but string and text values.

You can read more about each variable in the schema configuration documentation. Let's apply this to our data to set Weaviate's vectorization behavior, then we will confirm it manually using the Cohere API as we did above.

Our new schema is below - note the commented lines:

question_class = {
    "class": "Question",
    "description": "Details of a Jeopardy! question",
    "moduleConfig": {
        "text2vec-cohere": {  # The vectorizer name - must match the vectorizer used
            "vectorizeClassName": False,  # Ignore class name
        },
    },
    "properties": [
        {
            "name": "answer",
            "description": "What the host prompts the contestants with.",
            "dataType": ["string"],
            "moduleConfig": {
                "text2vec-cohere": {
                    "skip": False,  # Do not skip class
                    "vectorizePropertyName": False  # Ignore property name
                }
            }
        },
        {
            "name": "question",
            "description": "What the contestant is to provide.",
            "dataType": ["string"],
            "moduleConfig": {
                "text2vec-cohere": {
                    "skip": False,  # Do not skip class
                    "vectorizePropertyName": True  # Do not ignore property name
                }
            }
        },
    ]
}
client.schema.create_class(question_class)

The schema is defined such that at least some of the options, such as moduleConfig/text2vec-cohere /vectorizeClassName and properties/moduleConfig/text2vec-cohere/vectorizePropertyName differ from their defaults.

And as a result, a nearVector search with the previously-matching Cohere API vector is now at a distance of 0.00395.

To get this back down to zero, we must revise the text generation pipeline to match the schema. Once we've done that, which looks like this:

str_in = ''
for k in sorted(input_props.keys()):
    v = input_props[k]
    if type(v) == str:
        if k == 'question':
            str_in += k + ' '
        str_in += v + ' '
str_in = str_in.lower().strip()

Searching with the vector generated from this input, the closest matching object in Weaviate once again has a distance of zero. We've come full circle 🙂.

Discussions & wrap-up

So there it is. Throughout the above journey, we saw how exactly Weaviate creates vectors from the text data objects, which is:

Vectorize properties that use string or text data types
Sorts properties in alphabetical (a-z) order before concatenating values
Prepends the class name
And converts the whole string to lowercase

And we also saw how this can be tweaked through the schema definition for each class.

One implication of this is that your vectorization requirements are a very important part of considerations in the schema definition. It may determine how you break down related data objects before importing them into Weaviate, as well as which fields you choose to import.

Let's consider again our quiz question corpus as a concrete example. Imagine that we are building a quiz app that allows our users to search for questions. Then, it may be preferable to import each quiz item into two classes, one for the question and one for the answer to avoid giving away the answer based on the user's query. But we may yet use the answer to compare to the user's input.

On the other hand, in some cases, it may be preferable to import text data with many fields as one object. This will allow the user to search for matching meanings without as much consideration given to which field the information is contained in exactly. We would be amiss to not mention other search methods such as BM25F, or hybrid searches, both of which would be affected by these decisions.

Now that you've seen exactly what happens behind the curtains, we encourage you to try applying these concepts yourself the next time you are building something with Weaviate. While the changes to the similarities were somewhat minor in our examples, in some domains and corpora their impact may be certainly larger. And tweaking the exact vectorization scheme may provide that extra boost your Weaviate instance is looking for.

Ready to start building?

Check out the Quickstart tutorial, and begin building amazing apps with the free trial of Weaviate Cloud Services (WCS).

GitHub

Forum

Slack

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Background​

Text2vec: behind the scenes​

Matching Weaviate's vectorization​

Text vectorization in Weaviate​

Tweaking text2vec vectorization in Weaviate​

Discussions & wrap-up​

Ready to start building?​