Populate your Weaviate instance!

Overview

It's time to put what we've learned into action! In this section, we will:

Download a small dataset,
Build a schema corresponding to the dataset, and
Import it to your WCD instance.

Dataset used

We are going to use data from a popular quiz game show called Jeopardy!.

The original dataset can be found on Kaggle, but we'll use a small subset of it containing just 100 rows.

Here's a preview of a few rows of data.

	Air Date	Round	Value	Category	Question	Answer
0	2006-11-08	Double Jeopardy!	800	AMERICAN HISTORY	Abraham Lincoln died across the street from this theatre on April 15, 1865	Ford's Theatre (the Ford Theatre accepted)
1	2005-11-18	Jeopardy!	200	RHYME TIME	Any pigment on the wall so faded you can barely see it	faint paint
2	1987-06-23	Double Jeopardy!	600	AMERICAN HISTORY	After the original 13, this was the 1st state admitted to the union	Vermont

For now, let's keep it simple by populating Weaviate with just the Round, Value, Question and Answer columns.

Exercise

Can you remember what the next steps should be?

Build a schema

The next step is to build a schema, making some decisions about how to represent our data in Weaviate.

Add class names & properties

First of all, we'll need a name. The name refers to each row or item (note: singular), so I called it JeopardyQuestion. Then, I need to define properties and types.

You saw above that we'll be populating Weaviate with the Round, Value, Question and Answer columns. We need names for Weaviate properties. These column names are sensible, but we follow the GraphQL convention of capitalizing class names and starting property names with a lowercase letter, so the names will be round, value, question and answer.

Then, we should select datatypes. All of round, question and answer are text, so we can simply choose text as our datatype. value is a number, but I know that values in Jeopardy! represent dollar amounts, meaning that they are always integers. So we'll use int.

Python

    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],
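
The naming and typing decisions described above can be sketched in plain Python. This is only an illustration: `to_property_name` and `infer_data_type` are hypothetical helpers (not part of the Weaviate client), and the sample row is taken from the preview table.

```python
def to_property_name(column_name: str) -> str:
    """Lowercase the first character, e.g. "Round" -> "round"."""
    return column_name[:1].lower() + column_name[1:]


def infer_data_type(sample_value) -> str:
    """Pick a Weaviate data type name from a sample Python value."""
    # bool is a subclass of int in Python, so it must be checked first
    if isinstance(sample_value, bool):
        return "boolean"
    if isinstance(sample_value, int):
        return "int"
    if isinstance(sample_value, float):
        return "number"
    return "text"


# One row from the preview table, restricted to the columns we import
sample_row = {
    "Round": "Jeopardy!",
    "Value": 200,
    "Question": "Any pigment on the wall so faded you can barely see it",
    "Answer": "faint paint",
}

properties = [
    {"name": to_property_name(column), "dataType": [infer_data_type(value)]}
    for column, value in sample_row.items()
]
```

In practice it is usually clearer to write the property list out by hand, as in this section, so that each data type is a deliberate choice rather than a guess from one sample.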

Set & configure the vectorizer

For this example, we will obtain our object vectors using an inference service. To do that, we must set the vectorizer for the class. We'll use text2vec-openai in this case, and we can also configure the module at the class level.

Python

    "vectorizer": "text2vec-openai",

Skipping a property from vectorization

You might have noticed the property-level module configuration here:

Python

            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },

This configuration will exclude the round property from the vectorized text. You might be asking - why might we choose to do this?

Well, the answer is that whether the question belonged to the "Jeopardy!" or "Double Jeopardy!" round simply does not add much to its meaning. You know by now that the vectorizer creates a vector representation of the object. In the case of a text object, Weaviate first combines the text data according to an internal set of rules and your configuration.

It is the combined text that is vectorized. So, the difference between vectorizing the round property and skipping it would be something like this:

// If the property is vectorized
answer {answer_text} question {question_text} round {round_text}

Against:

// If the property is skipped
answer {answer_text} question {question_text}

More specifically, something like the difference between:

// If the property is vectorized
answer faint paint question any pigment on the wall so faded you can barely see it round jeopardy!

Against:

// If the property is skipped
answer faint paint question any pigment on the wall so faded you can barely see it

The additional information is not particularly significant in capturing the meaning of the quiz item, which is mainly in the question and answer, as well as perhaps the category (not yet used).
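
To make the effect of skipping a property concrete, here is a rough sketch that builds a combined string in the same spirit as the illustration above. It is an approximation for teaching purposes only: `build_vectorizer_text` is a hypothetical helper, and the rules it applies (alphabetical property order, lowercasing, text properties only) are assumptions rather than Weaviate's actual internal logic.

```python
def build_vectorizer_text(obj: dict, skipped: tuple = ()) -> str:
    """Join text properties into one lowercase string, omitting skipped ones."""
    parts = []
    for name in sorted(obj):
        if name in skipped:
            continue  # mimics "skip": True on the property
        value = obj[name]
        if not isinstance(value, str):
            continue  # only text values are included in this sketch
        parts.append(f"{name} {value}".lower())
    return " ".join(parts)


quiz_item = {
    "round": "Jeopardy!",
    "value": 200,
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
}

text_with_round = build_vectorizer_text(quiz_item)
text_without_round = build_vectorizer_text(quiz_item, skipped=("round",))
```

Printing the two strings side by side shows that the only difference is the trailing round text, which says little about what the quiz item is actually about.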

Skipping vectorization has no impact on filtering

Importantly, excluding the round column from vectorization will have no impact on our ability to filter the results based on the round value. So if you wanted to only search a set of Double Jeopardy! questions, you still can.
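
As a sketch of what such a filtered search could look like with the same Python client style used in this section: the filter below restricts results by the round property, while the vector search runs on the question and answer text. The function name `search_double_jeopardy` and its parameters are made up for illustration, and the query is only defined here, not executed.

```python
# Filter on the `round` property, even though it is skipped for vectorization
round_filter = {
    "path": ["round"],
    "operator": "Equal",
    "valueText": "Double Jeopardy!",
}


def search_double_jeopardy(client, concept: str, limit: int = 3) -> dict:
    """Run a vector search restricted to Double Jeopardy! questions.

    `client` is assumed to be a connected weaviate.Client instance.
    """
    return (
        client.query
        .get("JeopardyQuestion", ["question", "answer", "round"])
        .with_near_text({"concepts": [concept]})
        .with_where(round_filter)
        .with_limit(limit)
        .do()
    )
```

You could call it as `search_double_jeopardy(client, "american history")` once the data has been imported.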

Create the class

We can now add the class to the schema.

Python

class_obj = {
    # Class & property definitions
    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        }
    },
}
# End class definition

client.schema.create_class(class_obj)

Now, you can check that the class has been created successfully by retrieving its schema:

Python

client.schema.get("JeopardyQuestion")

See the full schema response

{
  "class": "JeopardyQuestion",
  "invertedIndexConfig": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    },
    "cleanupIntervalSeconds": 60,
    "stopwords": {
      "additions": null,
      "preset": "en",
      "removals": null
    }
  },
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text",
      "vectorizeClassName": false
    }
  },
  "properties": [
    {
      "dataType": [
        "text"
      ],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": true,
          "vectorizePropertyName": false
        }
      },
      "name": "round",
      "tokenization": "word"
    },
    {
      "dataType": [
        "int"
      ],
      "indexFilterable": true,
      "indexSearchable": false,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "value"
    },
    {
      "dataType": [
        "text"
      ],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "question",
      "tokenization": "word"
    },
    {
      "dataType": [
        "text"
      ],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "answer",
      "tokenization": "word"
    }
  ],
  "replicationConfig": {
    "factor": 1
  },
  "shardingConfig": {
    "virtualPerPhysical": 128,
    "desiredCount": 1,
    "actualCount": 1,
    "desiredVirtualCount": 128,
    "actualVirtualCount": 128,
    "key": "_id",
    "strategy": "hash",
    "function": "murmur3"
  },
  "vectorIndexConfig": {
    "skip": false,
    "cleanupIntervalSeconds": 300,
    "maxConnections": 32,
    "efConstruction": 128,
    "ef": -1,
    "dynamicEfMin": 100,
    "dynamicEfMax": 500,
    "dynamicEfFactor": 8,
    "vectorCacheMaxObjects": 1000000000000,
    "flatSearchCutoff": 40000,
    "distance": "cosine",
    "pq": {
      "enabled": false,
      "segments": 0,
      "centroids": 256,
      "encoder": {
        "type": "kmeans",
        "distribution": "log-normal"
      }
    }
  },
  "vectorIndexType": "hnsw",
  "vectorizer": "text2vec-openai"
}

The retrieved schema is even longer!

Although we've defined a lot of details here, the retrieved schema is still longer. The additional details relate to the vector index, the inverted index, sharding and tokenization. We'll cover many of those as we go.

If you see a schema that is close to the example response - awesome! You're ready to import the data.
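
If you would rather check this programmatically than by eye, a small comparison like the one below may help. It is a sketch under assumptions: `summarize_properties` and `schema_matches` are hypothetical helpers, and `sample_response` is a trimmed-down stand-in for the dictionary returned by the schema retrieval call.

```python
EXPECTED_PROPERTIES = {
    "round": "text",
    "value": "int",
    "question": "text",
    "answer": "text",
}


def summarize_properties(class_schema: dict) -> dict:
    """Map each property name to its first declared data type."""
    return {prop["name"]: prop["dataType"][0] for prop in class_schema["properties"]}


def schema_matches(class_schema: dict) -> bool:
    """Check the class name, vectorizer and property types we defined."""
    return (
        class_schema.get("class") == "JeopardyQuestion"
        and class_schema.get("vectorizer") == "text2vec-openai"
        and summarize_properties(class_schema) == EXPECTED_PROPERTIES
    )


# Trimmed stand-in for the retrieved schema (most keys omitted)
sample_response = {
    "class": "JeopardyQuestion",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "round", "dataType": ["text"]},
        {"name": "value", "dataType": ["int"]},
        {"name": "question", "dataType": ["text"]},
        {"name": "answer", "dataType": ["text"]},
    ],
}

is_ready = schema_matches(sample_response)
```

With a live instance, you would pass the retrieved schema dictionary to `schema_matches` instead of the sample.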

Import data

Here, we'll show you how to import the requisite data, including how to configure and use a batch.

Load data

We've made the data available online - so, fetch and load it like so:

import requests
import json
url = 'https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/jeopardy_100.json'
resp = requests.get(url)
data = json.loads(resp.text)
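
Before importing, it can be worth a quick sanity check that every row has the columns we plan to use. The sketch below assumes that each row is a dictionary keyed by the column names from the preview table and that Value is stored as an integer; `find_invalid_rows` is a hypothetical helper, and the sample rows stand in for the downloaded data.

```python
REQUIRED_COLUMNS = ("Round", "Value", "Question", "Answer")


def find_invalid_rows(rows: list) -> list:
    """Return the indexes of rows with missing columns or a non-integer Value."""
    invalid = []
    for index, row in enumerate(rows):
        missing = [column for column in REQUIRED_COLUMNS if column not in row]
        if missing or not isinstance(row.get("Value"), int):
            invalid.append(index)
    return invalid


# Stand-in rows: the first is from the preview table, the second is deliberately broken
sample_rows = [
    {
        "Round": "Double Jeopardy!",
        "Value": 600,
        "Question": "After the original 13, this was the 1st state admitted to the union",
        "Answer": "Vermont",
    },
    {"Round": "Jeopardy!", "Question": "A row with no Value or Answer"},
]

invalid_indexes = find_invalid_rows(sample_rows)
```

Running `find_invalid_rows(data)` on the loaded dataset and getting an empty list back would suggest the rows are in the shape the import code expects.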

Configure batch and import data

And let's set up a batch import process. As mentioned earlier, the batch import process in Weaviate can send data in bulk and in parallel.

In Python, we recommend that you use a context manager like:

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:

Note the use of the parameters batch_size and num_workers. They specify the number of objects sent per batch, and how many workers are used to send batches in parallel.
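
To get a feel for what batch_size means for this dataset, here is a small arithmetic sketch. `count_batches` is a hypothetical helper, and it assumes that any objects left over when the context manager exits are sent as a final, smaller batch.

```python
import math


def count_batches(num_objects: int, batch_size: int) -> int:
    """Number of requests needed if objects are sent `batch_size` at a time."""
    return math.ceil(num_objects / batch_size)


# Our dataset has 100 rows, so a batch size of 200 is never filled
batches_for_dataset = count_batches(100, 200)

# A smaller batch size would split the same data into several requests
batches_with_small_size = count_batches(100, 30)
```

Under that assumption, all 100 objects here would go out in a single request, so num_workers would make little difference for a dataset this small. Both parameters matter more as the number of objects grows.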

Then, the next step is to build data objects & add them to the batch process. We build objects (as Python dictionaries) by passing data from corresponding columns to the right Weaviate property, and the client will take care of when to send them.

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    # Build data objects & add to batch
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion"
        )

Then, let's check that we've got the right number of objects imported:

assert client.query.aggregate("JeopardyQuestion").with_meta_count().do()["data"]["Aggregate"]["JeopardyQuestion"][0]["meta"]["count"] == 100

If this assertion passes without raising an AssertionError, you've successfully populated your Weaviate instance!
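
The one-line check above is compact but hard to read. As a sketch, the same lookup can be split into two small functions; `extract_object_count` and `count_objects` are hypothetical names, and `sample_response` mirrors the nested shape that the assertion indexes into.

```python
def extract_object_count(response: dict, class_name: str) -> int:
    """Pull the meta count out of an Aggregate query response."""
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]


def count_objects(client, class_name: str = "JeopardyQuestion") -> int:
    """Query the object count; `client` is assumed to be a connected weaviate.Client."""
    response = client.query.aggregate(class_name).with_meta_count().do()
    return extract_object_count(response, class_name)


# Stand-in for a response with the same nesting as in the assertion
sample_response = {
    "data": {"Aggregate": {"JeopardyQuestion": [{"meta": {"count": 100}}]}}
}

object_count = extract_object_count(sample_response, "JeopardyQuestion")
```

With a live instance, `count_objects(client)` should return 100 after a single successful import.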

What happens if this runs again?

Before we go on, I have a question. What do you think will happen if you run the above import script again?

The answer is...

That you will end up with duplicate items!

Weaviate does not check if you are uploading items with the same properties as ones that exist already. And since the import script did not provide an ID, Weaviate will simply assign a new, random ID, and create new objects.

Specify object UUID

You could specify an object UUID at import time to serve as the object identifier. The Weaviate Python client, for example, provides a function to create a deterministic UUID based on an object. So, it could be added to our import script as shown below:

from weaviate.util import generate_uuid5

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )

This creates objects whose UUID is based on the object properties. Accordingly, if the object properties remain the same, so will the UUID.

Running the above script multiple times will not cause the number of objects to increase.
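
The idea can be demonstrated with Python's standard library alone. In this sketch, `deterministic_uuid` is a hypothetical stand-in for the client's helper: it illustrates the principle of a name-based (version 5) UUID and is not guaranteed to produce the same values as the client function.

```python
import uuid


def deterministic_uuid(obj: dict) -> str:
    """Derive a version 5 UUID from the object's sorted properties."""
    identity = repr(sorted(obj.items()))
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, identity))


question_object = {
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
    "value": 200,
    "round": "Jeopardy!",
}

# The same properties always give the same ID, run after run
first_run_id = deterministic_uuid(question_object)
second_run_id = deterministic_uuid(dict(question_object))

# A random (version 4) UUID differs every time, which is what leads to duplicates
random_id_a = str(uuid.uuid4())
random_id_b = str(uuid.uuid4())
```

Because the first and second run produce the same ID, a repeated import targets the existing object instead of creating a new one.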

What is your desired behavior?

Because the UUID is based on the object properties, it will still create new objects in case some property has changed. So, when you design your import process, consider what properties might change, and how you would want Weaviate to behave in these scenarios.

You could then, for instance, create the UUID from a subset of unique properties so that existing objects are overwritten, or create it from the entire set of properties so that only exact duplicates are prevented.
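
The trade-off between the two designs can be sketched as follows. `uuid_from_properties` is a hypothetical helper built on the standard library, and treating the question and answer text as the identifying subset is an assumption made for this example.

```python
import uuid


def uuid_from_properties(obj: dict, keys) -> str:
    """Derive a version 5 UUID from the chosen subset of properties."""
    identity = {key: obj[key] for key in sorted(keys)}
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, repr(identity)))


original = {
    "question": "After the original 13, this was the 1st state admitted to the union",
    "answer": "Vermont",
    "value": 600,
    "round": "Double Jeopardy!",
}
# The same quiz item after a correction to its dollar value
corrected = dict(original, value=800)

identifying_keys = ("question", "answer")

# Subset-based IDs match, so the corrected item would overwrite the original
same_item = (
    uuid_from_properties(original, identifying_keys)
    == uuid_from_properties(corrected, identifying_keys)
)

# Full-property IDs differ, so the corrected item would become a new object
new_item = (
    uuid_from_properties(original, original.keys())
    != uuid_from_properties(corrected, corrected.keys())
)
```

Which behavior is right depends on your data: the subset approach suits records that get corrected over time, while the full-property approach suits data where every distinct row should be kept.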

Full import script

Putting it all together, we get the following import script:

# ===== Instantiate Weaviate client w/ auth config =====
import weaviate
from weaviate.util import generate_uuid5
import requests
import json

client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL",  # Replace with your Weaviate endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace with your Weaviate instance API key. Delete if authentication is disabled.
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
)

# Define the class
class_obj = {
    # Class & property definitions
    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        }
    },
}
# End class definition

client.schema.create_class(class_obj)
# Finished creating the class

url = 'https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/jeopardy_100.json'
resp = requests.get(url)
data = json.loads(resp.text)

# Context manager for batch import
with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    # Build data objects & add to batch
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )
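
One thing to consider if you run the full script more than once: the class creation step is likely to be rejected when the class already exists. A possible guard is sketched below. `class_exists` and `create_class_if_missing` are hypothetical helpers, and the sketch assumes that retrieving the full schema returns a dictionary with a "classes" list.

```python
def class_exists(client, class_name: str) -> bool:
    """Check the retrieved schema for a class with the given name."""
    schema = client.schema.get()
    return any(cls["class"] == class_name for cls in schema.get("classes", []))


def create_class_if_missing(client, class_obj: dict) -> bool:
    """Create the class only if it is not already defined.

    Returns True if the class was created, False if it already existed.
    `client` is assumed to be a connected weaviate.Client instance.
    """
    if class_exists(client, class_obj["class"]):
        return False
    client.schema.create_class(class_obj)
    return True
```

In the script, you could then call `create_class_if_missing(client, class_obj)` in place of the direct class creation call.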

Review

Key takeaways

We have:

Downloaded a small dataset of Jeopardy! questions and answers.
Built a schema and imported our data.
Verified the successful import by checking the object count in Weaviate.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.
