
Populate your Weaviate instance!

Overview


It's time to put what we've learned into action! In this section, we will:

  • Download a small dataset,
  • Build a schema corresponding to the dataset, and
  • Import it to our WCS instance.

Dataset used

We are going to use data from a popular quiz game show called Jeopardy!.

The original dataset can be found here on Kaggle, but we'll use a small subset of it, containing just 100 rows.

Here's a preview of a few rows of data.

  | Air Date   | Round            | Value | Category         | Question                                                                   | Answer
0 | 2006-11-08 | Double Jeopardy! | 800   | AMERICAN HISTORY | Abraham Lincoln died across the street from this theatre on April 15, 1865 | Ford's Theatre (the Ford Theatre accepted)
1 | 2005-11-18 | Jeopardy!        | 200   | RHYME TIME       | Any pigment on the wall so faded you can barely see it                     | faint paint
2 | 1987-06-23 | Double Jeopardy! | 600   | AMERICAN HISTORY | After the original 13, this was the 1st state admitted to the union        | Vermont

For now, let's keep it simple by populating Weaviate with just the Round, Value, Question and Answer columns.

Exercise

Can you remember what the next steps should be?

Build a schema

The next step is to build a schema, making some decisions about how to represent our data in Weaviate.

Add class names & properties

First, the class needs a name. The name refers to a single row or item (note: singular), so I called it JeopardyQuestion. Next, we need to define its properties and their types.

You saw above that we'll be populating Weaviate with the Round, Value, Question and Answer columns. We need names for the corresponding Weaviate properties. The column names are sensible, but we follow the GraphQL convention of capitalizing class names and keeping property names in lowercase, so the names will be round, value, question and answer.

Then, we should select datatypes. All of round, question and answer are text, so we can simply choose text as our datatype. value is a number, but I know that values in Jeopardy! represent dollar amounts, meaning that they are always integers. So we'll use int.

    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],
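As a quick sanity check of these datatype choices, the sketch below maps each property to a Python type and tests a sample row against it. The `EXPECTED_TYPES` mapping and the `check_row` helper are hypothetical names used for illustration only; they are not part of the Weaviate client.

```python
# Hypothetical mapping from our Weaviate property names to Python types.
EXPECTED_TYPES = {
    "round": str,     # text
    "value": int,     # int
    "question": str,  # text
    "answer": str,    # text
}


def check_row(row: dict) -> list:
    """Return the names of properties whose values do not match the expected type."""
    mismatches = []
    for name, expected in EXPECTED_TYPES.items():
        value = row.get(name)
        # bool is a subclass of int in Python, so reject booleans explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            mismatches.append(name)
    return mismatches


sample = {
    "round": "Jeopardy!",
    "value": 200,
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
}

print(check_row(sample))                       # no mismatches expected
print(check_row({**sample, "value": "200"}))   # `value` arrives as a string
```

A check like this can catch a value that was stored as a string in the source file before it reaches an int property.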

Set & configure the vectorizer

For this example, we will obtain our object vectors using an inference service. To do that, we must set the vectorizer for the class. We'll use text2vec-openai in this case, and we can also configure the module at the class level.

    "vectorizer": "text2vec-openai",

Skipping a property from vectorization

You might have noticed the property-level module configuration here:

    "moduleConfig": {
        "text2vec-openai": {
            "skip": True,
        }
    },

This configuration will exclude the round property from the vectorized text. You might be asking - why might we choose to do this?

Well, the answer is that whether a question belonged to the "Jeopardy!" or the "Double Jeopardy!" round simply does not add much to its meaning. You know by now that the vectorizer creates a vector representation of the object. In the case of a text object, Weaviate first combines the text data according to an internal set of rules and your configuration.

It is the combined text that is vectorized. So, the difference between vectorizing the round property and skipping it would be something like this:

// If the property is vectorized
answer {answer_text} question {question_text} round {round_text}

Against:

// If the property is skipped
answer {answer_text} question {question_text}

More specifically, something like the difference between:

// If the property is vectorized
answer faint paint question any pigment on the wall so faded you can barely see it round jeopardy!

Against:

// If the property is skipped
answer faint paint question any pigment on the wall so faded you can barely see it

The additional information is not particularly significant in capturing the meaning of the quiz item, which is mainly in the question and answer, as well as perhaps the category (not yet used).
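To make the effect of skipping a property concrete, here is a rough sketch that builds a combined string from an object. It is only an approximation for illustration, not Weaviate's actual implementation: the `combined_text` helper is hypothetical, and it assumes that property names and values are lowercased and joined in alphabetical order. The real result also depends on settings such as `vectorizePropertyName`.

```python
def combined_text(obj: dict, skipped: set) -> str:
    """Join property names and values in alphabetical order, lowercased,
    leaving out any property listed in `skipped`."""
    parts = []
    for name in sorted(obj):
        if name in skipped:
            continue
        parts.append(f"{name} {obj[name]}".lower())
    return " ".join(parts)


item = {
    "round": "Jeopardy!",
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
}

with_round = combined_text(item, skipped=set())
without_round = combined_text(item, skipped={"round"})

print(with_round)
print(without_round)
```

The only difference between the two strings is the trailing round text, which is the part that skipping removes from the vectorizer's input.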

Skipping vectorization has no impact on filtering

Importantly, excluding the round column from vectorization will have no impact on our ability to filter the results based on the round value. So if you wanted to only search a set of Double Jeopardy! questions, you still can.
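As an illustration, a filtered search might look like the sketch below. It assumes a connected v3 Python client is passed in as `client`, and the `search_double_jeopardy` helper name and the properties requested are choices made for this example.

```python
# Filter on the `round` property; `valueText` matches its `text` datatype.
where_filter = {
    "path": ["round"],
    "operator": "Equal",
    "valueText": "Double Jeopardy!",
}


def search_double_jeopardy(client, concept: str, limit: int = 3) -> dict:
    """Run a vector search restricted to Double Jeopardy! questions.

    `client` is assumed to be a connected `weaviate.Client` (v3 Python client).
    """
    return (
        client.query
        .get("JeopardyQuestion", ["question", "answer", "round"])
        .with_near_text({"concepts": [concept]})
        .with_where(where_filter)
        .with_limit(limit)
        .do()
    )
```

For example, `search_double_jeopardy(client, "american history")` would apply the round filter alongside the vector search, even though round itself was never vectorized.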

Create the class

We can now add the class to the schema.

class_obj = {
    # Class & property definitions
    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        }
    },
}
# End class definition

client.schema.create_class(class_obj)

Now, you can check that the class has been created successfully by retrieving its schema:

client.schema.get("JeopardyQuestion")
See the full schema response
{
  "class": "JeopardyQuestion",
  "invertedIndexConfig": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    },
    "cleanupIntervalSeconds": 60,
    "stopwords": {
      "additions": null,
      "preset": "en",
      "removals": null
    }
  },
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text",
      "vectorizeClassName": false
    }
  },
  "properties": [
    {
      "dataType": [
        "text"
      ],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": true,
          "vectorizePropertyName": false
        }
      },
      "name": "round",
      "tokenization": "word"
    },
    {
      "dataType": [
        "int"
      ],
      "indexFilterable": true,
      "indexSearchable": false,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "value"
    },
    {
      "dataType": [
        "text"
      ],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "question",
      "tokenization": "word"
    },
    {
      "dataType": [
        "text"
      ],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "answer",
      "tokenization": "word"
    }
  ],
  "replicationConfig": {
    "factor": 1
  },
  "shardingConfig": {
    "virtualPerPhysical": 128,
    "desiredCount": 1,
    "actualCount": 1,
    "desiredVirtualCount": 128,
    "actualVirtualCount": 128,
    "key": "_id",
    "strategy": "hash",
    "function": "murmur3"
  },
  "vectorIndexConfig": {
    "skip": false,
    "cleanupIntervalSeconds": 300,
    "maxConnections": 64,
    "efConstruction": 128,
    "ef": -1,
    "dynamicEfMin": 100,
    "dynamicEfMax": 500,
    "dynamicEfFactor": 8,
    "vectorCacheMaxObjects": 1000000000000,
    "flatSearchCutoff": 40000,
    "distance": "cosine",
    "pq": {
      "enabled": false,
      "bitCompression": false,
      "segments": 0,
      "centroids": 256,
      "encoder": {
        "type": "kmeans",
        "distribution": "log-normal"
      }
    }
  },
  "vectorIndexType": "hnsw",
  "vectorizer": "text2vec-openai"
}
The retrieved schema is even longer!

Although we've defined a lot of details here, the retrieved schema is still longer. The additional details relate to the vector index, the inverted index, sharding and tokenization. We'll cover many of those as we go.

If you see a schema that is close to the example response - awesome! You're ready to import the data.
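Rather than comparing the full response by eye, you could reduce it to the parts we defined ourselves. The sketch below is one way to do that. The `property_summary` helper is a hypothetical name, and `retrieved` is a trimmed stand-in for the dictionary that the schema retrieval returns.

```python
def property_summary(schema: dict) -> dict:
    """Map each property name to its first datatype, e.g. {"value": "int"}."""
    return {p["name"]: p["dataType"][0] for p in schema.get("properties", [])}


# Trimmed stand-in for the retrieved schema of the JeopardyQuestion class.
retrieved = {
    "class": "JeopardyQuestion",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "round", "dataType": ["text"]},
        {"name": "value", "dataType": ["int"]},
        {"name": "question", "dataType": ["text"]},
        {"name": "answer", "dataType": ["text"]},
    ],
}

expected = {"round": "text", "value": "int", "question": "text", "answer": "text"}

summary = property_summary(retrieved)
print(summary == expected)
print(retrieved["vectorizer"])
```

If the summary matches and the vectorizer is the one you set, the remaining fields in the response are the additional details that Weaviate filled in.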

Import data

Here, we'll show you how to import the requisite data, including how to configure and use a batch.

Load data

We've made the data available online - so, fetch and load it like so:

import requests
import json
url = 'https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/jeopardy_100.json'
resp = requests.get(url)
data = json.loads(resp.text)
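Before importing, it can help to confirm that every row has the columns we plan to use. The following is a minimal sketch of such a check. The `parse_rows` helper and `REQUIRED_KEYS` are hypothetical names, and the inline sample stands in for the downloaded text so that the example runs without network access.

```python
import json

# Columns that the import step reads from each row.
REQUIRED_KEYS = {"Round", "Value", "Question", "Answer"}


def parse_rows(raw_text: str) -> list:
    """Parse a JSON array of rows and check that each has the columns we import."""
    rows = json.loads(raw_text)
    for i, row in enumerate(rows):
        missing = REQUIRED_KEYS - set(row)
        if missing:
            raise ValueError(f"Row {i} is missing columns: {sorted(missing)}")
    return rows


# Small inline sample standing in for the text of the HTTP response.
sample_text = json.dumps([
    {
        "Air Date": "2005-11-18",
        "Round": "Jeopardy!",
        "Value": 200,
        "Category": "RHYME TIME",
        "Question": "Any pigment on the wall so faded you can barely see it",
        "Answer": "faint paint",
    }
])

rows = parse_rows(sample_text)
print(len(rows))
```

With the real download, you would pass the response text to the same function in place of `sample_text`.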

Configure batch and import data

And let's set up a batch import process. As mentioned earlier, the batch import process in Weaviate can send data in bulk and in parallel.

In Python, we recommend that you use a context manager like:

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:

Note the batch_size and num_workers parameters. They specify the number of objects sent per batch and the number of workers used to send batches in parallel.
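To get a feel for how the batch size relates to the number of requests, here is a small back-of-the-envelope sketch. It assumes the client sends a request each time `batch_size` objects have accumulated and flushes any remainder when the `with` block exits; the `count_batches` helper is a hypothetical name for illustration.

```python
import math


def count_batches(num_objects: int, batch_size: int) -> int:
    """Estimate how many batch requests are needed to send `num_objects`."""
    return math.ceil(num_objects / batch_size)


# Our dataset has 100 rows.
print(count_batches(100, 200))  # the whole dataset fits in a single batch
print(count_batches(100, 30))   # a smaller batch size means more requests
```

Under that assumption, a batch size of 200 sends all 100 rows in one request, so parallel workers only start to matter with larger datasets or smaller batches.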

Then, the next step is to build data objects & add them to the batch process. We build objects (as Python dictionaries) by passing data from corresponding columns to the right Weaviate property, and the client will take care of when to send them.

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    # Build data objects & add to batch
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion"
        )

Then, let's check that we've got the right number of objects imported:

assert client.query.aggregate("JeopardyQuestion").with_meta_count().do()["data"]["Aggregate"]["JeopardyQuestion"][0]["meta"]["count"] == 100

If this assertion passes without raising an AssertionError, you've successfully populated your Weaviate instance!
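The one-line assertion is compact but hard to read. A small helper can make the path into the response explicit, as in this sketch. The `object_count` name is hypothetical, and `sample_response` is a stand-in with the same shape that the assertion indexes into.

```python
def object_count(response: dict, class_name: str) -> int:
    """Extract the object count for `class_name` from an aggregate query response."""
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]


# Stand-in for the dictionary returned by the aggregate query.
sample_response = {
    "data": {
        "Aggregate": {
            "JeopardyQuestion": [
                {"meta": {"count": 100}}
            ]
        }
    }
}

count = object_count(sample_response, "JeopardyQuestion")
print(count)
```

With a live instance, you would pass the result of the aggregate query to `object_count` and compare the returned number with 100.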

What happens if this runs again?

Before we go on, I have a question. What do you think will happen if you run the above import script again?

The answer is...

That you will end up with duplicate items!


Weaviate does not check if you are uploading items with the same properties as ones that exist already. And since the import script did not provide an ID, Weaviate will simply assign a new, random ID, and create new objects.

Specify object UUID

You could specify an object UUID at import time to serve as the object identifier. The Weaviate Python client, for example, provides a function to create a deterministic UUID based on an object. So, it could be added to our import script as shown below:

from weaviate.util import generate_uuid5

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )

This creates objects whose UUID is based on the object properties. Accordingly, if the object properties remain the same, so will the UUID.

Running the above script multiple times will not cause the number of objects to increase.
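The idea of a deterministic UUID can be illustrated with the Python standard library alone. The sketch below uses `uuid.uuid5` as a stand-in for the client's helper, so the exact UUID values may differ from what the Weaviate client produces; the `deterministic_id` name is hypothetical.

```python
import json
import uuid


def deterministic_id(obj: dict) -> str:
    """Derive a name-based (version 5) UUID from an object's contents."""
    # Sorting the keys makes the result independent of dictionary ordering.
    serialized = json.dumps(obj, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, serialized))


question_object = {
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
    "value": 200,
    "round": "Jeopardy!",
}

first_run = deterministic_id(question_object)
second_run = deterministic_id(dict(question_object))

print(first_run == second_run)
```

Because identical content always yields the same identifier, a repeated import targets the existing object instead of creating a new one.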

What is your desired behavior?

Because the UUID is based on the object properties, Weaviate will still create a new object if any property has changed. So, when you design your import process, consider which properties might change, and how you would want Weaviate to behave in those scenarios.


You could, for instance, generate the UUID from a subset of properties that uniquely identify an item, so that a changed object overwrites the existing one. Alternatively, you could generate the UUID from the entire set of properties, which only prevents exact duplicates.
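The two designs can be compared side by side. The sketch below again uses the standard library's `uuid.uuid5` as a stand-in, so the values are illustrative only. The `uuid_from_keys` helper is hypothetical, and treating the question text as the unique part of an item is an assumption made for this example.

```python
import json
import uuid


def uuid_from_keys(obj: dict, keys: list) -> str:
    """Derive a version 5 UUID from only the listed properties of `obj`."""
    subset = {key: obj[key] for key in keys}
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, json.dumps(subset, sort_keys=True)))


original = {
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
    "value": 200,
    "round": "Jeopardy!",
}
# The same item after a correction to its answer text.
corrected = {**original, "answer": "Faint paint"}

# Keyed on the question only: both versions share one UUID.
same_id = uuid_from_keys(original, ["question"]) == uuid_from_keys(corrected, ["question"])

# Keyed on every property: the corrected version gets a new UUID.
all_keys = list(original)
new_id = uuid_from_keys(original, all_keys) != uuid_from_keys(corrected, all_keys)

print(same_id, new_id)
```

With the Weaviate client, the equivalent choice would be which dictionary you pass to the UUID helper: the full object, or only its identifying properties.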

Full import script

Putting it all together, we get the following import script:

# ===== Instantiate Weaviate client w/ auth config =====
import weaviate
from weaviate.util import generate_uuid5
import requests
import json

client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL",  # Replace with your Weaviate endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace with your Weaviate instance API key. Delete if authentication is disabled.
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
)

# Define the class
class_obj = {
    # Class & property definitions
    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        }
    },
}
# End class definition

client.schema.create_class(class_obj)
# Finished creating the class

url = 'https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/jeopardy_100.json'
resp = requests.get(url)
data = json.loads(resp.text)

# Context manager for batch import
with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    # Build data objects & add to batch
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )

Review

Key takeaways

We have:

  • Downloaded a small dataset of Jeopardy! questions and answers.
  • Built a schema and imported our data.
  • Verified the successful import by checking the object count in Weaviate.

Questions and feedback

If you have any questions or feedback, please let us know on our forum. For example, you can: