Populate your Weaviate instance!
Overview
It's time to put what we've learned into action! In this section, we will:
- Download a small dataset,
- Build a schema corresponding to the dataset, and
- Import it to your WCD instance.
Dataset used
We are going to use data from a popular quiz game show called Jeopardy!.
The original dataset is available on Kaggle, but we'll use a small subset of it containing just 100 rows.
Here's a preview of a few rows of data.
 | Air Date | Round | Value | Category | Question | Answer |
---|---|---|---|---|---|---|
0 | 2006-11-08 | Double Jeopardy! | 800 | AMERICAN HISTORY | Abraham Lincoln died across the street from this theatre on April 15, 1865 | Ford's Theatre (the Ford Theatre accepted) |
1 | 2005-11-18 | Jeopardy! | 200 | RHYME TIME | Any pigment on the wall so faded you can barely see it | faint paint |
2 | 1987-06-23 | Double Jeopardy! | 600 | AMERICAN HISTORY | After the original 13, this was the 1st state admitted to the union | Vermont |
For now, let's keep it simple by populating Weaviate with just the Round, Value, Question and Answer columns.
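As a minimal sketch of that selection, the snippet below keeps only the four columns from a single row. The sample values are copied from the preview table. The helper name `select_columns` and the dictionary layout of a row are assumptions made for illustration.

```python
# Columns we plan to import (taken from the text above).
COLUMNS_TO_KEEP = ["Round", "Value", "Question", "Answer"]


def select_columns(row: dict) -> dict:
    """Return a copy of `row` containing only the columns we plan to import."""
    return {column: row[column] for column in COLUMNS_TO_KEEP}


# One row from the preview table, written as a Python dictionary (assumed layout).
sample_row = {
    "Air Date": "2005-11-18",
    "Round": "Jeopardy!",
    "Value": 200,
    "Category": "RHYME TIME",
    "Question": "Any pigment on the wall so faded you can barely see it",
    "Answer": "faint paint",
}

selected = select_columns(sample_row)
print(selected)
```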
Can you remember what the next steps should be?
Build a schema
The next step is to build a schema, making some decisions about how to represent our data in Weaviate.
Add class names & properties
First of all, we'll need a name. The name refers to each row or item (note: singular), so I called it JeopardyQuestion. Then, I need to define properties and types.
You saw above that we'll be populating Weaviate with the Round, Value, Question and Answer columns. We need names for the Weaviate properties. These column names are sensible, but we follow the GraphQL convention of capitalizing class names and starting property names with a lowercase letter, so the names will be round, value, question and answer.
Then, we should select datatypes. All of round, question and answer are text, so we can simply choose text as our datatype. value is a number, but I know that values in Jeopardy! represent dollar amounts, meaning that they are always integers. So we'll use int.
    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],
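The naming and typing decisions above can also be expressed as a small mapping. The following is a sketch only: the helper `to_property_name` and the `COLUMN_TYPES` dictionary are hypothetical names introduced here, and the output covers just the name and datatype of each property, without the module configuration.

```python
def to_property_name(column_name: str) -> str:
    """Lowercase the first character of a column name, e.g. 'Round' -> 'round'."""
    return column_name[0].lower() + column_name[1:]


# Datatypes chosen in the text above, keyed by the original column name.
COLUMN_TYPES = {
    "Round": "text",
    "Value": "int",
    "Question": "text",
    "Answer": "text",
}

# Build property definitions in the same shape as the class definition uses.
properties = [
    {"name": to_property_name(column), "dataType": [data_type]}
    for column, data_type in COLUMN_TYPES.items()
]

for prop in properties:
    print(prop)
```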
Set & configure the vectorizer
For this example, we will obtain our object vectors using an inference service. To do that, we must set the vectorizer for the class. We'll use text2vec-openai in this case, and we can also configure the module at the class level.
"vectorizer": "text2vec-openai",
Skipping a property from vectorization
You might have noticed the property-level module configuration here:
"moduleConfig": {
    "text2vec-openai": {
        "skip": True,
    }
},
This configuration will exclude the round property from the vectorized text. You might be asking - why might we choose to do this?
Well, the answer is that whether a question belonged to the "Jeopardy!" or the "Double Jeopardy!" round simply does not add much to its meaning. You know by now that the vectorizer creates a vector representation of the object. In the case of a text object, Weaviate first combines the text data according to an internal set of rules and your configuration.
It is the combined text that is vectorized. So, the difference between vectorizing the round property and skipping it would be something like this:
// If the property is vectorized
answer {answer_text} question {question_text} round {round_text}
Against:
// If the property is skipped
answer {answer_text} question {question_text}
More specifically, something like the difference between:
// If the property is vectorized
answer faint paint question any pigment on the wall so faded you can barely see it round jeopardy!
Against:
// If the property is skipped
answer faint paint question any pigment on the wall so faded you can barely see it
The additional information is not particularly significant in capturing the meaning of the quiz item, which is mainly in the question and answer, as well as perhaps the category (not yet used).
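To make the effect of skipping a property concrete, here is a rough sketch of how such a combined string could be built. This is an illustrative approximation, not Weaviate's actual implementation: the function `build_vectorizable_text`, the alphabetical ordering, the lowercasing and the exclusion of non-text values are all assumptions chosen to mirror the example strings above.

```python
def build_vectorizable_text(obj: dict, skipped: set) -> str:
    """Join property names and text values into one lowercase string.

    Properties listed in `skipped` are left out. Illustrative approximation
    only, not Weaviate's actual implementation.
    """
    parts = []
    for name in sorted(obj):  # alphabetical order (an assumption)
        if name in skipped or not isinstance(obj[name], str):
            continue  # leave out skipped properties and non-text values
        parts.append(f"{name} {obj[name]}".lower())
    return " ".join(parts)


quiz_item = {
    "round": "Jeopardy!",
    "value": 200,
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
}

with_round = build_vectorizable_text(quiz_item, skipped=set())
without_round = build_vectorizable_text(quiz_item, skipped={"round"})
print(with_round)
print(without_round)
```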
Importantly, excluding the round column from vectorization will have no impact on our ability to filter the results based on the round value. So if you wanted to only search a set of Double Jeopardy! questions, you still can.
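As a hedged sketch of what such a filtered search might look like with the same Python client used on this page: the helper names `round_filter` and `search_in_round` are hypothetical, and `search_in_round` needs a live, populated Weaviate instance, so only the filter itself is printed here.

```python
def round_filter(round_name: str) -> dict:
    """Build a `where` filter that matches objects by their `round` value."""
    return {
        "path": ["round"],
        "operator": "Equal",
        "valueText": round_name,
    }


def search_in_round(client, concept: str, round_name: str, limit: int = 3) -> dict:
    """Run a semantic search restricted to one round (needs a live client)."""
    return (
        client.query.get("JeopardyQuestion", ["question", "answer", "round"])
        .with_near_text({"concepts": [concept]})
        .with_where(round_filter(round_name))
        .with_limit(limit)
        .do()
    )


print(round_filter("Double Jeopardy!"))
```

The filter works on the stored property value, which is independent of whether that property contributed to the vector.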
Create the class
We can now add the class to the schema.
class_obj = {
    # Class & property definitions
    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],
    # Specify a vectorizer
    "vectorizer": "text2vec-openai",
    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        }
    },
}
# End class definition
client.schema.create_class(class_obj)
Now, you can check that the class has been created successfully by retrieving its schema:
client.schema.get("JeopardyQuestion")
See the full schema response
{
"class": "JeopardyQuestion",
"invertedIndexConfig": {
"bm25": {
"b": 0.75,
"k1": 1.2
},
"cleanupIntervalSeconds": 60,
"stopwords": {
"additions": null,
"preset": "en",
"removals": null
}
},
"moduleConfig": {
"text2vec-openai": {
"model": "ada",
"modelVersion": "002",
"type": "text",
"vectorizeClassName": false
}
},
"properties": [
{
"dataType": [
"text"
],
"indexFilterable": true,
"indexSearchable": true,
"moduleConfig": {
"text2vec-openai": {
"skip": true,
"vectorizePropertyName": false
}
},
"name": "round",
"tokenization": "word"
},
{
"dataType": [
"int"
],
"indexFilterable": true,
"indexSearchable": false,
"moduleConfig": {
"text2vec-openai": {
"skip": false,
"vectorizePropertyName": false
}
},
"name": "value"
},
{
"dataType": [
"text"
],
"indexFilterable": true,
"indexSearchable": true,
"moduleConfig": {
"text2vec-openai": {
"skip": false,
"vectorizePropertyName": false
}
},
"name": "question",
"tokenization": "word"
},
{
"dataType": [
"text"
],
"indexFilterable": true,
"indexSearchable": true,
"moduleConfig": {
"text2vec-openai": {
"skip": false,
"vectorizePropertyName": false
}
},
"name": "answer",
"tokenization": "word"
}
],
"replicationConfig": {
"factor": 1
},
"shardingConfig": {
"virtualPerPhysical": 128,
"desiredCount": 1,
"actualCount": 1,
"desiredVirtualCount": 128,
"actualVirtualCount": 128,
"key": "_id",
"strategy": "hash",
"function": "murmur3"
},
"vectorIndexConfig": {
"skip": false,
"cleanupIntervalSeconds": 300,
"maxConnections": 32,
"efConstruction": 128,
"ef": -1,
"dynamicEfMin": 100,
"dynamicEfMax": 500,
"dynamicEfFactor": 8,
"vectorCacheMaxObjects": 1000000000000,
"flatSearchCutoff": 40000,
"distance": "cosine",
"pq": {
"enabled": false,
"segments": 0,
"centroids": 256,
"encoder": {
"type": "kmeans",
"distribution": "log-normal"
}
}
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-openai"
}
Although we've defined a lot of details here, the retrieved schema is still longer. The additional details relate to the vector index, the inverted index, sharding and tokenization. We'll cover many of those as we go.
If you see a schema that is close to the example response - awesome! You're ready to import the data.
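If you'd rather check "close to the example" programmatically, one possible approach is to compare only the parts we defined ourselves. The sketch below assumes the retrieved schema is a dictionary shaped like the response above; `schema_matches` and `EXPECTED_PROPERTIES` are hypothetical names, and `example_schema` is a trimmed-down stand-in for the real response.

```python
EXPECTED_PROPERTIES = {
    "round": "text",
    "value": "int",
    "question": "text",
    "answer": "text",
}


def schema_matches(schema: dict) -> bool:
    """Check the class name, vectorizer, property names and datatypes."""
    found = {
        prop["name"]: prop["dataType"][0]
        for prop in schema.get("properties", [])
    }
    return (
        schema.get("class") == "JeopardyQuestion"
        and schema.get("vectorizer") == "text2vec-openai"
        and found == EXPECTED_PROPERTIES
    )


# Trimmed-down stand-in for the result of client.schema.get("JeopardyQuestion").
example_schema = {
    "class": "JeopardyQuestion",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "round", "dataType": ["text"]},
        {"name": "value", "dataType": ["int"]},
        {"name": "question", "dataType": ["text"]},
        {"name": "answer", "dataType": ["text"]},
    ],
}

print(schema_matches(example_schema))
```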
Import data
Here, we'll show you how to import the requisite data, including how to configure and use a batch.
Load data
We've made the data available online - so, fetch and load it like so:
import requests
import json
url = 'https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/jeopardy_100.json'
resp = requests.get(url)
data = json.loads(resp.text)
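Before importing, it can be worth confirming that every row has the columns we need. The following is a small sketch that assumes the loaded data is a list of dictionaries keyed by the column names from the preview table. The helper `rows_missing_keys` and the `sample_rows` list are made up for illustration.

```python
REQUIRED_KEYS = ("Round", "Value", "Question", "Answer")


def rows_missing_keys(rows: list) -> list:
    """Return the indexes of rows that lack any column we plan to import."""
    return [
        index
        for index, row in enumerate(rows)
        if any(key not in row for key in REQUIRED_KEYS)
    ]


# Small stand-in for the downloaded `data` list.
# The second row is deliberately missing its "Value" column.
sample_rows = [
    {"Round": "Jeopardy!", "Value": 200, "Question": "q1", "Answer": "a1"},
    {"Round": "Double Jeopardy!", "Question": "q2", "Answer": "a2"},
]

print(rows_missing_keys(sample_rows))
```

An empty list means every row is ready to be turned into a data object.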
Configure batch and import data
And let's set up a batch import process. As mentioned earlier, the batch import process in Weaviate can send data in bulk and in parallel.
In Python, we recommend that you use a context manager like:
with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    ...  # Build data objects and add them to the batch here
Note the use of the parameters batch_size and num_workers. They specify the number of objects sent per batch and the number of worker threads used to send batches in parallel.
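To get a feel for what the batch size means for our 100 objects, the sketch below splits a list into chunks of a given size. It illustrates the arithmetic only and does not reproduce the client's internal logic. `split_into_batches` is a hypothetical helper.

```python
def split_into_batches(objects: list, batch_size: int) -> list:
    """Split `objects` into consecutive chunks of at most `batch_size` items."""
    return [
        objects[start:start + batch_size]
        for start in range(0, len(objects), batch_size)
    ]


objects = list(range(100))  # stand-in for our 100 quiz items

# With batch_size=200, all 100 objects fit into a single batch.
print([len(chunk) for chunk in split_into_batches(objects, 200)])

# A smaller batch size would produce several batches instead.
print([len(chunk) for chunk in split_into_batches(objects, 40)])
```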
Then, the next step is to build data objects & add them to the batch process. We build objects (as Python dictionaries) by passing data from corresponding columns to the right Weaviate property, and the client will take care of when to send them.
with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    # Build data objects & add to batch
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion"
        )
Then, let's check that we've got the right number of objects imported:
assert client.query.aggregate("JeopardyQuestion").with_meta_count().do()["data"]["Aggregate"]["JeopardyQuestion"][0]["meta"]["count"] == 100
If this assertion passes without raising an AssertionError, you've successfully populated your Weaviate instance!
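The one-line check above packs a lot into a single expression. As a sketch, the count lookup can be pulled into a small helper, which makes the nesting of the response easier to see. `extract_count` is a hypothetical name, and `example_response` is a hand-written stand-in shaped like the response the assertion indexes into.

```python
def extract_count(response: dict, class_name: str) -> int:
    """Pull the object count out of an aggregate query response."""
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]


# Stand-in for the result of:
# client.query.aggregate("JeopardyQuestion").with_meta_count().do()
example_response = {
    "data": {"Aggregate": {"JeopardyQuestion": [{"meta": {"count": 100}}]}}
}

count = extract_count(example_response, "JeopardyQuestion")
print(count)
```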
What happens if this runs again?
Before we go on, I have a question. What do you think will happen if you run the above import script again?
The answer is...
That you will end up with duplicate items!
Weaviate does not check if you are uploading items with the same properties as ones that exist already. And since the import script did not provide an ID, Weaviate will simply assign a new, random ID, and create new objects.
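The effect can be simulated without a Weaviate instance. In this sketch, a plain dictionary stands in for the stored objects and Python's `uuid.uuid4` stands in for the random ID assignment. Both are assumptions made for illustration.

```python
import uuid

store = {}  # stand-in for the objects held by a Weaviate class


def import_without_id(obj: dict) -> None:
    """Store `obj` under a freshly generated random ID."""
    store[str(uuid.uuid4())] = obj


quiz_item = {
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
}

import_without_id(quiz_item)
import_without_id(quiz_item)  # the "second run" of the import script

print(len(store))  # the same item is now stored twice
```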
Specify object UUID
You could specify an object UUID at import time to serve as the object identifier. The Weaviate Python client, for example, provides a function to create a deterministic UUID based on an object. So, it could be added to our import script as shown below:
from weaviate.util import generate_uuid5
with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )
This creates objects whose UUID is derived from the object properties. Accordingly, if the object properties remain the same, so will the UUID.
Running the above script multiple times will not cause the number of objects to increase.
Because the UUID is based on the object properties, it will still create new objects in case some property has changed. So, when you design your import process, consider what properties might change, and how you would want Weaviate to behave in these scenarios.
You could then, for instance, create the UUID from a subset of unique properties so that changed objects are overwritten, or create it from the entire set of properties so that only exact duplicates are prevented.
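The two strategies can be compared in a small sketch. It uses Python's standard `uuid.uuid5` as a stand-in for the client's `generate_uuid5`, so the UUIDs it produces are not claimed to match the client's output. The helper `deterministic_uuid` and the sample objects are hypothetical.

```python
import json
import uuid


def deterministic_uuid(obj: dict, keys=None) -> str:
    """Derive a UUID from `obj`, optionally using only the properties in `keys`."""
    subset = obj if keys is None else {key: obj[key] for key in keys}
    # Serialize with sorted keys so the same properties always give the same string.
    serialized = json.dumps(subset, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, serialized))


original = {"question": "q", "answer": "faint paint", "value": 200}
corrected = {**original, "value": 400}  # same quiz item, one property changed

# UUID from all properties: the changed object gets a different UUID.
same_full = deterministic_uuid(original) == deterministic_uuid(corrected)
print(same_full)

# UUID from a stable subset: the changed object keeps its UUID.
stable_keys = ["question", "answer"]
same_subset = (
    deterministic_uuid(original, stable_keys)
    == deterministic_uuid(corrected, stable_keys)
)
print(same_subset)
```

With the full set of properties, a changed object would be imported as a new object. With the stable subset, it would reuse the existing UUID.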
Full import script
Putting it all together, we get the following import script:
# ===== Instantiate Weaviate client w/ auth config =====
import weaviate
from weaviate.util import generate_uuid5
import requests
import json
client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL",  # Replace with your Weaviate endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace with your Weaviate instance API key. Delete if authentication is disabled.
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
)
# Define the class
class_obj = {
    # Class & property definitions
    "class": "JeopardyQuestion",
    "properties": [
        {
            "name": "round",
            "dataType": ["text"],
            # Property-level module configuration for `round`
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": True,
                }
            },
            # End of property-level module configuration
        },
        {
            "name": "value",
            "dataType": ["int"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],
    # Specify a vectorizer
    "vectorizer": "text2vec-openai",
    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        }
    },
}
# End class definition
client.schema.create_class(class_obj)
# Finished creating the class
url = 'https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/jeopardy_100.json'
resp = requests.get(url)
data = json.loads(resp.text)
# Context manager for batch import
with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    # Build data objects & add to batch
    for i, row in enumerate(data):
        question_object = {
            "question": row["Question"],
            "answer": row["Answer"],
            "value": row["Value"],
            "round": row["Round"],
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )
Review
Key takeaways
We have:
- Downloaded a small dataset of Jeopardy! questions and answers.
- Built a schema and imported our data.
- Verified the successful import by checking the object count in Weaviate.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.