Imports in detail
Overview
In this section, we will explore data import, including details of the batch import process. We will discuss points such as how vectors are imported, what a batch import is, how to manage errors, and some advice on optimization.
Prerequisites
Before you start this tutorial, follow the steps in the earlier tutorials so that you have:
- A running Weaviate instance (e.g. on the Weaviate Cloud),
- An API key for your preferred inference API, such as OpenAI, Cohere, or Hugging Face,
- Your preferred Weaviate client library installed, and
- A Question class set up in your schema.
  - You can follow the Quickstart guide or the schema tutorial to construct the Question class if you have not already.
We will use the dataset below. We suggest that you download it to your working directory.
Import setup
As mentioned in the schema tutorial, the schema specifies the data structure for Weaviate.
So the data import must map properties of each record to those of the relevant class in the schema. In this case, the relevant class is Question as defined in the previous section.
Data object structure
Each Weaviate data object is structured as follows:
{
  "class": "<class name>",  // as defined during schema creation
  "id": "<UUID>",  // optional, must be in UUID format.
  "properties": {
    "<property name>": "<property value>",  // specified in dataType defined during schema creation
  }
}
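To make the structure concrete, here is a minimal Python sketch that assembles one such object as a plain dictionary. The helper name `build_object` and the sample record are illustrative assumptions, not part of any Weaviate client; the only rule it enforces is the one stated above, that an `id`, if given, must be a valid UUID.

```python
import uuid
from typing import Any, Dict, Optional


def build_object(class_name: str, properties: Dict[str, Any],
                 object_id: Optional[str] = None) -> Dict[str, Any]:
    """Assemble a dict in the class / id / properties layout (hypothetical helper)."""
    obj: Dict[str, Any] = {"class": class_name, "properties": properties}
    if object_id is not None:
        # uuid.UUID raises ValueError if the string is not a well-formed UUID.
        obj["id"] = str(uuid.UUID(object_id))
    return obj


# A made-up record, using the same keys as the import code in this tutorial.
record = {"Answer": "example answer", "Question": "example question", "Category": "EXAMPLE"}
question_object = build_object(
    "Question",
    {
        "answer": record["Answer"],
        "question": record["Question"],
        "category": record["Category"],
    },
)
print(question_object)
```

The client libraries normally build this structure for you; the sketch only shows what ends up in the request.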
Most commonly, Weaviate users import data through a Weaviate client library.
It is worth noting, however, that data is ultimately added through the RESTful API, either through the objects endpoint or the batch endpoint.
As the names suggest, the use of these endpoints depends on whether objects are being imported as batches or individually.
To batch or not to batch
For importing data, we strongly suggest that you use batch imports unless you have a specific reason not to. Batch imports can greatly improve performance by sending multiple objects in a single request.
We note that batch imports are carried out through the batch REST endpoint.
Batch import process
A batch import process generally looks like this:
- Connect to your Weaviate instance
- Load objects from the data file
- Prepare a batch process
- Loop through the records
  - Parse each record and build an object
  - Push the object through a batch process
- Flush the batch process – in case there are any remaining objects in the buffer
Here is the full code you need to import the Question objects:
Python:
import json

import requests
import weaviate

client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL/",  # Replace with your Weaviate endpoint
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"  # Or "X-Cohere-Api-Key" or "X-HuggingFace-Api-Key"
    }
)

# ===== import data =====
# Load data
url = 'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json'
resp = requests.get(url)
data = json.loads(resp.text)

# Prepare a batch process
client.batch.configure(batch_size=100)  # Configure batch
with client.batch as batch:
    # Batch import all Questions; leaving the `with` block flushes any remaining objects
    for i, d in enumerate(data):
        # print(f"importing question: {i+1}")  # To see imports
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        batch.add_data_object(properties, "Question")
JS/TS Client v2:
import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'https',
  host: 'WEAVIATE_INSTANCE_URL',  // Replace with your Weaviate endpoint
  headers: { 'X-OpenAI-Api-Key': 'YOUR-OPENAI-API-KEY' },  // Replace with your API key
});

async function getJsonData() {
  const file = await fetch('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json');
  return file.json();
}

async function importQuestions() {
  // Fetch the data from the remote JSON file
  const data = await getJsonData();

  // Prepare a batcher
  let batcher = client.batch.objectsBatcher();
  let counter = 0;
  const batchSize = 100;

  for (const question of data) {
    // Construct an object with a class and properties 'answer', 'question' and 'category'
    const obj = {
      class: 'Question',
      properties: {
        answer: question.Answer,
        question: question.Question,
        category: question.Category,
      },
    };

    // Add the object to the batch queue
    batcher = batcher.withObject(obj);
    counter++;

    // When the batch holds exactly batchSize objects, push them to Weaviate
    if (counter === batchSize) {
      // Flush the batch queue
      await batcher.do();

      // Restart the batch queue
      counter = 0;
      batcher = client.batch.objectsBatcher();
    }
  }

  // Flush the remaining objects, if any are still queued
  if (counter > 0) {
    await batcher.do();
  }
}

await importQuestions();
There are a couple of things to note here.
Batch size
Some clients include this as a parameter (e.g. batch_size in the Python client), or it can be manually set by periodically flushing the batch.
Typically, a size between 20 and 100 is a reasonable starting point, although this depends on the size of each data object. A smaller size may be preferable for larger data objects, for example when a vector is included with each uploaded object.
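For clients without a batch size parameter, "periodically flushing" amounts to splitting the records into fixed-size chunks and sending one chunk per request. The following is a minimal Python sketch of that splitting step only; `chunked` is a hypothetical helper name, and the actual send call is left out because it depends on the client you use.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# 250 placeholder records split with a batch size of 100.
batch_lengths = [len(chunk) for chunk in chunked(range(250), 100)]
print(batch_lengths)  # [100, 100, 50]
```

The last chunk is usually smaller than the batch size, which is why the import process ends with a final flush.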
Where are the vectors?
You may have noticed that we do not provide a vector. As a vectorizer is specified in our schema, Weaviate will send a request to the appropriate module (text2vec-openai in this case) to vectorize the data, and the vector in the response will be indexed and saved as a part of the data object.
Bring your own vectors
If you wish to upload your own vectors, you can do so with Weaviate. Refer to this page.
You can also manually upload existing vectors and use a vectorizer module for vectorizing queries.
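As a rough illustration, the sketch below queues objects together with precomputed vectors. It assumes the same Python client as the import example above, whose `add_data_object` accepts a `vector` keyword argument; the helper name `add_with_vectors` and the shape of `rows` (pairs of properties and vector) are assumptions made for this example.

```python
from typing import Any, Dict, Iterable, List, Tuple


def add_with_vectors(batch: Any, class_name: str,
                     rows: Iterable[Tuple[Dict[str, Any], List[float]]]) -> int:
    """Queue each (properties, vector) pair on the batch; return how many were queued."""
    count = 0
    for properties, vector in rows:
        # Assumption: the batch object exposes add_data_object(..., vector=...),
        # as the Python client's batch does in the import example of this tutorial.
        batch.add_data_object(properties, class_name, vector=vector)
        count += 1
    return count


# Usage with a real client would look like this (not executed here):
# with client.batch as batch:
#     add_with_vectors(batch, "Question", rows)
```

Each vector must have the dimensionality that the class expects; the helper does not check this.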
Confirm data import
You can quickly check the imported objects by opening <weaviate-endpoint>/v1/objects in a browser, like this (replace with your Weaviate endpoint):
https://some-endpoint.semi.network/v1/objects
Or you can read the objects in your project, like this:
Python:
import json

import weaviate

client = weaviate.Client("https://WEAVIATE_INSTANCE_URL/")  # Replace with your Weaviate endpoint

some_objects = client.data_object.get()
print(json.dumps(some_objects))
JS/TS Client v2:
import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'https',
  host: 'WEAVIATE_INSTANCE_URL',  // Replace with your Weaviate endpoint
});

const response = await client
  .data
  .getter()
  .do();
console.log(JSON.stringify(response, null, 2));
The result should look something like this:
{
  "deprecations": null,
  "objects": [
    ...  // Details of each object
  ],
  "totalResults": 10  // You should see 10 results here
}
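If you would rather check the response in code than by eye, a small tally per class can help. This Python sketch assumes the response is a dictionary shaped like the result above and that each returned object carries a "class" key, as in the data object structure; `count_by_class` is a hypothetical helper, and the sample response is hand-written for illustration.

```python
from collections import Counter
from typing import Any, Dict


def count_by_class(response: Dict[str, Any]) -> Counter:
    """Count returned objects per class name (hypothetical helper)."""
    objects = response.get("objects") or []
    return Counter(obj.get("class", "<unknown>") for obj in objects)


# Hand-written sample in the same layout as the result shown above.
sample_response = {
    "deprecations": None,
    "objects": [{"class": "Question"}, {"class": "Question"}],
    "totalResults": 2,
}
print(count_by_class(sample_response))
```

With the Python example above, you would pass `some_objects` instead of the sample.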
Data import - best practices
When importing large datasets, it may be worth planning out an optimized import strategy. Here are a few things to keep in mind.
- The most likely bottleneck is the import script. Accordingly, aim to max out all the CPUs available.
- To use multiple CPUs efficiently, enable sharding when you import data. For the fastest imports, enable sharding even on a single node.
- Use parallelization; if the CPUs are not maxed out, just add another import process.
- Use htop when importing to see if all CPUs are maxed out.
- To avoid out-of-memory issues during imports, set LIMIT_RESOURCES to True or configure the GOMEMLIMIT environment variable. For details, see Environment variables.
- For Kubernetes, a few large machines are faster than many small machines (due to network latency).
Our rules of thumb are:
- You should always use batch import.
- Use multiple shards.
- As mentioned above, max out your CPUs (on the Weaviate cluster). Often your import script is the bottleneck.
- Process error messages.
- Some clients (e.g. Python) have some built-in logic to efficiently control batch importing.
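One way to parallelize the import script is to split the records into partitions and run one importer per partition. The sketch below is a minimal Python outline under stated assumptions: `partition` and `import_in_parallel` are hypothetical helpers, `import_fn` stands in for your own function that batch-imports one partition and returns the number of objects it sent, and a thread pool is used here only as one possible choice (separate OS processes are another).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def partition(records: Sequence[T], n_parts: int) -> List[List[T]]:
    """Split records round-robin into n_parts lists of near-equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    return [list(records[i::n_parts]) for i in range(n_parts)]


def import_in_parallel(records: Sequence[T], n_workers: int,
                       import_fn: Callable[[List[T]], int]) -> int:
    """Run import_fn on each non-empty partition concurrently; return the summed results."""
    parts = [part for part in partition(records, n_workers) if part]
    if not parts:
        return 0
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return sum(pool.map(import_fn, parts))


# Stand-in importer for demonstration: it only counts the records it was given.
print(import_in_parallel(list(range(10)), 3, len))  # 10
```

If the CPUs on the Weaviate side are still not maxed out, raise the number of workers and measure again.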
Error handling
We recommend that you implement error handling at an object level, such as in this example.
200 status code != 100% batch success
It is important to note that an HTTP 200 status code only indicates that the request has been successfully sent to Weaviate. In other words, there were no issues with the connection or processing of the batch and no malformed request.
A request with a 200 response may still include object-level errors, which is why error handling is critical.
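Object-level error handling can be sketched as a function that walks the batch response and collects every object that reported an error. The Python sketch below assumes the response is a list with one entry per object, and that a failed entry carries its messages under `result.errors.error`; treat that layout, and the helper name `collect_batch_errors`, as assumptions to verify against the responses your Weaviate version returns.

```python
from typing import Any, Dict, List, Optional


def collect_batch_errors(results: Optional[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Return one entry per failed object, with its id and error messages."""
    failed: List[Dict[str, Any]] = []
    for item in results or []:
        # Assumed layout: {"id": ..., "result": {"errors": {"error": [{"message": ...}]}}}
        errors = (item.get("result") or {}).get("errors") or {}
        messages = [err.get("message", "") for err in errors.get("error") or []]
        if messages:
            failed.append({"id": item.get("id"), "messages": messages})
    return failed


# Hand-written sample response: one success and one failure.
sample_results = [
    {"id": "a", "result": {"status": "SUCCESS"}},
    {"id": "b", "result": {"errors": {"error": [{"message": "example failure"}]}}},
]
print(collect_batch_errors(sample_results))
```

With the Python client used in this tutorial, a function like this can be wrapped in a callback and passed to `client.batch.configure(batch_size=100, callback=...)`, so that each batch response is checked as it arrives.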
Recap
- Data to be imported should match the database schema
- Use batch import unless you have a good reason not to
- For importing large datasets, make sure to consider and optimize your import strategy.
Suggested reading
- Tutorial: Schemas in detail
- Tutorial: Queries in detail
- Tutorial: Introduction to modules
- Tutorial: Introduction to Weaviate Console
Other object operations
All other CRUD object operations are available in the manage-data section.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.