Imports in detail

🚧 To be updated 🚧

This tutorial is currently being updated to reflect the latest features and improvements in Weaviate. We appreciate your patience and invite you to check back soon for the updated content.

In this section, we will explore data import, including details of the batch import process. We will discuss points such as how vectors are imported, what a batch import is, how to manage errors, and some advice on optimization.

Prerequisites

Before you start this tutorial, follow the steps in the earlier tutorials so that you have:

An instance of Weaviate running (e.g. on the Weaviate Cloud),
An API key for your preferred inference API, such as OpenAI, Cohere, or Hugging Face,
Your preferred Weaviate client library installed, and
A Question class set up in your schema.
- If you have not done so already, you can follow the Quickstart guide or the schema tutorial to construct the Question class.

We will use the dataset below. We suggest that you download it to your working directory.

Download jeopardy_tiny.json

Import setup

As mentioned in the schema tutorial, the schema specifies the data structure for Weaviate.

So the data import must map properties of each record to those of the relevant class in the schema. In this case, the relevant class is Question as defined in the previous section.

Data object structure

Each Weaviate data object is structured as follows:

{
  "class": "<class name>",  // as defined during schema creation
  "id": "<UUID>",     // optional, must be in UUID format.
  "properties": {
    "<property name>": "<property value>", // specified in dataType defined during schema creation
  }
}
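
As a rough illustration of this structure, the sketch below builds such an object as a plain Python dictionary. The helper name build_object and the sample property values are hypothetical and are not part of any Weaviate client; the only behavior shown is that the optional id is checked to be a valid UUID.

```python
import uuid
from typing import Any, Dict, Optional


def build_object(class_name: str, properties: Dict[str, Any],
                 object_id: Optional[str] = None) -> Dict[str, Any]:
    """Return a dict shaped like the data object structure shown above.

    Hypothetical helper for illustration only.
    """
    obj: Dict[str, Any] = {"class": class_name, "properties": properties}
    if object_id is not None:
        # uuid.UUID raises ValueError if the string is not in UUID format.
        obj["id"] = str(uuid.UUID(object_id))
    return obj


# Placeholder values, not taken from the tutorial dataset.
question = build_object(
    "Question",
    {"answer": "sample answer", "question": "sample question", "category": "SAMPLE"},
)
print(question)
```

Omitting object_id leaves the id key out of the dictionary entirely, which mirrors the "optional" note in the structure above.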

Most commonly, Weaviate users import data through a Weaviate client library.

It is worth noting, however, that data is ultimately added through the RESTful API, either through the objects endpoint or the batch endpoint.

As the names suggest, the use of these endpoints depends on whether objects are being imported as batches or individually.
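
To make the difference concrete, here is a minimal sketch of the request bodies for the two cases. It only builds the JSON payloads and sends nothing. The paths /v1/objects and /v1/batch/objects and the "objects" wrapper key reflect our understanding of the REST API and should be checked against the API reference for your Weaviate version; the helper names are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple

OBJECTS_PATH = "/v1/objects"      # assumed path: one object per request
BATCH_PATH = "/v1/batch/objects"  # assumed path: many objects per request


def single_request(obj: Dict[str, Any]) -> Tuple[str, str]:
    """Return (path, JSON body) for importing one object."""
    return OBJECTS_PATH, json.dumps(obj)


def batch_request(objs: List[Dict[str, Any]]) -> Tuple[str, str]:
    """Return (path, JSON body) for importing several objects at once."""
    return BATCH_PATH, json.dumps({"objects": objs})


# Three placeholder objects following the data object structure.
objs = [{"class": "Question", "properties": {"answer": f"a{i}"}} for i in range(3)]

individual = [single_request(o) for o in objs]  # three requests
path, body = batch_request(objs)                # one request
print(len(individual), path)
```

With individual imports the number of requests grows with the number of objects, while a batch carries all of them in a single request body.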

To batch or not to batch

For importing data, we strongly suggest that you use batch imports unless you have a specific reason not to. Batch imports can greatly improve performance by sending multiple objects in a single request.

We note that batch imports are carried out through the batch REST endpoint.

Batch import process

A batch import process generally looks like this:

Connect to your Weaviate instance
Load objects from the data file
Prepare a batch process
Loop through the records
1. Parse each record and build an object
2. Push the object through a batch process
Flush the batch process – in case there are any remaining objects in the buffer

Here is the full code you need to import the Question objects:

Python
JS/TS Client v2

import json

import requests
import weaviate

client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL/",  # Replace with your Weaviate endpoint
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"  # Or "X-Cohere-Api-Key" or "X-HuggingFace-Api-Key"
    }
)

# ===== import data =====
# Load data
url = 'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json'
resp = requests.get(url)
data = json.loads(resp.text)

# Prepare a batch process
client.batch.configure(batch_size=100)  # Configure batch
with client.batch as batch:
    # Batch import all Questions
    for i, d in enumerate(data):
        # print(f"importing question: {i+1}")  # To see imports

        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }

        batch.add_data_object(properties, "Question")

import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'https',
  host: 'WEAVIATE_INSTANCE_URL',  // Replace with your Weaviate endpoint
  headers: { 'X-OpenAI-Api-Key': 'YOUR-OPENAI-API-KEY' },  // Replace with your API key
});

async function getJsonData() {
  const file = await fetch('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json');
  return file.json();
}

async function importQuestions() {
  // Fetch the data from the jeopardy_tiny.json file
  const data = await getJsonData();

  // Prepare a batcher
  let batcher = client.batch.objectsBatcher();
  let counter = 0;
  const batchSize = 100;

  for (const question of data) {
    // Construct an object with a class and properties 'answer' and 'question'
    const obj = {
      class: 'Question',
      properties: {
        answer: question.Answer,
        question: question.Question,
        category: question.Category,
      },
    };

    // add the object to the batch queue
    batcher = batcher.withObject(obj);

    // When the batch counter reaches batchSize, push the objects to Weaviate
    if (++counter === batchSize) {
      // flush the batch queue
      await batcher.do();

      // restart the batch queue
      counter = 0;
      batcher = client.batch.objectsBatcher();
    }
  }

  // Flush the remaining objects
  await batcher.do();
}

await importQuestions();

There are a couple of things to note here.

Batch size

Some clients expose the batch size as a parameter (e.g. batch_size in the Python client). In other clients, you set it manually by flushing the batch at regular intervals.

Typically, a size between 20 and 100 is a reasonable starting point, although this depends on the size of each data object. A smaller size may be preferable for larger data objects, for example when each uploaded object includes its vector.
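
For clients without a batch size parameter, manual flushing amounts to grouping records into fixed-size chunks and sending one chunk per request. The sketch below shows only the grouping step, using the standard library; the name chunked is our own and the send step is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# 250 records with a batch size of 100 give two full batches and one partial batch.
sizes = [len(chunk) for chunk in chunked(range(250), 100)]
print(sizes)  # [100, 100, 50]
```

The final, shorter chunk corresponds to the "flush the remaining objects" step in the import process above.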

Where are the vectors?

You may have noticed that we do not provide a vector. As a vectorizer is specified in our schema, Weaviate will send a request to the appropriate module (text2vec-openai in this case) to vectorize the data, and the vector in the response will be indexed and saved as a part of the data object.

Bring your own vectors

If you wish to upload your own vectors, you can do so with Weaviate. Refer to this page.

You can also manually upload existing vectors and use a vectorizer module for vectorizing queries.
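
As a hedged sketch of what supplying your own vectors can look like with the Python client used in this tutorial, the helper below passes a vector alongside each object's properties. We believe add_data_object accepts a vector keyword argument in that client, but please confirm this against the client reference for your version. RecordingBatch is a stand-in we define here so the sketch runs without a Weaviate instance, and the three-dimensional vectors are made-up placeholders.

```python
from typing import Any, Dict, List, Optional, Sequence


class RecordingBatch:
    """Stand-in for the client's batch object; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def add_data_object(self, data_object: Dict[str, Any], class_name: str,
                        uuid: Optional[str] = None,
                        vector: Optional[Sequence[float]] = None) -> None:
        self.calls.append(
            {"properties": data_object, "class": class_name, "vector": vector}
        )


def add_with_vectors(batch: Any, class_name: str,
                     records: Sequence[Dict[str, Any]],
                     vectors: Sequence[Sequence[float]]) -> None:
    """Add each record to the batch together with its own vector."""
    if len(records) != len(vectors):
        raise ValueError("records and vectors must have the same length")
    for properties, vector in zip(records, vectors):
        batch.add_data_object(properties, class_name, vector=vector)


batch = RecordingBatch()
records = [{"answer": "a1"}, {"answer": "a2"}]  # placeholder properties
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # placeholder vectors
add_with_vectors(batch, "Question", records, vectors)
print(len(batch.calls))
```

In a real import you would pass the batch object from the client's batch context instead of RecordingBatch, and the vectors would need to match the dimensionality expected by the class.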

Confirm data import

You can quickly check the imported object by opening <weaviate-endpoint>/v1/objects in a browser, like this (replace with your Weaviate endpoint):

https://some-endpoint.semi.network/v1/objects

Or you can read the objects in your project, like this:

Python
JS/TS Client v2

import weaviate
import json

client = weaviate.Client("https://WEAVIATE_INSTANCE_URL/")  # Replace with your Weaviate endpoint
some_objects = client.data_object.get()
print(json.dumps(some_objects))

import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'https',
  host: 'WEAVIATE_INSTANCE_URL',  // Replace with your Weaviate endpoint
});

const response = await client
  .data
  .getter()
  .do();
console.log(JSON.stringify(response, null, 2));

The result should look something like this:

{
    "deprecations": null,
    "objects": [
        ...  // Details of each object
    ],
    "totalResults": 10  // You should see 10 results here
}
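
If you would rather check the result in code than by eye, a small helper along these lines can summarize the response. It assumes the response is a dictionary with the objects and totalResults keys shown above and that each object carries a class key; the function name and the sample response are illustrative.

```python
from collections import Counter
from typing import Any, Dict, Tuple


def summarize_objects(response: Dict[str, Any]) -> Tuple[int, Counter]:
    """Return (total reported, objects counted per class) for a response dict."""
    objects = response.get("objects") or []
    total = response.get("totalResults", len(objects))
    per_class = Counter(obj.get("class", "<unknown>") for obj in objects)
    return total, per_class


# Made-up response in the shape shown above, with two objects for brevity.
sample_response = {
    "deprecations": None,
    "objects": [
        {"class": "Question", "properties": {"answer": "a1"}},
        {"class": "Question", "properties": {"answer": "a2"}},
    ],
    "totalResults": 2,
}

total, per_class = summarize_objects(sample_response)
print(total, dict(per_class))
```

For the tutorial dataset you would compare the reported total against the 10 imported questions.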

Data import - best practices

When importing large datasets, it may be worth planning out an optimized import strategy. Here are a few things to keep in mind.

The most likely bottleneck is the import script. Accordingly, aim to max out all the CPUs available.
To use multiple CPUs efficiently, enable sharding when you import data. For the fastest imports, enable sharding even on a single node.
Use parallelization; if the CPUs are not maxed out, just add another import process.
Use htop when importing to see if all CPUs are maxed out.
To avoid out-of-memory issues during imports, set LIMIT_RESOURCES to True or configure the GOMEMLIMIT environment variable. For details, see Environment variables.
For Kubernetes, a few large machines are faster than many small machines (due to network latency).
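
One way to act on the parallelization advice is to split the records into slices and hand each slice to its own worker. The sketch below uses threads from the standard library to keep the example short; running separate import processes is equally valid. The worker passed in as import_chunk is hypothetical: in a real script it would open its own client and run a batch import for its slice, whereas here a fake worker simply counts records.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def split_evenly(records: Sequence[Any], n_workers: int) -> List[List[Any]]:
    """Distribute records round-robin into n_workers slices."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [list(records[i::n_workers]) for i in range(n_workers)]


def parallel_import(records: Sequence[Any],
                    import_chunk: Callable[[List[Any]], int],
                    n_workers: int = 4) -> int:
    """Run import_chunk on each slice concurrently and return the total count."""
    slices = [s for s in split_evenly(records, n_workers) if s]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(import_chunk, slices))


def fake_import_chunk(chunk: List[Any]) -> int:
    """Hypothetical worker: a real one would batch-import the chunk."""
    return len(chunk)


imported = parallel_import(list(range(1000)), fake_import_chunk, n_workers=4)
print(imported)  # 1000
```

If CPU usage on the Weaviate side is still below its limit, increasing n_workers is the code-level equivalent of adding another import process.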

Our rules of thumb are:

You should always use batch import.
Use multiple shards.
As mentioned above, max out your CPUs (on the Weaviate cluster). Often your import script is the bottleneck.
Process error messages.
Some clients (e.g. Python) have some built-in logic to efficiently control batch importing.

Error handling

We recommend that you implement error handling at an object level, such as in this example.

200 status code != 100% batch success

It is important to note that an HTTP 200 status code only indicates that the request itself succeeded. In other words, there were no issues with the connection or with processing the batch as a whole, and the request was not malformed.

A request with a 200 response may still include object-level errors, which is why error handling is critical.
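
A minimal sketch of object-level checking is shown below. It assumes that the batch response is a list with one entry per object and that a failed object carries its messages under result, then errors, then error; this matches our understanding of the batch response but should be verified against the API reference for your version. The function name and the sample response are illustrative.

```python
from typing import Any, Dict, List, Optional, Tuple


def collect_batch_errors(
    results: Optional[List[Dict[str, Any]]],
) -> List[Tuple[Optional[str], List[str]]]:
    """Return (object id, error messages) for each object that failed."""
    failures = []
    for item in results or []:
        errors = (item.get("result") or {}).get("errors") or {}
        messages = [e.get("message", "") for e in errors.get("error") or []]
        if messages:
            failures.append((item.get("id"), messages))
    return failures


# Made-up response: the request returned 200, yet the second object failed.
sample_results = [
    {"id": "id-1", "result": {}},
    {"id": "id-2", "result": {"errors": {"error": [{"message": "sample failure"}]}}},
]

failed = collect_batch_errors(sample_results)
print(failed)  # [('id-2', ['sample failure'])]
```

Logging or retrying the returned objects is then a decision for your import script.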

Recap

Data to be imported should match the database schema.
Use batch import unless you have a good reason not to.
For importing large datasets, make sure to consider and optimize your import strategy.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.
