Import

Although importing itself is pretty straightforward, creating an optimized import strategy needs a bit of planning on your end. Hence, before we start with this guide, there are a few things to keep in mind.

  1. When importing, you want to make sure that you max out all the CPUs available. More often than not, the import script is the bottleneck.
    • Tip: use htop when importing to see if all CPUs are maxed out.
    • Learn more about how to plan your setup here.
  2. Use parallelization; if the CPUs are not maxed out, just add another import process.
  3. For Kubernetes, fewer large machines are faster than more small machines, simply because of network latency.
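The parallelization advice above can be sketched as splitting the objects into slices and handing each slice to its own worker. This is a minimal sketch, not a Weaviate API: split_into_chunks, run_parallel, and fake_import_chunk are hypothetical names, and threads are used only to keep the example small. Separate import processes, as the list suggests, would follow the same shape.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks roughly equal, contiguous slices."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, remainder = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `remainder` slices each take one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_parallel(items, import_chunk, n_workers=4):
    """Run import_chunk on each non-empty slice concurrently, keeping order."""
    chunks = [c for c in split_into_chunks(items, n_workers) if c]
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(import_chunk, chunks))


def fake_import_chunk(chunk):
    # Hypothetical stand-in: a real worker would open its own client
    # and batch-import the slice. Here it only reports the slice size.
    return len(chunk)
```

In a real script, the worker would replace fake_import_chunk, and the number of workers would be raised until htop shows the CPUs are busy.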

Importing​

First of all, some rules of thumb.

  • You should always use batch import.
  • As mentioned above, max out your CPUs (on the Weaviate cluster). Often your import script is the bottleneck.
  • Process error messages.
  • Some clients (especially Python) have some built-in logic to efficiently regulate batch importing.
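The "process error messages" rule above could look roughly like the sketch below. It assumes each entry in a batch response reports failures under result, then errors, then error, with a message field on each failure. That response shape and the function name collect_batch_errors are assumptions to verify against your Weaviate version.

```python
def collect_batch_errors(results):
    """Return (object id, message) pairs for every failed object.

    Assumes each entry looks like:
    {"id": ..., "result": {"errors": {"error": [{"message": ...}]}}}
    """
    failures = []
    for item in results or []:
        errors = (item.get("result") or {}).get("errors") or {}
        for err in errors.get("error", []):
            failures.append((item.get("id"), err.get("message")))
    return failures
```

A small wrapper that logs these pairs could be a candidate for the callback parameter of the batch configuration, provided your client version invokes the callback with the batch results.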

Assuming that you've read the schema quickstart tutorial, you import data based on the classes and properties defined in the schema.

For the purpose of this tutorial, we've prepared a data.json file, which contains a few Authors and Publications. Download it, and add it to your project.

Download data.json

Steps to follow​

Now, to import the data we need to follow these steps:

  1. Connect to your Weaviate instance
  2. Load objects from the data.json file
  3. Prepare a batch process
  4. Loop through all Publications
    • Parse each publication – to a structure expected by the language client of your choice
    • Push the object through a batch process
  5. Loop through all Authors
    • Parse each author – to a structure expected by the language client of your choice
    • Push the object through a batch process
  6. Flush the batch process – in case there are any remaining objects in the buffer

Here is the full code you need to import the Publications. The Authors example that follows is shorter, because it reuses the client, the loaded data, and the batch configuration set up here.

import json

import weaviate

client = weaviate.Client("https://some-endpoint.semi.network/")

# Load data from the data.json file (the file is closed automatically)
with open("data.json") as data_file:
    data = json.load(data_file)

# Configure a batch process
client.batch.configure(
    batch_size=100,
    dynamic=True,
    timeout_retries=3,
    callback=None,
)

# Batch import all Publications
with client.batch as batch:
    for publication in data["publications"]:
        print("importing publication: ", publication["name"])

        properties = {
            "name": publication["name"]
        }

        # Queue the object; the batch is sent once it is full
        batch.add_data_object(properties, "Publication", publication["id"], publication["vector"])

And here is the code to import Authors.


# Batch import all Authors
for author in data["authors"]:
    print("importing author: ", author["name"])

    properties = {
        "name": author["name"],
        "age": author["age"],
        "born": author["born"],
        "wonNobelPrize": author["wonNobelPrize"],
        "description": author["description"],
    }

    client.batch.add_data_object(properties, "Author", author["id"], author["vector"])

# Flush the remaining buffer to make sure all objects are imported
client.batch.flush()
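The Publications and Authors loops have the same shape, so they could be folded into one helper. The sketch below is one way to do that, not part of the Weaviate client: import_records and RecordingBatch are hypothetical names. The helper only assumes that the batch object offers add_data_object with the argument order used above.

```python
def import_records(batch, records, class_name, property_names):
    """Queue each record on `batch` and return how many were queued."""
    count = 0
    for record in records:
        # Copy only the properties the schema class defines.
        properties = {name: record[name] for name in property_names}
        batch.add_data_object(properties, class_name, record["id"], record["vector"])
        count += 1
    return count


class RecordingBatch:
    """Hypothetical stand-in for client.batch that only records calls."""

    def __init__(self):
        self.calls = []

    def add_data_object(self, properties, class_name, uuid=None, vector=None):
        self.calls.append((properties, class_name, uuid, vector))
```

With a real client, the call might look like import_records(client.batch, data["authors"], "Author", ["name", "age", "born", "wonNobelPrize", "description"]), followed by the flush shown above.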

You can quickly check the imported objects by opening the /v1/objects path of your Weaviate endpoint in a browser, like this:

https://some-endpoint.weaviate.network/v1/objects

Or you can read the objects in your project, like this:

import weaviate
import json

client = weaviate.Client("https://some-endpoint.weaviate.network/")

some_objects = client.data_object.get()
print(json.dumps(some_objects))
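To sanity-check an import, it can help to count the returned objects per class rather than read the raw JSON. The sketch below assumes the response is a dictionary with an "objects" list whose entries carry a "class" key. That shape and the name count_by_class are assumptions, and the listing may return only part of the data, so treat the counts as a spot check.

```python
from collections import Counter


def count_by_class(response):
    """Count objects per class in a response shaped like {"objects": [...]}."""
    return Counter(obj.get("class") for obj in response.get("objects", []))
```

For example, count_by_class(some_objects) could be printed after the import to see whether both Author and Publication objects arrived.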

Other object operations​

All other CRUD object operations are available in the objects RESTful API documentation and the batch RESTful API documentation.

Recap​

Importing into Weaviate needs some planning on your side. In almost all cases, you want to use the batch endpoint to create data objects. More often than not, the bottleneck sits in the import script and not in Weaviate. Try to optimize for maxing out all CPUs to get the fastest import speeds.

More Resources​

If you can't find the answer to your question here, please look at the following:

  1. Frequently Asked Questions.
  2. Knowledge base of old issues.
  3. For questions: Stackoverflow.
  4. For issues: Github.
  5. Ask your question in the Slack channel: Slack.