Batch import

Batch imports are an efficient way to add multiple data objects and cross-references.

To create a bulk import job, follow these steps:

  1. Initialize a batch object.
  2. Add items to the batch object.
  3. Ensure that the last batch is sent (flushed).
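The context-manager pattern used in the Python examples below handles step 3 automatically: exiting the `with` block flushes whatever is still in the buffer. A toy stand-in (illustrative only, not the Weaviate client) sketches why the final flush matters:

```python
class ToyBatch:
    """Illustrative only: buffers objects and flushes on exit."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.buffer = []
        self.sent = []  # each entry is one "request" of objects

    def add_object(self, properties: dict):
        self.buffer.append(properties)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def _flush(self):
        if self.buffer:
            self.sent.append(list(self.buffer))
            self.buffer.clear()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._flush()  # step 3: send the final, partial batch
        return False

with ToyBatch(batch_size=2) as batch:
    for i in range(5):
        batch.add_object({"title": f"Object {i+1}"})

print([len(chunk) for chunk in batch.sent])  # [2, 2, 1]
```

Without the flush on exit, the last partial batch (one object here) would never be sent.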

Basic import

The following example adds objects to the YourCollection collection.

    data_rows = [
        {"title": f"Object {i+1}"} for i in range(5)
    ]

    collection = client.collections.get("YourCollection")

    with collection.batch.dynamic() as batch:
        for data_row in data_rows:
            batch.add_object(
                properties=data_row,
            )

Specify an ID value

Weaviate generates a UUID for each object. Object IDs must be unique. If you set object IDs yourself, use a deterministic method such as generate_uuid5 to prevent duplicate IDs:

    from weaviate.util import generate_uuid5  # Generate a deterministic ID

    data_rows = [{"title": f"Object {i+1}"} for i in range(5)]

    collection = client.collections.get("YourCollection")

    with collection.batch.dynamic() as batch:
        for data_row in data_rows:
            obj_uuid = generate_uuid5(data_row)
            batch.add_object(
                properties=data_row,
                uuid=obj_uuid
            )
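The point of a deterministic ID is that re-importing the same record maps to the same object, so duplicates become updates instead of copies. The underlying idea can be sketched with the standard library's `uuid.uuid5`, which derives a stable UUID from a name (the namespace constant here is illustrative and is not what the Weaviate client uses internally):

```python
import uuid

def deterministic_id(record: dict) -> uuid.UUID:
    # uuid5 is a name-derived UUID: the same input always yields
    # the same ID. Sorting the items makes key order irrelevant.
    return uuid.uuid5(uuid.NAMESPACE_URL, str(sorted(record.items())))

a = deterministic_id({"title": "Object 1"})
b = deterministic_id({"title": "Object 1"})
c = deterministic_id({"title": "Object 2"})

print(a == b)  # True: identical records share an ID
print(a == c)  # False: different records get different IDs
```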

Specify a vector

Use the `vector` parameter to specify a vector for each object.

    data_rows = [{"title": f"Object {i+1}"} for i in range(5)]
    vectors = [[0.1] * 1536 for i in range(5)]

    collection = client.collections.get("YourCollection")

    with collection.batch.dynamic() as batch:
        for i, data_row in enumerate(data_rows):
            batch.add_object(
                properties=data_row,
                vector=vectors[i]
            )
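A common batch-import failure is a vector whose length does not match the collection's configured dimensionality; such objects are rejected server-side one by one. A small, hypothetical pre-flight check (the 1536 dimension matches the example above and is an assumption about your collection) can catch this before any network round trip:

```python
def check_dimensions(vectors, expected_dim: int = 1536):
    """Raise if any vector's length differs from expected_dim."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, expected {expected_dim}"
            )

vectors = [[0.1] * 1536 for _ in range(5)]
check_dimensions(vectors)           # passes silently
# check_dimensions([[0.1] * 10])    # would raise ValueError
```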

Specify named vectors

Added in v1.24

When you create an object, you can specify named vectors (if configured in your collection).

    data_rows = [{
        "title": f"Object {i+1}",
        "body": f"Body {i+1}"
    } for i in range(5)]

    title_vectors = [[0.12] * 1536 for _ in range(5)]
    body_vectors = [[0.34] * 1536 for _ in range(5)]

    collection = client.collections.get("YourCollection")

    with collection.batch.dynamic() as batch:
        for i, data_row in enumerate(data_rows):
            batch.add_object(
                properties=data_row,
                vector={
                    "title": title_vectors[i],
                    "body": body_vectors[i],
                }
            )

Python-specific batching

The Python clients have built-in batching methods to help you optimize your import speed. Please see the relevant documentation for more information.
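Conceptually, a fixed-size batching strategy groups objects into chunks of a set size and sends each chunk as one request; the client also handles retries and concurrency, which this stand-alone sketch (not the client's implementation) omits:

```python
from itertools import islice

def chunked(iterable, batch_size: int):
    """Yield successive lists of up to batch_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk  # the last, possibly partial chunk is the final flush

data_rows = [{"title": f"Object {i+1}"} for i in range(5)]
batches = list(chunked(data_rows, batch_size=3))
print([len(b) for b in batches])  # [3, 2]
```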

Stream data from large files

If your dataset is large, consider streaming the import to avoid out-of-memory issues.

    import ijson

    # Settings for displaying the import progress
    counter = 0
    interval = 100  # print progress every `interval` records

    print("JSON streaming, to avoid running out of memory on large files...")
    with client.batch.fixed_size(batch_size=200) as batch:
        with open("jeopardy_1k.json", "rb") as f:
            objects = ijson.items(f, "item")
            for obj in objects:
                properties = {
                    "question": obj["Question"],
                    "answer": obj["Answer"],
                }
                batch.add_object(
                    collection="JeopardyQuestion",
                    properties=properties,
                    # If you Bring Your Own Vectors, add the `vector` parameter here
                    # vector=obj.vector["default"]
                )

                # Calculate and display progress
                counter += 1
                if counter % interval == 0:
                    print(f"Imported {counter} objects...")

    print(f"Finished importing {counter} objects.")
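The ijson example above streams objects from a single large JSON array. If your data is line-delimited (JSON Lines) instead, the standard library is enough, since each line can be parsed independently; a sketch, using an in-memory buffer as a stand-in for a real file:

```python
import io
import json

def stream_jsonl(fp):
    """Lazily yield one object per JSON Lines record."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for a real file opened with open("data.jsonl")
sample = io.StringIO(
    '{"Question": "Q1", "Answer": "A1"}\n'
    '{"Question": "Q2", "Answer": "A2"}\n'
)
for obj in stream_jsonl(sample):
    print(obj["Question"])  # Q1, then Q2
```

Because `stream_jsonl` is a generator, only one record is held in memory at a time, so it can feed `batch.add_object` the same way the ijson loop does.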

Additional considerations

Added in v1.23.

To maximize import speed, enable asynchronous indexing and use gRPC batch imports.

Asynchronous indexing is an experimental feature in v1.22 and may not be suitable for production use. The Python client v4 uses gRPC. If you cannot use the new client, access the gRPC API directly.

To enable asynchronous indexing, set the ASYNC_INDEXING environment variable to true in your Weaviate configuration file:

    weaviate:
      image: cr.weaviate.io/semitechnologies/weaviate:1.24.0
      ...
      environment:
        ASYNC_INDEXING: 'true'
        ...