Import data

Code

This example imports the movie data into our collection.

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_wcs(...) or
# client = weaviate.connect_to_local(...)

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
data_resp = requests.get(data_url)
df = pd.DataFrame(data_resp.json())

# Load the embeddings (embeddings from the previous step)
embs_path = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024_embeddings.csv"
# Or load embeddings from a local file (if you generated them earlier)
# embs_path = "scratch/movies_data_1990_2024_embeddings.csv"

emb_df = pd.read_csv(embs_path)

# Get the collection
movies = client.collections.get("Movie")

# Enter context manager
with movies.batch.dynamic() as batch:
    # Loop through the data
    for i, movie in enumerate(df.itertuples(index=False)):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie.release_date, "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie.genre_ids)

        # Build the object payload
        movie_obj = {
            "title": movie.title,
            "overview": movie.overview,
            "vote_average": movie.vote_average,
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie.id,
        }

        # Get the vector
        vector = emb_df.iloc[i].to_list()

        # Add object (including vector) to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie.id),
            vector=vector,  # Add the custom vector
            # references=reference_obj  # You can add references here
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()

The code:

  • Loads the source data and embeddings, and gets the collection
  • Enters a context manager with a batcher (batch) object
  • Loops through the data and adds objects and corresponding vectors to the batcher
  • Prints the number of objects that failed to import, if any

Explain the code

Preparation

We use the requests library to load the data from the source, in this case a JSON file. The data is then converted to a Pandas DataFrame for easier manipulation.

Then, we create a collection object (with client.collections.get) so we can interact with the collection.

Batch context manager

The batch object is a context manager that queues the objects you add and sends them to Weaviate in batches. This is useful when you have a large amount of data to import, because it handles the batch size and the timing of each send for you.

with movies.batch.dynamic() as batch:

This example uses the .dynamic() method to create a dynamic batcher, which automatically determines and updates the batch size during the import process. There are also other batcher types, like .fixed_size() for specifying the number of objects per batch, and .rate_limit() for specifying the number of objects per minute.
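As a rough sketch of switching between batcher types, the helper below uses .fixed_size() when a batch size is given and .dynamic() otherwise. The name import_rows is hypothetical (it is not part of the Weaviate client), and the batch_size keyword is an assumption based on the v4 Python client, so check it against your installed version.

```python
def import_rows(collection, rows, batch_size=None):
    """Add each dict in `rows` to `collection` through a batch context manager.

    `import_rows` is a hypothetical helper for illustration only.
    Returns the number of objects that failed to import.
    """
    if batch_size is None:
        # Let the client choose and adjust the batch size during the import
        batcher = collection.batch.dynamic()
    else:
        # Assumed signature: fixed_size(batch_size=...) in the v4 Python client
        batcher = collection.batch.fixed_size(batch_size=batch_size)

    with batcher as batch:
        for row in rows:
            batch.add_object(properties=row)

    # Failed objects are recorded on the collection's batch attribute
    return len(collection.batch.failed_objects)
```

With the collection from this example, import_rows(movies, rows, batch_size=100) would request fixed batches of 100 objects, while import_rows(movies, rows) would fall back to the dynamic batcher.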

Add data to the batcher

Convert data types

Some fields arrive as strings and must be converted to the data types that the collection expects. For example, release_date is parsed into a timezone-aware datetime object, and genre_ids is parsed from a JSON array string into a list of integers.

        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie.release_date, "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie.genre_ids)
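The two conversions can be tried in isolation with only the standard library. The function names and sample values below are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime, timezone
import json


def parse_release_date(raw: str) -> datetime:
    """Parse a 'YYYY-MM-DD' string into a timezone-aware (UTC) datetime."""
    return datetime.strptime(raw, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def parse_genre_ids(raw) -> list[int]:
    """Parse a JSON array string such as '[16, 35]' into a list of integers.

    If the value is already a list, it is only normalized to integers.
    """
    values = json.loads(raw) if isinstance(raw, str) else raw
    return [int(v) for v in values]


# Illustrative inputs (hypothetical values)
print(parse_release_date("1995-11-22").isoformat())
print(parse_genre_ids("[16, 35, 10751]"))
```

Attaching the time zone explicitly avoids passing a naive datetime, whose interpretation would otherwise be left to the client.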

Add objects to the batcher

Then we loop through the data and add each object to the batcher with batch.add_object. The batcher sends each batch according to the selected batcher type.

Note that we also provide the vector for each object, taken from the row of the embeddings table at the same position as the movie.

        movie_obj = {
            "title": movie.title,
            "overview": movie.overview,
            "vote_average": movie.vote_average,
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie.id,
        }

        # Get the vector
        vector = emb_df.iloc[i].to_list()

        # Add object (including vector) to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie.id),
            vector=vector,  # Add the custom vector
            # references=reference_obj  # You can add references here
        )
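Because the vector is looked up by row position (emb_df.iloc[i]), the import quietly relies on the embeddings table having one row per movie, in the same order. A small pre-flight check along these lines can catch a count or dimension mismatch before anything is sent. The name check_vectors is hypothetical, and the check works on plain lists, so it cannot detect rows that are merely in a different order.

```python
def check_vectors(n_objects, vectors):
    """Verify there is one vector per object and that all share one dimension.

    `check_vectors` is a hypothetical helper for illustration only.
    Returns the common vector dimension (0 if there are no vectors).
    """
    if len(vectors) != n_objects:
        raise ValueError(
            f"Expected {n_objects} vectors, but found {len(vectors)}"
        )

    # Collect the distinct vector lengths; more than one means ragged data
    dims = {len(v) for v in vectors}
    if len(dims) > 1:
        raise ValueError(f"Vectors have inconsistent dimensions: {sorted(dims)}")

    return dims.pop() if dims else 0
```

With the data from this example, it could be called as check_vectors(len(df), emb_df.values.tolist()) before entering the batch context manager.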

Error handling

Because a batch includes multiple objects, it's possible that some objects will fail to import. The batcher saves these errors.

You can print out the errors to see what went wrong, and then decide how to handle them, such as by raising an exception. In this example, we simply print the number of failed objects.

if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()

Note that the list of errors is cleared when a new context manager is entered, so you must handle the errors before initializing a new batcher.
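If you would rather stop on failures than print a count, one option is a small guard like the sketch below. BatchImportError and raise_on_failures are hypothetical names, and the helper only assumes that the failed objects form a list; it shows each failed entry with repr() rather than relying on any particular attributes.

```python
class BatchImportError(RuntimeError):
    """Raised when more objects failed to import than the caller tolerates."""


def raise_on_failures(failed_objects, max_failures=0, preview=3):
    """Raise BatchImportError if len(failed_objects) exceeds max_failures.

    Hypothetical helper for illustration only.
    Returns the number of failures when it is within the tolerated limit.
    """
    n_failed = len(failed_objects)
    if n_failed > max_failures:
        # Include the first few failures in the message to aid debugging
        shown = failed_objects[:preview]
        examples = "; ".join(repr(obj) for obj in shown)
        raise BatchImportError(
            f"{n_failed} objects failed to import "
            f"(showing {len(shown)}: {examples})"
        )
    return n_failed
```

In this example, it would be called as raise_on_failures(movies.batch.failed_objects) right after the with block, before another batch context manager is entered and the list is cleared.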

Questions and feedback

If you have any questions or feedback, please let us know on our forum. For example, you can: