Import data

Code

This example imports the movie data into our collection.

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
data_resp = requests.get(data_url)
data_resp.raise_for_status()  # Stop early if the download failed
df = pd.DataFrame(data_resp.json())

# Load the embeddings generated in the previous step
embs_path = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024_embeddings.csv"
# Or load embeddings from a local file (if you generated them earlier)
# embs_path = "scratch/movies_data_1990_2024_embeddings.csv"

emb_df = pd.read_csv(embs_path)

# Get the collection
movies = client.collections.get("MovieCustomVector")

# Enter context manager
with movies.batch.fixed_size(batch_size=200) as batch:
    # Loop through the data
    for i, movie in enumerate(df.itertuples(index=False)):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie.release_date, "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie.genre_ids)

        # Build the object payload
        movie_obj = {
            "title": movie.title,
            "overview": movie.overview,
            "vote_average": movie.vote_average,
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie.id,
        }

        # Get the vector
        vector = emb_df.iloc[i].to_list()

        # Add object (including vector) to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie.id),
            vector=vector,  # Add the custom vector
            # references=reference_obj  # You can add references here
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()

The code:

Loads the source data and the precomputed embeddings, then gets the collection
Enters a context manager with a batcher (batch) object
Loops through the data and adds each object and its corresponding vector to the batcher
Prints the number of objects that failed to import, if any

Explain the code

Preparation

We use the requests library to load the data from the source, in this case a JSON file, and convert it to a Pandas DataFrame for easier manipulation. The precomputed embeddings are read from a CSV file into a second DataFrame with pd.read_csv.

Then, we create a collection object (with client.collections.get) so we can interact with the collection.
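
Because each vector is later looked up by row position (emb_df.iloc[i]), it can help to confirm before importing that the records and vectors line up. The following is a minimal sketch under that assumption; check_alignment is a hypothetical helper, not part of Weaviate or Pandas, and plain lists stand in for the two DataFrames.

```python
from typing import Sequence


def check_alignment(records: Sequence, vectors: Sequence[Sequence[float]]) -> int:
    """Verify there is one vector per record and that all vectors share a length.

    Returns the common vector length.
    """
    if len(records) != len(vectors):
        raise ValueError(f"{len(records)} records but {len(vectors)} vectors")
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f"Inconsistent vector lengths: {sorted(dims)}")
    return dims.pop()


# Tiny stand-in data; the real inputs would come from df and emb_df
records = [{"title": "A"}, {"title": "B"}]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_alignment(records, vectors))  # 3
```

With the DataFrames from this example, one way to call it would be check_alignment(df.to_dict("records"), emb_df.values.tolist()).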

Batch context manager

The batch object is a context manager: objects you add inside the with block are queued and sent to Weaviate in batches. This is useful when you have a large amount of data to import, as it abstracts away the complexity of managing the batch size and deciding when to send each batch.

with movies.batch.fixed_size(batch_size=200) as batch:

This example uses the .fixed_size() method to create a batcher which sets the number of objects per batch. There are also other batcher types, like .rate_limit() for specifying the number of objects per minute and .dynamic() to create a dynamic batcher, which automatically determines and updates the batch size during the import process.
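
To make the idea of a fixed-size batcher concrete, here is a minimal pure-Python sketch of the buffering behavior. It is not Weaviate's implementation: FixedSizeBuffer and its send callback are hypothetical names used only for illustration, and the real batcher also takes care of network requests and error tracking.

```python
from typing import Any, Callable, List


class FixedSizeBuffer:
    """Hypothetical stand-in that mimics the buffering of a fixed-size batcher."""

    def __init__(self, batch_size: int, send: Callable[[List[Any]], None]) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.send = send
        self._queue: List[Any] = []

    def __enter__(self) -> "FixedSizeBuffer":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Flush whatever is left when the `with` block ends
        self.flush()

    def add_object(self, obj: Any) -> None:
        # Queue the object and send a batch as soon as the queue is full
        self._queue.append(obj)
        if len(self._queue) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self._queue:
            self.send(self._queue)
            self._queue = []


sent_sizes: List[int] = []
with FixedSizeBuffer(batch_size=2, send=lambda b: sent_sizes.append(len(b))) as buf:
    for n in range(5):
        buf.add_object({"n": n})

print(sent_sizes)  # [2, 2, 1]
```

Five objects with a batch size of 2 result in two full batches and one final partial batch, which is sent when the with block exits.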

Add data to the batcher

Convert data types

Some fields arrive as strings and must be converted to the data types that Weaviate expects. For example, release_date is parsed into a timezone-aware datetime object (UTC), and genre_ids is parsed from a JSON array string into a list of integers.

        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie.release_date, "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie.genre_ids)
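
The two conversions can be tried in isolation. This is a small sketch using made-up sample values rather than rows from the dataset; convert_movie_fields is a hypothetical helper name that wraps the same two calls used in the import loop.

```python
import json
from datetime import datetime, timezone
from typing import List, Tuple


def convert_movie_fields(release_date_str: str, genre_ids_str: str) -> Tuple[datetime, List[int]]:
    """Parse a YYYY-MM-DD date as UTC and a JSON array string as a Python list."""
    release_date = datetime.strptime(release_date_str, "%Y-%m-%d").replace(
        tzinfo=timezone.utc
    )
    genre_ids = json.loads(genre_ids_str)
    return release_date, genre_ids


# Sample values for illustration only
release_date, genre_ids = convert_movie_fields("1999-03-31", "[28, 878]")
print(release_date.isoformat())  # 1999-03-31T00:00:00+00:00
print(genre_ids)  # [28, 878]
```

Note that strptime raises a ValueError for an empty or malformed date string, so records with missing dates would need to be filtered or handled before this step.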

Add objects to the batcher

Then we loop through the data and add each object to the batcher. The batch.add_object method is used to add the object to the batcher, and the batcher will send the batch according to the specified batcher type.

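Note that we also provide the vector for each object. The vector is read from the row of emb_df at the same position i as the movie in df, so the two DataFrames must contain the same number of rows in the same order.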

        movie_obj = {
            "title": movie.title,
            "overview": movie.overview,
            "vote_average": movie.vote_average,
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie.id,
        }

        # Get the vector
        vector = emb_df.iloc[i].to_list()

        # Add object (including vector) to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie.id),
            vector=vector,  # Add the custom vector
            # references=reference_obj  # You can add references here
        )
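
The uuid argument is derived from movie.id with generate_uuid5, so the same movie always maps to the same object ID. The sketch below illustrates that property with the standard library's uuid.uuid5 as a stand-in. stable_id is a hypothetical helper, and the namespace used here is an assumption, so the resulting values are not expected to match those produced by generate_uuid5.

```python
import uuid


def stable_id(identifier) -> str:
    """Hypothetical helper: derive a deterministic UUID from a source ID."""
    # Assumption: NAMESPACE_DNS is used only for illustration
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(identifier)))


first = stable_id(603)
second = stable_id(603)
other = stable_id(604)

print(first == second)  # True
print(first == other)   # False
```

Deterministic IDs mean that a repeated import should address the same objects again instead of generating new random IDs for them.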

Error handling

Because a batch includes multiple objects, it's possible that some objects will fail to import. The batcher saves these errors in movies.batch.failed_objects.

You can inspect the errors to see what went wrong, and then decide how to handle them, such as by raising an exception. In this example, we simply print the number of objects that failed to import.

if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()

Note that the list of errors is cleared when a new context manager is entered, so you must handle the errors before initializing a new batcher.
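
If you would rather stop on a partial import than print a count, the check can be wrapped in a small helper. This is a sketch, not part of the Weaviate client: summarize_failures and raise_on_failures are hypothetical names, and the code assumes each error entry exposes a message attribute, falling back to repr() if it does not.

```python
from typing import Any, List, Sequence


def summarize_failures(failed_objects: Sequence[Any], max_shown: int = 5) -> List[str]:
    """Return one short line per failed object, capped at `max_shown`."""
    lines = []
    for err in failed_objects[:max_shown]:
        # Assumption: error entries expose a `message` attribute; otherwise use repr
        lines.append(str(getattr(err, "message", repr(err))))
    return lines


def raise_on_failures(failed_objects: Sequence[Any]) -> None:
    """Raise instead of printing, so a partial import cannot go unnoticed."""
    if failed_objects:
        details = "; ".join(summarize_failures(failed_objects))
        raise RuntimeError(f"Failed to import {len(failed_objects)} objects: {details}")


class FakeError:
    """Stand-in for an error entry, used only to exercise the helpers."""

    def __init__(self, message: str) -> None:
        self.message = message


failures = [FakeError("invalid date"), FakeError("vector length mismatch")]
print(summarize_failures(failures))
```

In the import script, this would be called as raise_on_failures(movies.batch.failed_objects) after the with block and before a new batcher is created.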

Questions and feedback

If you have any questions or feedback, let us know in the user forum.
