
Import data

Code

This example imports the movie data into our collection.

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os
import zipfile
from pathlib import Path
import base64

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_local()

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Create a directory for the images
img_dir = Path("scratch/imgs")
img_dir.mkdir(parents=True, exist_ok=True)

# Download images
posters_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024_posters.zip"
posters_path = img_dir / "movies_data_1990_2024_posters.zip"
posters_path.write_bytes(requests.get(posters_url).content)

# Unzip the images
with zipfile.ZipFile(posters_path, 'r') as zip_ref:
    zip_ref.extractall(img_dir)

# Get the collection
movies = client.collections.get("MovieMM")

# Enter context manager
with movies.batch.fixed_size(50) as batch:
    # Loop through the data
    for i, movie in tqdm(df.iterrows()):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])
        # Convert image to base64
        img_path = img_dir / f"{movie['id']}_poster.jpg"
        with open(img_path, "rb") as file:
            poster_b64 = base64.b64encode(file.read()).decode("utf-8")

        # Build the object payload
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
            "poster": poster_b64,
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
            # references=reference_obj  # You can add references here
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")
    for failed in movies.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")

client.close()

The code:

  • Loads the source text and image data
  • Gets the collection
  • Enters a context manager with a batcher (batch) object
  • Loops through the data and:
    • Finds the image corresponding to the text data
    • Converts the image to base64
    • Adds objects to the batcher
  • Prints out any import errors

Explain the code

Preparation

We use the requests library to load the data from the source: in this case, a JSON file containing the text data and a ZIP file containing the posters. The text data is converted to a Pandas DataFrame for easier manipulation, and the images are extracted from the ZIP file.
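Before importing, you could optionally run a quick sanity check on the downloaded data. This sketch is purely illustrative and only uses the df DataFrame and img_dir directory defined in the code above:

# Illustrative sanity checks on the downloaded data (not part of the import itself)
print(df.shape)                            # (number of movies, number of columns)
print(df.columns.tolist())                 # includes e.g. "title", "overview", "release_date"
print(len(list(img_dir.glob("*.jpg"))))    # number of extracted poster images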

Then, we create a collection object (with client.collections.get) so we can interact with the collection.
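If you want to fail fast when the collection is missing, a minimal sketch before the import could check for it explicitly (assuming the MovieMM collection name used in this example):

# A sketch: check that the target collection exists before importing
if not client.collections.exists("MovieMM"):
    raise RuntimeError("Collection 'MovieMM' not found; create it before importing data")

movies = client.collections.get("MovieMM")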

Batch context manager

The batch object is a context manager that allows you to add objects to the batcher. This is useful when you have a large amount of data to import, as it abstracts away the complexity of managing the batch size and when to send the batch.

with movies.batch.fixed_size(50) as batch:

This example uses the .fixed_size() method to create a batcher that sends objects in batches of a fixed size (here, 50 objects per batch). There are also other batcher types, like .dynamic(), which automatically determines and updates the batch size during the import process, and .rate_limit() for specifying the number of objects per minute.
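As a sketch, the alternative batcher types are configured in the same way; the parameter values below are illustrative only:

# Illustrative batcher configurations (parameter values are examples only)
with movies.batch.dynamic() as batch:  # batch size is determined automatically
    ...

with movies.batch.fixed_size(batch_size=50) as batch:  # fixed number of objects per batch
    ...

with movies.batch.rate_limit(requests_per_minute=600) as batch:  # throttle object creation
    ...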

Add data to the batcher

Convert data types

The data is converted from strings to the correct data types for Weaviate. For example, the release_date is converted to a datetime object, and the genre_ids are converted to a list of integers.

        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])

To save the image data as a BLOB (binary large object) data type, we convert the image to base64.

        img_path = img_dir / f"{movie['id']}_poster.jpg"
        with open(img_path, "rb") as file:
            poster_b64 = base64.b64encode(file.read()).decode("utf-8")

Add objects to the batcher

We loop through the data and build an object payload for each movie, then add it to the batcher with the batch.add_object method. The batcher sends batches to Weaviate according to the configured batcher type.

        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
            "poster": poster_b64,
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
            # references=reference_obj  # You can add references here
        )

Error handling

Because a batch includes multiple objects, it's possible that some objects will fail to import. The batcher saves these errors.

You can print out the errors to see what went wrong, and then decide how to handle them, such as by raising an exception. In this example, we simply print out the errors.

if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")
    for failed in movies.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")

client.close()

Note that the list of errors is cleared when a new context manager is entered, so you must handle the errors before initializing a new batcher.
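For example, a minimal sketch of handling the errors before re-entering the batcher might look like this; whether you raise, log, or retry is up to you:

# A sketch: snapshot the failed objects before starting another batch,
# because the error list is reset when a new batch context manager is entered
failed_objects = list(movies.batch.failed_objects)
if failed_objects:
    # Here we raise, but you could also log the errors or retry the failed objects
    raise RuntimeError(
        f"{len(failed_objects)} objects failed to import, "
        f"e.g.: {failed_objects[0].message}"
    )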

Where do the vectors come from?

When the batcher sends the queue to Weaviate, the objects are added to the collection (in our case, the MovieMM collection).

Recall that the collection has a vectorizer module, and we do not specify vectors here. So Weaviate uses the specified vectorizer to generate vector embeddings from the data.
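If you want to confirm that embeddings were created, a minimal sketch is to fetch one object together with its vector. This assumes the MovieMM collection and an open client connection; the key in obj.vector may differ if your collection is configured with named vectors:

# A sketch: fetch one object with its vector to confirm embeddings were generated
movies = client.collections.get("MovieMM")
response = movies.query.fetch_objects(limit=1, include_vector=True)

for obj in response.objects:
    print(obj.properties["title"])
    print(len(obj.vector["default"]))  # length of the embedding; key assumed to be "default"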

Questions and feedback

If you have any questions or feedback, let us know in the user forum.