
Import data

Code

This example imports the movie data into our collection.

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os
import zipfile
from pathlib import Path
import base64

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_local()

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Create a directory for the images
img_dir = Path("scratch/imgs")
img_dir.mkdir(parents=True, exist_ok=True)

# Download images
posters_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024_posters.zip"
posters_path = img_dir / "movies_data_1990_2024_posters.zip"
posters_path.write_bytes(requests.get(posters_url).content)

# Unzip the images
with zipfile.ZipFile(posters_path, 'r') as zip_ref:
    zip_ref.extractall(img_dir)

# Get the collection
movies = client.collections.get("MovieMM")

# Enter context manager
with movies.batch.fixed_size(50) as batch:
    # Loop through the data
    for i, movie in tqdm(df.iterrows()):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])
        # Convert image to base64
        img_path = img_dir / f"{movie['id']}_poster.jpg"
        with open(img_path, "rb") as file:
            poster_b64 = base64.b64encode(file.read()).decode("utf-8")

        # Build the object payload
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
            "poster": poster_b64,
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
            # references=reference_obj  # You can add references here
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")
    for failed in movies.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")

client.close()

The code:

  • Loads the source text and image data
  • Gets the collection
  • Enters a context manager with a batcher (batch) object
  • Loops through the data and:
    • Finds the image corresponding to the text data
    • Converts the image to base64
    • Adds objects to the batcher
  • Prints out any import errors

Explain the code

Preparation

We use the requests library to load the data from the source: in this case, a JSON file containing the text data and a ZIP file containing the posters. The text data is converted to a Pandas DataFrame for easier manipulation, and the images are extracted from the ZIP file.
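Before importing, you could optionally run a quick sanity check on the downloaded data. This sketch is purely illustrative and only uses the df DataFrame and img_dir directory defined in the code above:

# Illustrative sanity checks on the downloaded data (not part of the import itself)
print(df.shape)                            # (number of movies, number of columns)
print(df.columns.tolist())                 # includes e.g. "title", "overview", "release_date"
print(len(list(img_dir.glob("*.jpg"))))    # number of extracted poster images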

Then, we create a collection object (with client.collections.get) so we can interact with the collection.
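If you want to fail fast when the collection is missing, a minimal sketch before the import could check for it explicitly (assuming the MovieMM collection name used in this example):

# A sketch: check that the target collection exists before importing
if not client.collections.exists("MovieMM"):
    raise RuntimeError("Collection 'MovieMM' not found; create it before importing data")

movies = client.collections.get("MovieMM")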

Batch context manager

The batch object is a context manager that allows you to add objects to the batcher. This is useful when you have a large amount of data to import, as it abstracts away the complexity of managing the batch size and when to send the batch.

with movies.batch.fixed_size(50) as batch:

This example uses the .fixed_size() method to create a batcher that sends objects in batches of a fixed size (here, 50 objects per batch). There are also other batcher types, like .dynamic(), which automatically determines and updates the batch size during the import process, and .rate_limit() for specifying the number of objects per minute.
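As a sketch, the alternative batcher types are configured in the same way; the parameter values below are illustrative only:

# Illustrative batcher configurations (parameter values are examples only)
with movies.batch.dynamic() as batch:  # batch size is determined automatically
    ...

with movies.batch.fixed_size(batch_size=50) as batch:  # fixed number of objects per batch
    ...

with movies.batch.rate_limit(requests_per_minute=600) as batch:  # throttle object creation
    ...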

Add data to the batcher

Convert data types

The data is converted from strings to the correct data types for Weaviate. For example, the release_date is converted to a datetime object, and the genre_ids are converted to a list of integers.

        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])

To save the image data as a BLOB (binary large object) data type, we convert the image to base64.

        img_path = img_dir / f"{movie['id']}_poster.jpg"
        with open(img_path, "rb") as file:
            poster_b64 = base64.b64encode(file.read()).decode("utf-8")

Add objects to the batcher

We loop through the data and build an object payload for each movie, then add it to the batcher with the batch.add_object method. The batcher sends batches to Weaviate according to the configured batcher type.

        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
            "poster": poster_b64,
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
            # references=reference_obj  # You can add references here
        )

Error handling

Because a batch includes multiple objects, it's possible that some objects will fail to import. The batcher saves these errors.

You can print out the errors to see what went wrong, and then decide how to handle them, such as by raising an exception. In this example, we simply print out the errors.

if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")
    for failed in movies.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")

client.close()

Note that the list of errors is cleared when a new context manager is entered, so you must handle the errors before initializing a new batcher.
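For example, a minimal sketch of handling the errors before re-entering the batcher might look like this; whether you raise, log, or retry is up to you:

# A sketch: snapshot the failed objects before starting another batch,
# because the error list is reset when a new batch context manager is entered
failed_objects = list(movies.batch.failed_objects)
if failed_objects:
    # Here we raise, but you could also log the errors or retry the failed objects
    raise RuntimeError(
        f"{len(failed_objects)} objects failed to import, "
        f"e.g.: {failed_objects[0].message}"
    )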

Where do the vectors come from?

When the batcher sends the queue to Weaviate, the objects are added to the collection (in our case, the MovieMM collection).

Recall that the collection has a vectorizer module, and we do not specify vectors here. So Weaviate uses the specified vectorizer to generate vector embeddings from the data.
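If you want to confirm that embeddings were created, a minimal sketch is to fetch one object together with its vector. This assumes the MovieMM collection and an open client connection; the key in obj.vector may differ if your collection is configured with named vectors:

# A sketch: fetch one object with its vector to confirm embeddings were generated
movies = client.collections.get("MovieMM")
response = movies.query.fetch_objects(limit=1, include_vector=True)

for obj in response.objects:
    print(obj.properties["title"])
    print(len(obj.vector["default"]))  # length of the embedding; key assumed to be "default"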

Questions and feedback

If you have any questions or feedback, let us know in the user forum.