
(Bonus) Generate vectors

Since we are using custom vectors, we need to generate them ourselves.

This step is optional, as the next section shows you how to download and use the pre-generated vectors. But if you are interested in how to generate vectors, read on.

Code

This example creates embeddings for the movie dataset:

import os

import requests
import pandas as pd


# Define a function to call the endpoint and obtain embeddings
def query(texts):
    model_id = "sentence-transformers/all-MiniLM-L6-v2"
    hf_token = os.getenv("HUGGINGFACE_APIKEY")

    api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
    headers = {"Authorization": f"Bearer {hf_token}"}

    response = requests.post(
        api_url,
        headers=headers,
        json={"inputs": texts, "options": {"wait_for_model": True}},
    )
    return response.json()


# Get the source data
data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Loop through the dataset to generate vectors in batches
emb_dfs = list()
src_texts = list()
for i, row in enumerate(df.itertuples(index=False)):
    # Concatenate text to create a source string
    src_text = "Title: " + row.title + "; Overview: " + row.overview
    # Add to the buffer
    src_texts.append(src_text)
    if (len(src_texts) == 50) or (i + 1 == len(df)):  # Get embeddings in batches of 50
        # Get a batch of embeddings
        output = query(src_texts)
        emb_df = pd.DataFrame(output)
        # Add the batch of embeddings to a list
        emb_dfs.append(emb_df)
        # Reset the buffer
        src_texts = list()


emb_df = pd.concat(emb_dfs)  # Create a combined dataset

# Save the data as a CSV
os.makedirs("scratch", exist_ok=True)  # Create a folder if it doesn't exist
emb_df.to_csv(
    "scratch/movies_data_1990_2024_embeddings.csv",
    index=False,
)

This will generate a vector for each movie in the dataset, which we can use when adding the movies to Weaviate.
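
As a quick sanity check, you can confirm that the script produced one vector per movie, and that each vector has the 384 dimensions that all-MiniLM-L6-v2 outputs. A minimal sketch, to be run at the end of the script above:

assert len(emb_df) == len(df)  # One embedding per movie
assert emb_df.shape[1] == 384  # all-MiniLM-L6-v2 produces 384-dimensional vectors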

Explain the code

Model

We use the sentence-transformers/all-MiniLM-L6-v2 model to generate the vectors. We access it here through the Hugging Face API for convenience. You could also perform the generation locally, for example with the transformers library.
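
If you prefer local generation, here is a minimal sketch using the sentence-transformers package (which wraps transformers); it assumes the package is installed (e.g. pip install sentence-transformers):

from sentence_transformers import SentenceTransformer

# Load the same model locally (downloads the weights on first use)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# encode() takes a list of strings and returns one vector per string
vectors = model.encode(["Title: ...; Overview: ..."])
print(vectors.shape)  # (1, 384)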

Source text

We combine the movie title and overview to create a source string for the model. This is the text that the model will "translate" into a vector.

    src_text = "Title: " + row.title + "; Overview: " + row.overview

Get embeddings in batches

We use a buffer to store the concatenated strings, and then get the embeddings in batches of 50. This is good practice, as it limits the number of requests to the endpoint and helps to avoid timeouts.

        output = query(src_texts)
        emb_df = pd.DataFrame(output)
        # Add the batch of embeddings to a list
        emb_dfs.append(emb_df)

Export the embeddings

The embeddings are then saved to a file so that we can use them when adding the movies to Weaviate.

os.makedirs("scratch", exist_ok=True)  # Create a folder if it doesn't exist
emb_df.to_csv(
    "scratch/movies_data_1990_2024_embeddings.csv",
    index=False,
)
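
Later, you can read the file back into memory to retrieve the vectors. A minimal sketch, assuming the CSV was saved as above:

import pandas as pd

emb_df = pd.read_csv("scratch/movies_data_1990_2024_embeddings.csv")
vectors = emb_df.values.tolist()  # One 384-dimensional vector (list of floats) per movie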

Questions and feedback

If you have any questions or feedback, let us know in the user forum.