(Bonus) Generate vectors
Since we are using custom vectors, we need to generate them ourselves.
This step is optional, as the next section shows you how to download and use the pre-generated vectors. But if you are interested in how to generate vectors, read on.
Code
This example creates embeddings for the movie dataset:
import requests
import pandas as pd
import os
from typing import List
import cohere
from cohere import Client as CohereClient
co_token = os.getenv("COHERE_APIKEY")
co = cohere.Client(co_token)
# Define a function to call the endpoint and obtain embeddings
def vectorize(cohere_client: CohereClient, texts: List[str]) -> List[List[float]]:
    response = cohere_client.embed(
        texts=texts, model="embed-multilingual-v3.0", input_type="search_document"
    )
    return response.embeddings
# Get the source data
data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())
# Loop through the dataset to generate vectors in batches
emb_dfs = list()
src_texts = list()
for i, row in enumerate(df.itertuples(index=False)):
    # Concatenate text to create a source string
    src_text = "Title" + row.title + "; Overview: " + row.overview
    # Add to the buffer
    src_texts.append(src_text)
    if (len(src_texts) == 50) or (i + 1 == len(df)):  # Get embeddings in batches of 50
        # Get a batch of embeddings
        output = vectorize(co, src_texts)
        index = list(range(i - len(src_texts) + 1, i + 1))
        emb_df = pd.DataFrame(output, index=index)
        # Add the batch of embeddings to a list
        emb_dfs.append(emb_df)
        # Reset the buffer
        src_texts = list()
emb_df = pd.concat(emb_dfs) # Create a combined dataset
# Save the data as a CSV
os.makedirs("scratch", exist_ok=True) # Create a folder if it doesn't exist
emb_df.to_csv(
    "scratch/movies_data_1990_2024_embeddings.csv",
    index=False,
)
This will generate a vector for each movie in the dataset, which we can use when adding the movies to Weaviate.
Explain the code
Model
We use the embed-multilingual-v3.0 Cohere model to generate the vector embeddings. You could also use the transformers library if you would like to perform the generation locally.
Source text
We combine the movie title and overview to create a source string for the model. This is the text that the model will "translate" into a vector.
src_text = "Title" + row.title + "; Overview: " + row.overview
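To see exactly what string the model receives, here is a minimal sketch that applies the same concatenation to a single made-up record. The `Movie` tuple, the `build_source_text` helper, and the sample values are hypothetical and are not part of the tutorial code. Treating a missing overview as an empty string is also an assumption added for illustration.

```python
from typing import NamedTuple, Optional


class Movie(NamedTuple):
    """Hypothetical stand-in for one row of the movie dataset."""

    title: str
    overview: Optional[str]


def build_source_text(movie: Movie) -> str:
    # Same concatenation as the line above. A missing overview is treated
    # as an empty string (an assumption, not behavior from the tutorial).
    return "Title" + movie.title + "; Overview: " + (movie.overview or "")


example = Movie(title="Example Movie", overview="A made-up plot summary.")
source_text = build_source_text(example)
print(source_text)
```

Printing the string for a few rows before calling the embedding endpoint is a cheap way to confirm that the text going into the model looks the way you expect.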
Get embeddings in batches
We use a buffer to store the concatenated strings, and then get the embeddings in batches of 50. Batching reduces the number of requests sent to the model and helps avoid timeouts.
output = vectorize(co, src_texts)
index = list(range(i - len(src_texts) + 1, i + 1))
emb_df = pd.DataFrame(output, index=index)
# Add the batch of embeddings to a list
emb_dfs.append(emb_df)
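The buffer-and-flush pattern can be tried without an API key. The sketch below reproduces the same loop logic with a stub in place of the Cohere call, and records which row range each batch covers. `fake_vectorize` and `embed_with_buffer` are hypothetical names used only for this illustration, and the one-dimensional "vectors" are placeholders, not real embeddings.

```python
from typing import Callable, List, Tuple

EmbedFn = Callable[[List[str]], List[List[float]]]


def fake_vectorize(texts: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for the embedding endpoint:
    # returns one tiny placeholder "vector" per input text.
    return [[float(len(text))] for text in texts]


def embed_with_buffer(
    texts: List[str], embed_fn: EmbedFn, batch_size: int = 50
) -> Tuple[List[List[float]], List[Tuple[int, int]]]:
    vectors: List[List[float]] = []
    row_ranges: List[Tuple[int, int]] = []
    buffer: List[str] = []
    for i, text in enumerate(texts):
        buffer.append(text)
        # Flush when the buffer is full or the last row has been reached
        if len(buffer) == batch_size or i + 1 == len(texts):
            vectors.extend(embed_fn(buffer))
            # Same index arithmetic as the tutorial: rows [start, end)
            row_ranges.append((i - len(buffer) + 1, i + 1))
            buffer = []
    return vectors, row_ranges


texts = [f"movie {n}" for n in range(120)]
vectors, row_ranges = embed_with_buffer(texts, fake_vectorize)
print(row_ranges)  # [(0, 50), (50, 100), (100, 120)]
```

The final, shorter batch is flushed by the `i + 1 == len(texts)` check, so no rows are left behind when the dataset size is not a multiple of the batch size.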
Export the embeddings
The embeddings are then saved to a CSV file so that we can use them when adding the movies to Weaviate.
os.makedirs("scratch", exist_ok=True) # Create a folder if it doesn't exist
emb_df.to_csv(
    "scratch/movies_data_1990_2024_embeddings.csv",
    index=False,
)
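Because the file is written with `index=False`, each row's position is the only link back to the corresponding movie. As a rough sketch of reading such a file back, the example below uses only the standard library and assumes the layout that `to_csv` produces here: one header row of column labels, then one row of floats per movie in dataset order. The `load_embeddings` helper and the tiny demo file are hypothetical.

```python
import csv
import os
import tempfile
from typing import List


def load_embeddings(csv_path: str) -> List[List[float]]:
    """Read embeddings from a CSV with one header row and one vector per row."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # Skip the header row of column labels
        return [[float(value) for value in row] for row in reader]


# Write a tiny demo file in the assumed layout, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_embeddings.csv")
    with open(demo_path, "w", newline="") as f:
        f.write("0,1,2\n0.1,0.2,0.3\n0.4,0.5,0.6\n")
    loaded = load_embeddings(demo_path)

print(len(loaded), len(loaded[0]))  # 2 3
```

When pairing vectors with movies later, row `n` of the embeddings file corresponds to row `n` of the source dataset, so both should be read in the same order.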