
(Bonus) Generate vectors

Since we are using custom vectors, we need to generate them ourselves.

This step is optional, as the next section shows you how to download and use the pre-generated vectors. But if you are interested in how to generate vectors, read on.

Code

This example creates embeddings for the movie dataset:

import os

import requests
import pandas as pd


# Define a function to call the endpoint and obtain embeddings
def query(texts):
    model_id = "sentence-transformers/all-MiniLM-L6-v2"
    hf_token = os.getenv("HUGGINGFACE_APIKEY")

    api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
    headers = {"Authorization": f"Bearer {hf_token}"}

    response = requests.post(
        api_url,
        headers=headers,
        json={"inputs": texts, "options": {"wait_for_model": True}},
    )
    return response.json()


# Get the source data
data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Loop through the dataset to generate vectors in batches
emb_dfs = list()
src_texts = list()
for i, row in enumerate(df.itertuples(index=False)):
    # Concatenate text to create a source string
    src_text = "Title: " + row.title + "; Overview: " + row.overview
    # Add to the buffer
    src_texts.append(src_text)
    if (len(src_texts) == 50) or (i + 1 == len(df)):  # Get embeddings in batches of 50
        # Get a batch of embeddings
        output = query(src_texts)
        emb_df = pd.DataFrame(output)
        # Add the batch of embeddings to a list
        emb_dfs.append(emb_df)
        # Reset the buffer
        src_texts = list()


emb_df = pd.concat(emb_dfs)  # Create a combined dataset

# Save the data as a CSV
os.makedirs("scratch", exist_ok=True)  # Create a folder if it doesn't exist
emb_df.to_csv(
    "scratch/movies_data_1990_2024_embeddings.csv",
    index=False,
)

This will generate a vector for each movie in the dataset, which we can use when adding the movies to Weaviate.
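
As a quick sanity check, you can confirm that the script produced one vector per movie, and that each vector has the 384 dimensions that all-MiniLM-L6-v2 outputs. A minimal sketch, to be run at the end of the script above:

assert len(emb_df) == len(df)  # One embedding per movie
assert emb_df.shape[1] == 384  # all-MiniLM-L6-v2 produces 384-dimensional vectors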

Explain the code

Model

We use the sentence-transformers/all-MiniLM-L6-v2 model to generate the vectors. We access it here through the Hugging Face API for convenience. You could also perform the generation locally, for example with the transformers library.
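
If you prefer local generation, here is a minimal sketch using the sentence-transformers package (which wraps transformers); it assumes the package is installed (e.g. pip install sentence-transformers):

from sentence_transformers import SentenceTransformer

# Load the same model locally (downloads the weights on first use)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# encode() takes a list of strings and returns one vector per string
vectors = model.encode(["Title: ...; Overview: ..."])
print(vectors.shape)  # (1, 384)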

Source text

We combine the movie title and overview to create a source string for the model. This is the text that the model will "translate" into a vector.

    src_text = "Title: " + row.title + "; Overview: " + row.overview

Get embeddings in batches

We use a buffer to store the concatenated strings, and then get the embeddings in batches of 50. This is good practice, as it limits the number of requests to the endpoint and helps to avoid timeouts.

        output = query(src_texts)
        emb_df = pd.DataFrame(output)
        # Add the batch of embeddings to a list
        emb_dfs.append(emb_df)

Export the embeddings

The embeddings are then saved to a file so that we can use them when adding the movies to Weaviate.

os.makedirs("scratch", exist_ok=True)  # Create a folder if it doesn't exist
emb_df.to_csv(
    "scratch/movies_data_1990_2024_embeddings.csv",
    index=False,
)
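
Later, you can read the file back into memory to retrieve the vectors. A minimal sketch, assuming the CSV was saved as above:

import pandas as pd

emb_df = pd.read_csv("scratch/movies_data_1990_2024_embeddings.csv")
vectors = emb_df.values.tolist()  # One 384-dimensional vector (list of floats) per movie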

Questions and feedback

If you have any questions or feedback, let us know in the user forum.