Importing custom vectors
Overview
This tutorial will show you how to import a large dataset (25k articles from Wikipedia) that already includes vectors (embeddings generated by OpenAI). We will:
- download and unzip a CSV file that contains the Wikipedia articles
- create a Weaviate instance
- create a schema
- parse the file and batch import the records, with Python and JavaScript code
- make sure the data was imported correctly
- run a few queries to demonstrate semantic search capabilities
Prerequisites
If you haven't yet, we recommend going through the Quickstart tutorial first to get the most out of this section.
Before you start this tutorial, make sure to have:
- An OpenAI API key. Even though we already have vector embeddings generated by OpenAI, we'll need an OpenAI key to vectorize search queries, and to recalculate vector embeddings for updated object contents.
- Your preferred Weaviate client library installed.
See how to delete data from previous tutorials (or previous runs of this tutorial).
If your Weaviate instance contains data you want removed, you can manually delete the unwanted class(es).
Note that deleting a class will also delete all associated objects!
Do not do this to a production database, or anywhere where you do not wish to delete your data.
Run the code below to delete the relevant class and its objects.
- Python
- TypeScript
- Go
- Curl
# delete class "YourClassName" - THIS WILL DELETE ALL DATA IN THIS CLASS
client.schema.delete_class("YourClassName") # Replace with your class name - e.g. "Question"
var className: string = 'YourClassName'; // Replace with your class name
client.schema
.classDeleter()
.withClassName(className)
.do()
.then((res: any) => {
console.log(res);
})
.catch((err: Error) => {
console.error(err)
});
className := "YourClassName"
// delete the class
if err := client.Schema().ClassDeleter().WithClassName(className).Do(context.Background()); err != nil {
// Weaviate will return a 400 if the class does not exist, so this is allowed, only return an error if it's not a 400
if status, ok := err.(*fault.WeaviateClientError); ok && status.StatusCode != http.StatusBadRequest {
panic(err)
}
}
curl \
-X DELETE \
https://some-endpoint.weaviate.network/v1/schema/YourClassName
Download the dataset
We will use this Simple English Wikipedia dataset hosted by OpenAI (~700MB zipped, 1.7GB CSV file) that includes vector embeddings. These are the columns of interest, where content_vector is a vector embedding with 1536 elements (dimensions), generated using OpenAI's text-embedding-ada-002 model:
| id | url | title | text | content_vector |
|---|---|---|---|---|
| 1 | https://simple | April | "April is the fourth month of the year..." | [-0.011034, -0.013401, ..., -0.009095] |
If you haven't already, make sure to download the dataset and unzip the file. You should end up with vector_database_wikipedia_articles_embedded.csv in your working directory. The records are mostly (but not strictly) sorted by title.
Download Wikipedia dataset ZIP
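Before importing, it can help to confirm that a row's content_vector cell parses into a list of floats of the expected length. The sketch below is one way to do that using only the standard library. The helper names parse_vector and first_row_dimensions, the example.org URL, and the shortened three-element vector in the inline sample are illustrative assumptions, not part of the dataset or of Weaviate.

```python
import ast
import csv
import io

EXPECTED_DIMENSIONS = 1536  # vector size stated in the dataset description above


def parse_vector(cell: str) -> list[float]:
    """Convert a CSV cell such as "[-0.011, 0.013]" into a list of floats."""
    values = ast.literal_eval(cell)
    if not isinstance(values, list):
        raise ValueError("content_vector cell is not a list")
    return [float(v) for v in values]


def first_row_dimensions(csv_file) -> int:
    """Return the vector length found in the first data row of an open CSV file."""
    reader = csv.DictReader(csv_file)
    first_row = next(reader)
    return len(parse_vector(first_row["content_vector"]))


# Tiny stand-in for the real file (hypothetical row, vector shortened to 3 values)
sample = io.StringIO(
    "id,url,title,text,content_vector\n"
    '1,https://example.org/April,April,"April is the fourth month.",'
    '"[-0.011034, -0.013401, -0.009095]"\n'
)
print(first_row_dimensions(sample))  # the real file should give EXPECTED_DIMENSIONS
```

With the real file, pass a handle opened with open(path, newline="", encoding="utf-8") and compare the result against EXPECTED_DIMENSIONS.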
Create a Weaviate instance
We can create a Weaviate instance locally using the embedded option on Linux (transparent and fastest), Docker on any OS (fastest import and search), or in the cloud using the Weaviate Cloud Services (easiest setup, but importing may be slower due to the network speed). Each option is explained on its Installation page.
If using the Docker option, make sure to select "With Modules" (instead of standalone) and the text2vec-openai module in the Docker configurator, at the "Vectorizer & Retriever Text Module" step. At the "OpenAI Requires an API Key" step, you can choose to "provide the key with each request", as we'll do in the next section.
Connect to the instance and OpenAI
To pave the way for using OpenAI later when querying, let's make sure we provide the OpenAI API key to the client.
The API key can be provided to Weaviate as an environment variable, or in the HTTP header with every request. Here, we will add it to the Weaviate client at instantiation as shown below. The client will then send the key as a part of the HTTP request header with every request.
- Python
- TypeScript
- Go
- Curl
import weaviate
import json
client = weaviate.Client(
url = "https://some-endpoint.weaviate.network", # Replace with your endpoint
auth_client_secret=weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"), # Replace w/ your Weaviate instance API key
additional_headers = {
"X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY" # Replace with your inference API key
}
)
import weaviate, { WeaviateClient, ObjectsBatcher, ApiKey } from 'weaviate-ts-client';
import fetch from 'node-fetch';
const client: WeaviateClient = weaviate.client({
scheme: 'https',
host: 'some-endpoint.weaviate.network', // Replace with your endpoint
apiKey: new ApiKey('YOUR-WEAVIATE-API-KEY'), // Replace w/ your Weaviate instance API key
headers: {'X-OpenAI-Api-Key': 'YOUR-OPENAI-API-KEY'}, // Replace with your inference API key
});
cfg := weaviate.Config{
Host: "some-endpoint.weaviate.network", // Replace with your endpoint (host name only, no trailing slash)
Scheme: "https",
AuthConfig: auth.ApiKey{Value: "YOUR-WEAVIATE-API-KEY"}, // Replace w/ your Weaviate instance API key
Headers: map[string]string{
"X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY", // Replace with your inference API key
},
}
client, err := weaviate.NewClient(cfg)
if err != nil {
panic(err)
}
classObj := &models.Class{
Class: "Question",
Vectorizer: "text2vec-openai",
}
// Capture the error returned by Do() so that panic reports the actual failure
if err := client.Schema().ClassCreator().WithClass(classObj).Do(context.Background()); err != nil {
panic(err)
}
- With curl, add the API key to the header as shown below:
echo '{
"query": "<QUERY>"
}' | curl \
-X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR-WEAVIATE-API-KEY" \
-H "X-OpenAI-Api-Key: YOUR-OPENAI-API-KEY" \
-d @- \
https://some-endpoint.weaviate.network/v1/graphql
You can explore data in a self-hosted Weaviate instance using Weaviate's GraphQL interface by connecting to its endpoint, such as localhost:8080. If you get an error about a missing OpenAI key (e.g. when running a nearText query), you can supply the OpenAI API key by exporting it via the OPENAI_APIKEY environment variable before running docker-compose up.
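If you would rather not hard-code keys in scripts, the request headers from the curl example can be assembled from environment variables. The following is a minimal sketch under stated assumptions: the function name build_graphql_request and the WEAVIATE_API_KEY variable name are hypothetical, reading OPENAI_APIKEY on the client side is a convention borrowed from the Docker setup rather than something the client requires, and the Article class in the sample query is the one created later in this tutorial.

```python
import json
import os


def build_graphql_request(query: str, env=None):
    """Return (headers, body) for a POST to <endpoint>/v1/graphql."""
    env = os.environ if env is None else env
    headers = {"Content-Type": "application/json"}

    # Header names match the curl example above
    weaviate_key = env.get("WEAVIATE_API_KEY")  # hypothetical variable name
    if weaviate_key:
        headers["Authorization"] = f"Bearer {weaviate_key}"

    openai_key = env.get("OPENAI_APIKEY")
    if openai_key:
        headers["X-OpenAI-Api-Key"] = openai_key

    # json.dumps escapes quotes and newlines inside the query string
    body = json.dumps({"query": query}).encode("utf-8")
    return headers, body


headers, body = build_graphql_request(
    "{ Get { Article(limit: 1) { title } } }",
    env={"OPENAI_APIKEY": "YOUR-OPENAI-API-KEY"},
)
print(sorted(headers))
print(body.decode("utf-8"))
```

Building the body with json.dumps avoids hand-escaping the quotes that appear inside GraphQL queries.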
Create the schema
The schema defines the data structure for objects in a given Weaviate class. We'll create a schema for a Wikipedia Article class mapping the CSV columns, and using the text2vec-openai vectorizer. The schema will have two properties:
- title - article title, not vectorized
- content - article content, corresponding to the text column from the CSV
As of Weaviate 1.18, the text2vec-openai vectorizer uses by default the same model as the OpenAI dataset, text-embedding-ada-002. To make sure the tutorial works the same way if this default changes (for example, if OpenAI releases an even better-performing model and Weaviate switches to it as the default), we'll configure the schema vectorizer explicitly to use the same model:
{
"moduleConfig": {
"text2vec-openai": {
"model": "ada",
"modelVersion": "002",
"type": "text"
}
}
}
Another detail to be careful about is how exactly we store the content_vector embedding. Weaviate vectorizes entire objects (not properties), and by default it includes the class name in the string serialization of the object it will vectorize. Since OpenAI has provided embeddings only for the text (content) field, we need to make sure Weaviate vectorizes an Article object the same way. That means we need to disable including the class name in the vectorization, so we must set vectorizeClassName: false in the text2vec-openai section of the moduleConfig. Together, these schema settings will look like this:
- Python
- JavaScript
# client.schema.delete_all()  # WARNING: uncomment to start from scratch by deleting ALL data
# ===== Create Article class for the schema =====
article_class = {
"class": "Article",
"description": "An article from the Simple English Wikipedia data set",
"vectorizer": "text2vec-openai",
"moduleConfig": {
# Match how OpenAI created the embeddings for the `content` (`text`) field
"text2vec-openai": {
"model": "ada",
"modelVersion": "002",
"type": "text",
"vectorizeClassName": False
}
},
"properties": [
{
"name": "title",
"description": "The title of the article",
"dataType": ["text"],
# Don't vectorize the title
"moduleConfig": {"text2vec-openai": {"skip": True}}
},
{
"name": "content",
"description": "The content of the article",
"dataType": ["text"],
}
]
}
# Add the Article class to the schema
client.schema.create_class(article_class)
print('Created schema')
// Start from scratch by deleting all Articles if the class exists
try {
await client.schema.classDeleter().withClassName('Article').do();
console.log('Deleted existing Articles');
} catch (e) {
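// The client may reject with a string or an Error, so normalize to a string before matching
if (!String(e && e.message ? e.message : e).match(/could not find class/))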
throw e;
}
// ===== Create Article class for the schema =====
const articleClass = {
class: "Article",
description: "An article from the Simple English Wikipedia data set",
vectorizer: "text2vec-openai",
moduleConfig: {
// Match how OpenAI created the embeddings for the `content` (`text`) field
'text2vec-openai': {
model: 'ada',
modelVersion: '002',
type: 'text',
vectorizeClassName: false
}
},
properties: [
{
name: 'title',
description: 'The title of the article',
dataType: [ 'text' ],
// Don't vectorize the title
moduleConfig: { 'text2vec-openai': { skip: true } }
},
{
name: 'content',
description: 'The content of the article',
dataType: [ 'text' ],
}
]
}
// Add the Article class to the schema
await client.schema.classCreator().withClass(articleClass).do();
console.log('Created schema');
To quickly check that the schema was created correctly, you can navigate to <weaviate-endpoint>/v1/schema. For example, in the Docker installation scenario, go to http://localhost:8080/v1/schema or run:
curl -s http://localhost:8080/v1/schema | jq
The jq command used after curl is a handy JSON processor. When simply piping some text through it, jq returns the text pretty-printed and syntax-highlighted.
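The same check can be scripted rather than read by eye. The sketch below assumes the /v1/schema response is a JSON object with a classes list whose entries mirror the class definition sent at creation time. The function name check_article_class and the trimmed sample_schema are illustrative, and the real response contains more fields than shown here.

```python
def check_article_class(schema: dict) -> list[str]:
    """Return a list of problems found in the Article class (empty if none)."""
    classes = {c.get("class"): c for c in schema.get("classes", [])}
    article = classes.get("Article")
    if article is None:
        return ["class 'Article' not found"]

    problems = []
    if article.get("vectorizer") != "text2vec-openai":
        problems.append("vectorizer is not text2vec-openai")

    module = article.get("moduleConfig", {}).get("text2vec-openai", {})
    if module.get("vectorizeClassName") is not False:
        problems.append("vectorizeClassName should be false")
    if (module.get("model"), module.get("modelVersion")) != ("ada", "002"):
        problems.append("model should be ada / 002")

    names = {p.get("name") for p in article.get("properties", [])}
    for required in ("title", "content"):
        if required not in names:
            problems.append(f"missing property '{required}'")
    return problems


# Trimmed, hypothetical response shaped like the output of GET /v1/schema
sample_schema = {
    "classes": [{
        "class": "Article",
        "vectorizer": "text2vec-openai",
        "moduleConfig": {"text2vec-openai": {
            "model": "ada", "modelVersion": "002",
            "type": "text", "vectorizeClassName": False,
        }},
        "properties": [{"name": "title"}, {"name": "content"}],
    }]
}
print(check_article_class(sample_schema))
```

To use it against a live instance, load the JSON printed by the curl command above and pass the resulting dictionary to check_article_class.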
Import the articles
We're now ready to import the articles. For maximum performance, we'll load the articles into Weaviate via batch import.
- Python
- JavaScript
# ===== Import data =====
# Configure the batch import
client.batch.configure(
batch_size=100,
)
# Settings for displaying the import progress
counter = 0
interval = 100 # print progress every this many records
# Create a pandas dataframe iterator with lazy-loading,
# so we don't load all records in RAM at once.
import pandas as pd
csv_iterator = pd.read_csv(
'vector_database_wikipedia_articles_embedded.csv',
usecols=['id', 'url', 'title', 'text', 'content_vector'],
chunksize=100, # number of rows per chunk
# nrows=350 # optionally limit the number of rows to import
)
# Iterate through the dataframe chunks and add each CSV record to the batch
import ast
for chunk in csv_iterator:
    for index, row in chunk.iterrows():
        properties = {
            "title": row["title"],
            "content": row["text"],
            "url": row["url"]
        }
        # Convert the vector from CSV string back to array of floats
        vector = ast.literal_eval(row["content_vector"])
        # Add the object to the batch, and set its vector embedding
        client.batch.add_data_object(properties, "Article", vector=vector)
        # Calculate and display progress
        counter += 1
        if counter % interval == 0:
            print(f"Imported {counter} articles...")

# Send any objects still waiting in the batch
client.batch.flush()
print(f"Finished importing {counter} articles.")
// ===== Import data =====
import fs from 'fs';
import csv from 'csv-parser';
async function importCSV(filePath) {
let batcher = client.batch.objectsBatcher();
let counter = 0;
const batchSize = 100;
return new Promise((resolve, reject) => {
fs.createReadStream(filePath)
.pipe(csv())
.on('data', async (row) => {
// Import each record
const obj = {
class: 'Article',
properties: {
title: row.title,
content: row.text,
url: row.url,
},
vector: JSON.parse(row['content_vector']),
}
// Add the object to the batch queue
batcher = batcher.withObject(obj);
counter++;
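// When the batch counter reaches batchSize, push the objects to Weaviate
if (counter % batchSize === 0) {
console.log(`Imported ${counter} articles...`);
// Swap in a fresh batcher before awaiting: the stream keeps emitting rows
// while the request is in flight, and they must not land in the batch being sent
const fullBatch = batcher;
batcher = client.batch.objectsBatcher();
await fullBatch.do();
}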
})
.on('end', async () => {
// Flush the remaining objects
await batcher.do();
console.log(`Finished importing ${counter} articles.`);
resolve();
})
// Reject the promise if the CSV parser reports an error
.on('error', reject);
});
}
await importCSV('vector_database_wikipedia_articles_embedded.csv');
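The lazy-loading idea does not depend on pandas or csv-parser. As a rough sketch, the standard library can stream the file and group rows into batches. The helper names iter_batches and row_to_object and the three-row inline sample are assumptions for illustration, and sending each batch to Weaviate is left out.

```python
import csv
import io
import json
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable, lazily."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def row_to_object(row):
    """Map one CSV row to (properties, vector) using the tutorial's column names."""
    properties = {"title": row["title"], "content": row["text"], "url": row["url"]}
    vector = json.loads(row["content_vector"])
    return properties, vector


# Hypothetical stand-in for the real CSV file, with shortened vectors
sample = io.StringIO(
    "id,url,title,text,content_vector\n"
    '1,https://example.org/a,A,"Text a","[0.1, 0.2]"\n'
    '2,https://example.org/b,B,"Text b","[0.3, 0.4]"\n'
    '3,https://example.org/c,C,"Text c","[0.5, 0.6]"\n'
)
for batch in iter_batches(csv.DictReader(sample), batch_size=2):
    objects = [row_to_object(row) for row in batch]
    print(len(objects), [props["title"] for props, _ in objects])
```

Only one batch is held in memory at a time. With the real file, very long article texts may exceed the csv module's default field size limit, in which case it can be raised with csv.field_size_limit.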
Checking the import went correctly
Two quick sanity checks that the import went as expected:
- Get the number of articles
- Get 5 articles
Go to the Weaviate GraphQL console, connect to your Weaviate endpoint (e.g. http://localhost:8080 or https://some-endpoint.weaviate.network), then run the GraphQL query below:
query {
Aggregate { Article { meta { count } } }
Get {
Article(limit: 5) {
title
url
}
}
}
You should see the Aggregate.Article.meta.count field equal to the number of articles you've imported (e.g. 25,000), as well as five articles with their title and url fields.
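If you run the query from a script instead of the console, the two checks can be read out of the response programmatically. This sketch assumes the response nests the count under data.Aggregate.Article[0].meta.count and the sample objects under data.Get.Article. The function name summarize_import and the sample response are hypothetical.

```python
def summarize_import(response: dict, expected_count: int) -> dict:
    """Pull the count and sample titles out of the combined Aggregate/Get response."""
    data = response["data"]
    count = data["Aggregate"]["Article"][0]["meta"]["count"]
    titles = [article["title"] for article in data["Get"]["Article"]]
    return {
        "count": count,
        "count_matches": count == expected_count,
        "sample_titles": titles,
    }


# Hypothetical response, shaped like the output of the query above
sample_response = {
    "data": {
        "Aggregate": {"Article": [{"meta": {"count": 25000}}]},
        "Get": {"Article": [{"title": "April", "url": "https://example.org/April"}]},
    }
}
print(summarize_import(sample_response, expected_count=25000))
```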
Queries
Now that we have the articles imported, let's run some queries!
nearText
The nearText filter lets us search for objects close (in vector space) to the vector representation of one or more concepts. For example, the vector for the query "modern art in Europe" would be close to the vector for the article Documenta, which describes "one of the most important exhibitions of modern art in the world... [taking] place in Kassel, Germany".
- GraphQL
- Python
- JavaScript
- Curl
{
Get {
Article(
nearText: {concepts: ["modern art in Europe"]},
limit: 1
) {
title
content
}
}
}
import weaviate
import json
client = weaviate.Client(
url="https://some-endpoint.weaviate.network/", # replace with your endpoint
additional_headers={
"X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY" # Replace with your API key
}
)
nearText = {"concepts": ["modern art in Europe"]}
result = (
client.query
.get("Article", ["title", "content"])
.with_near_text(nearText)
.with_limit(1)
.do()
)
print(json.dumps(result, indent=4))
const nearTextResult = await client.graphql
.get()
.withClassName('Article')
.withFields('title content')
.withNearText({concepts: ['modern art in Europe']})
.withLimit(1)
.do();
console.log('nearText: modern art in Europe = ', JSON.stringify(nearTextResult.data['Get']['Article'], null, 2));
echo '{
"query": "{
Get{
Article(
nearText: {
concepts: [\"modern art in Europe\"],
},
limit: 1
){
title
content
}
}
}"
}' | curl \
-X POST \
-H 'Content-Type: application/json' \
-H "X-OpenAI-Api-Key: YOUR-OPENAI-API-KEY" \
-d @- \
http://localhost:8080/v1/graphql
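To make "close in vector space" concrete, here is a toy illustration using cosine distance, which is assumed here as the metric (the metric is configurable per class). The three-dimensional vectors are invented for the example and are not real embeddings; real ones in this dataset have 1536 dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vector, objects, limit=1):
    """Return the titles of the objects closest to the query vector."""
    ranked = sorted(objects, key=lambda title: cosine_distance(query_vector, objects[title]))
    return ranked[:limit]


# Invented vectors standing in for the query and two stored articles
query = [0.9, 0.1, 0.0]  # "modern art in Europe"
articles = {
    "Documenta": [0.8, 0.2, 0.1],
    "April": [0.0, 0.1, 0.9],
}
print(nearest(query, articles))  # ['Documenta']
```

A nearText query does the equivalent at scale: the query text is vectorized with the OpenAI model, then compared against the stored article vectors.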
hybrid
While nearText uses dense vectors to find objects similar in meaning to the search query, it does not perform very well on keyword searches. For example, a nearText search for "jackfruit" in this Simple English Wikipedia dataset will find "cherry tomato" as the top result. For these (and indeed, most) situations, we can obtain better search results by using the hybrid filter, which combines dense vector search with keyword search:
- GraphQL
- Python
- JavaScript
- Curl
{
Get {
Article (
hybrid: {
query: "jackfruit"
alpha: 0.5 # default 0.75
}
limit: 3
) {
title
content
_additional {score}
}
}
}
result = (
client.query
.get("Article", ["title", "content"])
.with_hybrid("jackfruit", alpha=0.5) # default 0.75
.with_limit(3)
.do()
)
print(json.dumps(result, indent=4))
const hybridResult = await client.graphql
.get()
.withClassName('Article')
.withFields('title content _additional{score}')
.withHybrid({
query: 'jackfruit',
alpha: 0.5, // optional, defaults to 0.75
})
.withLimit(3)
.do();
console.log('hybrid: jackfruit = ', JSON.stringify(hybridResult.data['Get']['Article'], null, 2));
echo '{
"query": "{
Get {
Article (
hybrid: {
query: \"jackfruit\"
alpha: 0.5
}
limit: 3
) {
title
content
_additional {score}
}
}
}"
}' | curl \
-X POST \
-H 'Content-Type: application/json' \
-H "X-OpenAI-Api-Key: YOUR-OPENAI-API-KEY" \
-d @- \
http://localhost:8080/v1/graphql
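The effect of alpha can be pictured with a simple weighted blend, where alpha = 1 uses only the vector score and alpha = 0 only the keyword score. The sketch below is an illustration of that weighting idea, not Weaviate's actual fusion algorithm. The function name blend_scores, the scores, and the "Durian" entry are all invented for the example.

```python
def blend_scores(vector_scores, keyword_scores, alpha=0.75):
    """Rank titles by alpha * vector score + (1 - alpha) * keyword score."""
    titles = set(vector_scores) | set(keyword_scores)
    blended = {
        title: alpha * vector_scores.get(title, 0.0)
        + (1 - alpha) * keyword_scores.get(title, 0.0)
        for title in titles
    }
    # Highest blended score first; ties broken alphabetically for stable output
    return sorted(blended, key=lambda title: (-blended[title], title))


# Invented, already-normalized scores for a "jackfruit" query
vector_scores = {"Cherry tomato": 0.9, "Jackfruit": 0.8, "Durian": 0.7}
keyword_scores = {"Jackfruit": 1.0, "Durian": 0.3}

print(blend_scores(vector_scores, keyword_scores, alpha=1.0))  # vector only
print(blend_scores(vector_scores, keyword_scores, alpha=0.5))  # balanced
```

With alpha = 1.0 the toy ranking puts "Cherry tomato" first, mirroring the nearText behavior described above. At alpha = 0.5 the exact keyword match lifts "Jackfruit" to the top.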
Recap
In this tutorial, we've learned:
- how to efficiently import large datasets using Weaviate batching and CSV lazy loading with pandas/csv-parser
- how to import existing vectors ("Bring Your Own Vectors")
- how to quickly check that all records were imported
- how to use nearText and hybrid searches
More Resources
If you can't find the answer to your question here, please look at the:
- Frequently Asked Questions
- Knowledge base of old issues
- Stack Overflow, for questions
- Weaviate Community Forum, for more involved discussion
- Slack channel