Wikipedia with custom vectors
Overview
This tutorial will show you how to import a large dataset (25k articles from Wikipedia) that already includes vectors (embeddings generated by OpenAI). We will:
- download and unzip a CSV file that contains the Wikipedia articles
- create a Weaviate instance
- create a schema
- parse the file and batch import the records, with Python and JavaScript code
- make sure the data was imported correctly
- run a few queries to demonstrate semantic search capabilities
Prerequisites
If you haven't yet, we recommend going through the Quickstart tutorial first to get the most out of this section.
Before you start this tutorial, make sure to have:
- An OpenAI API key. Even though we already have vector embeddings generated by OpenAI, we'll need an OpenAI key to vectorize search queries, and to recalculate vector embeddings for updated object contents.
- Your preferred Weaviate client library installed.
See how to delete data from previous tutorials (or previous runs of this tutorial).
You can delete any unwanted collection(s), along with the data that they contain.
When you delete a collection, you delete all associated objects!
Be very careful with deletes on a production database and anywhere else that you have important data.
This code deletes a collection and its objects.
- Python Client v4
- Python Client v3
- JS/TS Client v3
- JS/TS Client v2
- Go
- Java
- Curl
# collection_name can be a string ("Article") or a list of strings (["Article", "Category"])
client.collections.delete(collection_name) # THIS WILL DELETE THE SPECIFIED COLLECTION(S) AND THEIR OBJECTS
# Note: you can also delete all collections in the Weaviate instance with:
# client.collections.delete_all()
# delete class "Article" - THIS WILL DELETE ALL DATA IN THIS CLASS
client.schema.delete_class("Article") # Replace with your class name
// delete collection "Article" - THIS WILL DELETE THE COLLECTION AND ALL ITS DATA
await client.collections.delete('Article')
// you can also delete all collections of a cluster
// await client.collections.deleteAll()
// delete collection "Article" - THIS WILL DELETE THE COLLECTION AND ALL ITS DATA
await client.schema
  .classDeleter()
  .withClassName('Article')
  .do();
className := "YourClassName"

// delete the class
if err := client.Schema().ClassDeleter().WithClassName(className).Do(context.Background()); err != nil {
  // Weaviate will return a 400 if the class does not exist, so this is allowed; only return an error if it's not a 400
  if status, ok := err.(*fault.WeaviateClientError); ok && status.StatusCode != http.StatusBadRequest {
    panic(err)
  }
}
Result<Boolean> result = client.schema().classDeleter()
.withClassName(className)
.run();
curl \
-X DELETE \
https://WEAVIATE_INSTANCE_URL/v1/schema/YourClassName # Replace WEAVIATE_INSTANCE_URL with your instance URL
Download the dataset
We will use this Simple English Wikipedia dataset hosted by OpenAI (~700MB zipped, 1.7GB CSV file) that includes vector embeddings. These are the columns of interest, where content_vector is a vector embedding with 1536 elements (dimensions), generated using OpenAI's text-embedding-ada-002 model:
id | url | title | text | content_vector |
---|---|---|---|---|
1 | https://simple | April | "April is the fourth month of the year..." | [-0.011034, -0.013401, ..., -0.009095] |
If you haven't already, make sure to download the dataset and unzip the file. You should end up with vector_database_wikipedia_articles_embedded.csv
in your working directory. The records are mostly (but not strictly) sorted by title.
Download Wikipedia dataset ZIP
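Before moving on, you can optionally sanity-check the downloaded file. This is a minimal sketch (assuming pandas is installed); note that the CSV serializes each vector as a string, so it must be parsed back into a list of floats:

import ast
import pandas as pd

# Read only the first row to inspect the file cheaply
df = pd.read_csv('vector_database_wikipedia_articles_embedded.csv', nrows=1)
print(df.columns.tolist())  # should include 'id', 'url', 'title', 'text', 'content_vector'

# The vectors are stored as strings in the CSV; parse one back into floats
vector = ast.literal_eval(df.loc[0, 'content_vector'])
print(len(vector))  # expected: 1536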
Create a Weaviate instance
We can create a Weaviate instance locally using the embedded option on Linux (transparent and fastest), Docker on any OS (fastest import and search), or in the cloud using Weaviate Cloud (easiest setup, but importing may be slower due to network speed). Each option is explained on its Installation page.
If using the Docker option, make sure to select "With Modules" (instead of standalone) and the text2vec-openai module when using the Docker configurator, at the "Vectorizer & Retriever Text Module" step. At the "OpenAI Requires an API Key" step, you can choose to "provide the key with each request", as we'll do in the next section.
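Once the instance is up, a quick readiness check can save debugging time later. A minimal sketch using the Python client, assuming a local Docker instance with anonymous access on the default port:

import weaviate

# Connect without authentication (the default for a local Docker setup)
client = weaviate.Client('http://localhost:8080')
print(client.is_ready())  # True once the instance is reachable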
Connect to the instance and OpenAI
Add the OpenAI API key to the client so you can use the OpenAI vectorizer API when you send queries to Weaviate.
The API key can be provided to Weaviate as an environment variable, or in the HTTP header with every request. This example adds the key to the client. The client sends the key with every request as a part of the HTTP request header.
- Python
- JS/TS Client v2
- Go
- Java
- Curl
import weaviate

# Instantiate the client with the auth config
client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL",  # Replace with your Weaviate endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace with your Weaviate instance API key
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
)
import weaviate, { ApiKey } from 'weaviate-ts-client';

// Instantiate the client with the auth config
const client = weaviate.client({
  scheme: 'https',
  host: 'WEAVIATE_INSTANCE_URL',  // Replace WEAVIATE_INSTANCE_URL with your instance URL
  apiKey: new ApiKey('YOUR-WEAVIATE-API-KEY'),  // Replace with your Weaviate instance API key
  headers: {
    'X-OpenAI-Api-Key': process.env['OPENAI_API_KEY'],  // Replace with your API key
  },
});
package main

import (
  "fmt"

  "github.com/weaviate/weaviate-go-client/v4/weaviate"
  "github.com/weaviate/weaviate-go-client/v4/weaviate/auth"
)

func main() {
  // Instantiate the client with the auth config
  cfg := weaviate.Config{
    Host:   "WEAVIATE_INSTANCE_URL", // Replace with your instance URL (host only, without the scheme)
    Scheme: "https",
    AuthConfig: auth.ApiKey{Value: "YOUR-WEAVIATE-API-KEY"}, // Replace with your Weaviate instance API key
    Headers: map[string]string{
      "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
  }

  client, err := weaviate.NewClient(cfg)
  if err != nil {
    fmt.Println(err)
  }
  _ = client
}
import java.util.HashMap;
import java.util.Map;

import io.weaviate.client.Config;
import io.weaviate.client.WeaviateAuthClient;
import io.weaviate.client.WeaviateClient;

Map<String, String> headers = new HashMap<String, String>() { {
  put("X-OpenAI-Api-Key", "YOUR-OPENAI-API-KEY");
} };
Config config = new Config("https", "WEAVIATE_INSTANCE_URL", headers); // Replace WEAVIATE_INSTANCE_URL with your instance URL
WeaviateClient client = WeaviateAuthClient.apiKey(config, "YOUR-WEAVIATE-API-KEY"); // Replace with your Weaviate instance API key
# Replace WEAVIATE_INSTANCE_URL with your instance URL
curl https://WEAVIATE_INSTANCE_URL/v1/meta \
-H 'Content-Type: application/json' \
-H "X-OpenAI-Api-Key: YOUR-OPENAI-API-KEY" \
-H "Authorization: Bearer YOUR-WEAVIATE-API-KEY" | jq
Create the schema
The schema defines the data structure for objects in a given Weaviate class. We'll create a schema for a Wikipedia Article class that maps the CSV columns and uses the text2vec-openai vectorizer. The schema will have two properties:
- title - the article title, not vectorized
- content - the article content, corresponding to the text column from the CSV
As of Weaviate 1.18, the text2vec-openai vectorizer uses by default the same model as the OpenAI dataset, text-embedding-ada-002. To make sure the tutorial will work the same way if this default changes (i.e. if OpenAI releases an even better-performing model and Weaviate switches to it as the default), we'll configure the schema vectorizer explicitly to use the same model:
{
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text"
    }
  }
}
Another detail to be careful about is how exactly we store the content_vector embedding. Weaviate vectorizes entire objects (not properties), and by default it includes the class name in the string serialization of the object it vectorizes. Since OpenAI has provided embeddings only for the text (content) field, we need to make sure Weaviate vectorizes an Article object the same way. That means we need to disable including the class name in the vectorization, so we must set vectorizeClassName: false in the text2vec-openai section of the moduleConfig. Together, these schema settings will look like this:
- Python
- JS/TS Client v2
# client.schema.delete_all()  # ⚠️ uncomment to start from scratch by deleting ALL data

# ===== Create Article class for the schema =====
article_class = {
    "class": "Article",
    "description": "An article from the Simple English Wikipedia data set",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        # Match how OpenAI created the embeddings for the `content` (`text`) field
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": False
        }
    },
    "properties": [
        {
            "name": "title",
            "description": "The title of the article",
            "dataType": ["text"],
            # Don't vectorize the title
            "moduleConfig": {"text2vec-openai": {"skip": True}}
        },
        {
            "name": "content",
            "description": "The content of the article",
            "dataType": ["text"],
        }
    ]
}

# Add the Article class to the schema
client.schema.create_class(article_class)
print("Created schema")
try {
  await client.schema.classDeleter().withClassName('Article').do();
  console.log('Deleted existing Articles');
} catch (e) {
  // Ignore the error if the class didn't exist yet
  if (!e.message.match(/could not find class/))
    throw e;
}

// ===== Create Article class for the schema =====
const articleClass = {
  class: 'Article',
  description: 'An article from the Simple English Wikipedia data set',
  vectorizer: 'text2vec-openai',
  moduleConfig: {
    // Match how OpenAI created the embeddings for the `content` (`text`) field
    'text2vec-openai': {
      model: 'ada',
      modelVersion: '002',
      type: 'text',
      vectorizeClassName: false,
    },
  },
  properties: [
    {
      name: 'title',
      description: 'The title of the article',
      dataType: ['text'],
      // Don't vectorize the title
      moduleConfig: { 'text2vec-openai': { skip: true } },
    },
    {
      name: 'content',
      description: 'The content of the article',
      dataType: ['text'],
    },
  ],
};

// Add the Article class to the schema
await client.schema.classCreator().withClass(articleClass).do();
console.log('Created schema');
To quickly check that the schema was created correctly, you can navigate to <weaviate-endpoint>/v1/schema. For example, in the Docker installation scenario, go to http://localhost:8080/v1/schema or run:
curl -s http://localhost:8080/v1/schema | jq
The jq command used after curl is a handy JSON processor. When simply piping some text through it, jq returns the text pretty-printed and syntax-highlighted.
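The same check can be run from the Python client; a small sketch assuming the client object from earlier (v3 API):

# Fetch the Article class definition and confirm the vectorizer settings
schema = client.schema.get("Article")
print(schema["vectorizer"])  # text2vec-openai
print(schema["moduleConfig"]["text2vec-openai"]["vectorizeClassName"])  # False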
Import the articles
We're now ready to import the articles. For maximum performance, we'll load the articles into Weaviate via batch import.
- Python
- JS/TS Client v2
# ===== Import data =====
import ast

import pandas as pd

# Settings for displaying the import progress
counter = 0
interval = 100  # print progress every this many records

# Create a pandas dataframe iterator with lazy-loading,
# so we don't load all records in RAM at once
csv_iterator = pd.read_csv(
    'vector_database_wikipedia_articles_embedded.csv',
    usecols=['id', 'url', 'title', 'text', 'content_vector'],
    chunksize=100,  # number of rows per chunk
    # nrows=350  # optionally limit the number of rows to import
)

client.batch.configure(batch_size=100)  # Configure batch

# Iterate through the dataframe chunks and add each CSV record to the batch
with client.batch as batch:
    for index, row in (row for chunk in csv_iterator for row in chunk.iterrows()):
        properties = {
            "title": row.title,
            "content": row.text,
            "url": row.url
        }

        # Convert the vector from CSV string back to array of floats
        vector = ast.literal_eval(row.content_vector)

        # Add the object to the batch, and set its vector embedding
        batch.add_data_object(properties, "Article", vector=vector)

        # Calculate and display progress
        counter += 1
        if counter % interval == 0:
            print(f"Imported {counter} articles...")

print(f"Finished importing {counter} articles.")
// ===== Import data =====
import fs from 'fs';
import csv from 'csv-parser';

async function importCSV(filePath) {
  let batcher = client.batch.objectsBatcher();
  let counter = 0;
  const batchSize = 100;

  return new Promise((resolve, reject) => {
    fs.createReadStream(filePath)
      .pipe(csv())
      .on('data', async (row) => {
        // Import each record
        const obj = {
          class: 'Article',
          properties: {
            title: row.title,
            content: row.text,
            url: row.url,
          },
          // Convert the vector from its CSV string form back to an array of floats
          vector: JSON.parse(row['content_vector']),
        };

        // Add the object to the batch queue
        batcher = batcher.withObject(obj);
        counter++;

        // When the batch counter reaches batchSize, push the objects to Weaviate
        if (counter % batchSize === 0) {
          console.log(`Imported ${counter} articles...`);

          // Flush the batch queue and restart it
          await batcher.do();
          batcher = client.batch.objectsBatcher();
        }
      })
      .on('end', async () => {
        // Flush the remaining objects
        await batcher.do();
        console.log(`Finished importing ${counter} articles.`);
        resolve();
      });
  });
}

await importCSV('vector_database_wikipedia_articles_embedded.csv');
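One caveat with batch imports: failures on individual objects do not raise exceptions client-side. For the Python version above, you can register a callback through the v3 client's batch configuration to surface per-object errors; the helper below is a hypothetical sketch:

# Print any per-object errors returned with each batch response
def check_batch_result(results):
    if results is None:
        return
    for result in results:
        if "errors" in result.get("result", {}):
            print(result["result"]["errors"])

client.batch.configure(batch_size=100, callback=check_batch_result)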
Checking the import went correctly
Two quick sanity checks that the import went as expected:
- Get the number of articles
- Get 5 articles

To run both checks:
- Open the Weaviate Query app
- Connect to your Weaviate endpoint, either http://localhost:8080 or https://WEAVIATE_INSTANCE_URL (replace WEAVIATE_INSTANCE_URL with your instance URL)
- Run this GraphQL query:
query {
  Aggregate { Article { meta { count } } }
  Get {
    Article(limit: 5) {
      title
      url
    }
  }
}
You should see the Aggregate.Article.meta.count field equal to the number of articles you've imported (e.g. 25,000), as well as five random articles with their title and url fields.
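You can run the same checks from the Python client as well; a small sketch assuming the client from the import step (v3 API):

import json

# Aggregate count of Article objects
result = client.query.aggregate("Article").with_meta_count().do()
print(result["data"]["Aggregate"]["Article"][0]["meta"]["count"])

# Fetch five articles with their title and url
sample = client.query.get("Article", ["title", "url"]).with_limit(5).do()
print(json.dumps(sample, indent=2))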
Queries
Now that we have the articles imported, let's run some queries!
nearText
The nearText filter lets us search for objects close (in vector space) to the vector representation of one or more concepts. For example, the vector for the query "modern art in Europe" would be close to the vector for the article Documenta, which describes "one of the most important exhibitions of modern art in the world... [taking] place in Kassel, Germany".
- GraphQL
- Python
- JS/TS Client v2
- Curl
{
  Get {
    Article(
      nearText: {concepts: ["modern art in Europe"]},
      limit: 1
    ) {
      title
      content
    }
  }
}
import weaviate
import json

client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL/",  # Replace with your Weaviate endpoint
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"  # Replace with your API key
    }
)

nearText = {"concepts": ["modern art in Europe"]}

result = (
    client.query
    .get("Article", ["title", "content"])
    .with_near_text(nearText)
    .with_limit(1)
    .do()
)

print(json.dumps(result, indent=4))
const nearTextResult = await client.graphql
  .get()
  .withClassName('Article')
  .withFields('title content')
  .withNearText({ concepts: ['modern art in Europe'] })
  .withLimit(1)
  .do();

console.log('nearText: modern art in Europe = ', JSON.stringify(nearTextResult.data['Get']['Article'], null, 2));
echo '{
  "query": "{
    Get {
      Article(
        nearText: {
          concepts: [\"modern art in Europe\"],
        },
        limit: 1
      ) {
        title
      }
    }
  }"
}' | curl \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer learn-weaviate' \
  -H "X-OpenAI-Api-Key: $OPENAI_API_KEY" \
  -d @- \
  https://edu-demo.weaviate.network/v1/graphql
hybrid
While nearText uses dense vectors to find objects similar in meaning to the search query, it does not perform very well on keyword searches. For example, a nearText search for "jackfruit" in this Simple English Wikipedia dataset will find "cherry tomato" as the top result. For these (and indeed, most) situations, we can obtain better search results by using the hybrid filter, which combines dense vector search with keyword search. The alpha parameter weights the two: 1 is a pure vector search, 0 is a pure keyword search, and the default of 0.75 leans toward vector results:
- GraphQL
- Python
- JS/TS Client v2
- Curl
{
  Get {
    Article (
      hybrid: {
        query: "jackfruit"
        alpha: 0.5  # default 0.75
      }
      limit: 3
    ) {
      title
      content
      _additional {score}
    }
  }
}
result = (
    client.query
    .get("Article", ["title", "content"])
    .with_hybrid("jackfruit", alpha=0.5)  # default 0.75
    .with_limit(3)
    .do()
)

print(json.dumps(result, indent=4))
const hybridResult = await client.graphql
  .get()
  .withClassName('Article')
  .withFields('title content _additional{score}')
  .withHybrid({
    query: 'jackfruit',
    alpha: 0.5,  // optional, defaults to 0.75
  })
  .withLimit(3)
  .do();

console.log('hybrid: jackfruit = ', JSON.stringify(hybridResult.data['Get']['Article'], null, 2));
echo '{
  "query": "{
    Get {
      Article (
        hybrid: {
          query: \"jackfruit\"
          alpha: 0.5
        }
        limit: 3
      ) {
        title
        _additional {score}
      }
    }
  }"
}' | curl \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer learn-weaviate' \
  -H "X-OpenAI-Api-Key: $OPENAI_API_KEY" \
  -d @- \
  https://edu-demo.weaviate.network/v1/graphql
Recap
In this tutorial, we've learned:
- how to efficiently import large datasets using Weaviate batching and CSV lazy loading with pandas/csv-parser
- how to import existing vectors ("Bring Your Own Vectors")
- how to quickly check that all records were imported
- how to use nearText and hybrid searches
Questions and feedback
If you have any questions or feedback, let us know in the user forum.