Wikipedia with custom vectors

Overview

This tutorial shows how to import a large dataset (25k articles from Wikipedia) that already includes vectors (embeddings generated by OpenAI). We will:

  • download and unzip a CSV file that contains the Wikipedia articles
  • create a Weaviate instance
  • create a schema
  • parse the file and batch import the records, with Python and JavaScript code
  • make sure the data was imported correctly
  • run a few queries to demonstrate semantic search capabilities

Prerequisites

If you haven't yet, we recommend going through the Quickstart tutorial first to get the most out of this section.

Before you start this tutorial, make sure to have:

  • An OpenAI API key. Even though we already have vector embeddings generated by OpenAI, we'll need an OpenAI key to vectorize search queries, and to recalculate vector embeddings for updated object contents.
  • Your preferred Weaviate client library installed.
See how to delete data from previous tutorials (or previous runs of this tutorial).

You can delete any unwanted collection(s), along with the data that they contain.

Deleting a collection also deletes its objects

When you delete a collection, you delete all associated objects!

Be very careful with deletes on a production database and anywhere else that you have important data.

This code deletes a collection and its objects.

# Delete collection "Article" - THIS WILL DELETE THE COLLECTION AND ALL ITS DATA
client.collections.delete("Article")  # Replace with your collection name

Download the dataset

We will use this Simple English Wikipedia dataset hosted by OpenAI (~700MB zipped, 1.7GB CSV file) that includes vector embeddings. These are the columns of interest, where content_vector is a vector embedding with 1536 elements (dimensions), generated using OpenAI's text-embedding-ada-002 model:

id | url | title | text | content_vector
1 | https://simple.wikipedia.org/wiki/April | April | "April is the fourth month of the year..." | [-0.011034, -0.013401, ..., -0.009095]

If you haven't already, make sure to download the dataset and unzip the file. You should end up with vector_database_wikipedia_articles_embedded.csv in your working directory. The records are mostly (but not strictly) sorted by title.
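
Before importing, it can help to confirm that a record parses the way you expect. The following is a minimal sketch using only the Python standard library. The helper name `peek_first_record` and the raised field-size limit are assumptions made for illustration, and the in-memory sample is a shortened stand-in for the real file (its vector has 3 elements instead of 1536).

```python
import ast
import csv
import io

# Long fields (article text, embeddings) can exceed csv's default field size
# limit, so raise it. The exact value is an arbitrary choice for this sketch.
csv.field_size_limit(10**9)


def peek_first_record(csv_file):
    """Return (title, vector) for the first data row of an open CSV file.

    Hypothetical helper: `csv_file` is any text file object whose header
    includes the `title` and `content_vector` columns described above.
    """
    reader = csv.DictReader(csv_file)
    row = next(reader)  # reads a single row, not the whole file
    # The embedding is stored as a string such as "[-0.011034, -0.013401]"
    vector = [float(x) for x in ast.literal_eval(row["content_vector"])]
    return row["title"], vector


# Shortened in-memory stand-in for the real CSV file
sample = io.StringIO(
    "id,url,title,text,content_vector\n"
    '1,https://simple.wikipedia.org/wiki/April,April,'
    '"April is the fourth month of the year...",'
    '"[-0.011034, -0.013401, -0.009095]"\n'
)
title, vector = peek_first_record(sample)
print(title, len(vector))
```

To run the same check against the downloaded file, pass `open("vector_database_wikipedia_articles_embedded.csv", newline="", encoding="utf-8")` instead of the sample; the vector length should then be 1536.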

Download Wikipedia dataset ZIP

Create a Weaviate instance

We can create a Weaviate instance locally using the embedded option on Linux (transparent and fastest), Docker on any OS (fastest import and search), or in the cloud using the Weaviate Cloud Services (easiest setup, but importing may be slower due to the network speed). Each option is explained on its Installation page.

text2vec-openai

If using the Docker option, make sure to select "With Modules" (instead of standalone), and the text2vec-openai module when using the Docker configurator, at the "Vectorizer & Retriever Text Module" step. At the "OpenAI Requires an API Key" step, you can choose to "provide the key with each request", as we'll do in the next section.

Connect to the instance and OpenAI

To pave the way for using OpenAI later when querying, let's make sure we provide the OpenAI API key to the client.

The API key can be provided to Weaviate as an environment variable, or in the HTTP header with every request. Here, we will add it to the Weaviate client at instantiation as shown below. The client will then send the key as a part of the HTTP request header with every request.

import weaviate

# Instantiate the client with the auth config
client = weaviate.Client(
    url="https://some-endpoint.weaviate.network",  # Replace w/ your endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace w/ your Weaviate instance API key
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
)
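
Rather than hard-coding the OpenAI key in source code, you may prefer to read it from the environment. This is a small sketch of one way to build the header dictionary; the helper name `openai_headers` is hypothetical, and reading the key from a variable named `OPENAI_APIKEY` is an assumption (any variable name would work).

```python
import os


def openai_headers(env=os.environ):
    """Build the additional_headers dict from an environment mapping.

    Hypothetical helper. The OPENAI_APIKEY variable name is an assumption
    made for this sketch.
    """
    key = env.get("OPENAI_APIKEY")
    if not key:
        raise RuntimeError("Set the OPENAI_APIKEY environment variable first")
    return {"X-OpenAI-Api-Key": key}


# Demonstration with a stand-in mapping instead of the real environment
headers = openai_headers({"OPENAI_APIKEY": "sk-example"})
print(headers)
```

The returned dictionary could then be passed as `additional_headers` when instantiating the client.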
Running GraphQL OpenAI queries against local Docker instances

You can explore data in a self-hosted Weaviate instance using Weaviate's GraphQL interface by connecting to its endpoint, such as localhost:8080. If you get an error about a missing OpenAI key (e.g. when running a nearText query), you can supply the OpenAI API key by exporting it via the OPENAI_APIKEY environment variable before running docker compose up.

Create the schema

The schema defines the data structure for objects in a given Weaviate class. We'll create a schema for a Wikipedia Article class mapping the CSV columns, and using the text2vec-openai vectorizer. The schema will have two properties:

  • title - article title, not vectorized
  • content - article content, corresponding to the text column from the CSV

As of Weaviate 1.18, the text2vec-openai vectorizer uses the same model as the OpenAI dataset, text-embedding-ada-002, by default. To make sure the tutorial keeps working if this default changes (for example, if OpenAI releases a better-performing model and Weaviate switches to it), we'll explicitly configure the schema vectorizer to use that model:

{
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text"
    }
  }
}

Another detail to be careful about is how exactly we store the content_vector embedding. Weaviate vectorizes entire objects (not properties), and it includes by default the class name in the string serialization of the object it will vectorize. Since OpenAI has provided embeddings only for the text (content) field, we need to make sure Weaviate vectorizes an Article object the same way. That means we need to disable including the class name in the vectorization, so we must set vectorizeClassName: false in the text2vec-openai section of the moduleConfig. Together, these schema settings will look like this:

# client.schema.delete_all()  # ⚠️ uncomment to start from scratch by deleting ALL data

# ===== Create Article class for the schema =====
article_class = {
    "class": "Article",
    "description": "An article from the Simple English Wikipedia data set",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        # Match how OpenAI created the embeddings for the `content` (`text`) field
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": False
        }
    },
    "properties": [
        {
            "name": "title",
            "description": "The title of the article",
            "dataType": ["text"],
            # Don't vectorize the title
            "moduleConfig": {"text2vec-openai": {"skip": True}}
        },
        {
            "name": "content",
            "description": "The content of the article",
            "dataType": ["text"],
        }
    ]
}

# Add the Article class to the schema
client.schema.create_class(article_class)
print('Created schema')

To quickly check that the schema was created correctly, you can navigate to <weaviate-endpoint>/v1/schema. For example, in the Docker installation scenario, go to http://localhost:8080/v1/schema or run:

curl -s http://localhost:8080/v1/schema | jq
jq

The jq command used after curl is a handy command-line JSON processor. When text is simply piped through it, jq returns the text pretty-printed and syntax-highlighted.
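
If you would rather verify the settings programmatically than read the JSON by eye, a check along these lines is one option. It is only a sketch: `check_article_class` is a hypothetical helper, and it assumes the schema has the shape returned by the /v1/schema endpoint (a top-level "classes" list). The sample dictionary is hand-written for the demonstration.

```python
def check_article_class(schema):
    """Return a list of problems found in the Article class of a schema dict.

    Assumes `schema` looks like {"classes": [{"class": ..., ...}, ...]}.
    An empty list means the settings this tutorial relies on are in place.
    """
    classes = {c["class"]: c for c in schema.get("classes", [])}
    article = classes.get("Article")
    if article is None:
        return ["Article class is missing"]

    problems = []
    module_cfg = article.get("moduleConfig", {}).get("text2vec-openai", {})
    if module_cfg.get("vectorizeClassName") is not False:
        problems.append("vectorizeClassName should be false")

    props = {p["name"]: p for p in article.get("properties", [])}
    title_cfg = (
        props.get("title", {}).get("moduleConfig", {}).get("text2vec-openai", {})
    )
    if title_cfg.get("skip") is not True:
        problems.append("title should be skipped by the vectorizer")
    if "content" not in props:
        problems.append("content property is missing")
    return problems


# Hand-written sample mirroring the class definition above
sample_schema = {
    "classes": [
        {
            "class": "Article",
            "moduleConfig": {"text2vec-openai": {"vectorizeClassName": False}},
            "properties": [
                {
                    "name": "title",
                    "moduleConfig": {"text2vec-openai": {"skip": True}},
                },
                {"name": "content"},
            ],
        }
    ]
}
print(check_article_class(sample_schema))
```

With the Python client used in this tutorial, the dictionary to check would presumably come from `client.schema.get()`.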

Import the articles

We're now ready to import the articles. For maximum performance, we'll load the articles into Weaviate via batch import.

# ===== Import data =====
import ast

import pandas as pd

# Settings for displaying the import progress
counter = 0
interval = 100  # print progress every this many records

# Create a pandas dataframe iterator with lazy-loading,
# so we don't load all records in RAM at once.
csv_iterator = pd.read_csv(
    'vector_database_wikipedia_articles_embedded.csv',
    usecols=['id', 'url', 'title', 'text', 'content_vector'],
    chunksize=100,  # number of rows per chunk
    # nrows=350  # optionally limit the number of rows to import
)

# Iterate through the dataframe chunks and add each CSV record to the batch
client.batch.configure(batch_size=100)  # Configure batch
with client.batch as batch:
    for chunk in csv_iterator:
        for index, row in chunk.iterrows():

            properties = {
                "title": row.title,
                "content": row.text,
                "url": row.url
            }

            # Convert the vector from CSV string back to array of floats
            vector = ast.literal_eval(row.content_vector)

            # Add the object to the batch, and set its vector embedding
            batch.add_data_object(properties, "Article", vector=vector)

            # Calculate and display progress
            counter += 1
            if counter % interval == 0:
                print(f"Imported {counter} articles...")
print(f"Finished importing {counter} articles.")
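
Conceptually, batching means grouping records into fixed-size requests instead of sending one request per object. The sketch below illustrates only that idea with a hypothetical `batched` helper; it is not how the Weaviate client is implemented, and you do not need it for the import, since the client groups objects for you once a batch size is configured.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of up to `batch_size` items from any iterable, lazily."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 250 stand-in records split into batches of 100
batches = list(batched(range(250), 100))
print([len(b) for b in batches])  # [100, 100, 50]
```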

Checking the import went correctly

Two quick sanity checks that the import went as expected:

  1. Get the number of articles
  2. Get 5 articles

Go to the Weaviate GraphQL console, connect to your Weaviate endpoint (e.g. http://localhost:8080 or https://some-endpoint.weaviate.network), then run the GraphQL query below:

query {
  Aggregate { Article { meta { count } } }

  Get {
    Article(limit: 5) {
      title
      url
    }
  }
}

You should see the Aggregate.Article.meta.count field equal to the number of articles you've imported (e.g. 25,000), as well as five articles with their title and url fields.
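
The same two checks can be scripted. The following sketch works on a response dictionary and assumes the usual GraphQL response shape, with results under a top-level "data" key and the Aggregate results for a class returned as a list. The helper name `summarize_import` is hypothetical, and the sample response is hand-written rather than real output.

```python
def summarize_import(response, expected_count):
    """Return (count_matches, count, titles) from a GraphQL response dict.

    Hypothetical helper. Assumes the response contains both the Aggregate
    and the Get parts of the query shown above.
    """
    data = response["data"]
    count = data["Aggregate"]["Article"][0]["meta"]["count"]
    titles = [article["title"] for article in data["Get"]["Article"]]
    return count == expected_count, count, titles


# Hand-written stand-in for a real response
sample_response = {
    "data": {
        "Aggregate": {"Article": [{"meta": {"count": 25000}}]},
        "Get": {
            "Article": [
                {"title": "April", "url": "https://simple.wikipedia.org/wiki/April"}
            ]
        },
    }
}
ok, count, titles = summarize_import(sample_response, expected_count=25000)
print(ok, count, titles)
```

With the Python client used in this tutorial, a response of this kind could presumably be obtained by passing the query string to `client.query.raw(...)`.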

Queries

Now that we have the articles imported, let's run some queries!

nearText

The nearText filter lets us search for objects close (in vector space) to the vector representation of one or more concepts. For example, the vector for the query "modern art in Europe" would be close to the vector for the article Documenta, which describes

"one of the most important exhibitions of modern art in the world... [taking] place in Kassel, Germany".

{
  Get {
    Article(
      nearText: {concepts: ["modern art in Europe"]},
      limit: 1
    ) {
      title
      content
    }
  }
}
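
To make "close in vector space" concrete, here is a minimal sketch of cosine similarity, the measure that underlies Weaviate's default distance metric. The 3-dimensional vectors are invented for illustration (the real embeddings have 1536 dimensions), and the function is a plain Python stand-in, not Weaviate's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example
query = [0.9, 0.1, 0.0]           # stands in for "modern art in Europe"
art_article = [0.8, 0.2, 0.1]     # points in a similar direction
sports_article = [0.0, 0.1, 0.9]  # points elsewhere

closer_to_art = cosine_similarity(query, art_article) > cosine_similarity(
    query, sports_article
)
print(closer_to_art)  # True
```

A nearText search vectorizes the query text and returns the objects whose vectors score best on a comparison of this kind.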

hybrid

While nearText uses dense vectors to find objects similar in meaning to the search query, it does not perform very well on keyword searches. For example, a nearText search for "jackfruit" in this Simple English Wikipedia dataset finds "cherry tomato" as the top result. For these (and indeed, most) situations, we can obtain better search results by using the hybrid filter, which combines dense vector search with keyword search:

{
  Get {
    Article (
      hybrid: {
        query: "jackfruit"
        alpha: 0.5  # default 0.75
      }
      limit: 3
    ) {
      title
      content
      _additional {score}
    }
  }
}
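
The alpha parameter weights the two searches: 1 is a pure vector search and 0 is a pure keyword search. The sketch below shows the intuition with a simple weighted sum over invented, already-normalized scores. It is a deliberate simplification, since Weaviate's actual fusion of the two result lists works differently, and the names `blend_scores` and `rank` are hypothetical.

```python
def blend_scores(vector_score, keyword_score, alpha):
    """Weighted sum of two normalized scores; alpha=1 means vector only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1 - alpha) * keyword_score


# Invented scores: the keyword search matches "Jackfruit" exactly, while the
# vector search slightly prefers "Cherry tomato".
candidates = {
    "Jackfruit": {"vector": 0.70, "keyword": 1.00},
    "Cherry tomato": {"vector": 0.75, "keyword": 0.00},
}


def rank(alpha):
    """Return candidate titles ordered by blended score, best first."""
    return sorted(
        candidates,
        key=lambda title: blend_scores(
            candidates[title]["vector"], candidates[title]["keyword"], alpha
        ),
        reverse=True,
    )


for alpha in (1.0, 0.5):
    print(alpha, rank(alpha))
```

With these made-up numbers, the vector-only ranking puts "Cherry tomato" first, while an alpha of 0.5 lets the exact keyword match move "Jackfruit" to the top.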

Recap

In this tutorial, we've learned:

  • how to efficiently import large datasets using Weaviate batching and CSV lazy loading with pandas / csv-parser
  • how to import existing vectors ("Bring Your Own Vectors")
  • how to quickly check that all records were imported
  • how to use nearText and hybrid searches

Suggested reading