Wikipedia with custom vectors

🚧 To be updated 🚧

This tutorial is currently being updated to reflect the latest features and improvements in Weaviate. We appreciate your patience and invite you to check back soon for the updated content.

This tutorial will show you how to import a large dataset (25k articles from Wikipedia) that already includes vectors (embeddings generated by OpenAI). We will:

download and unzip a CSV file that contains the Wikipedia articles
create a Weaviate instance
create a schema
parse the file and batch import the records, with Python and JavaScript code
make sure the data was imported correctly
run a few queries to demonstrate semantic search capabilities

Prerequisites

If you haven't yet, we recommend going through the Quickstart tutorial first to get the most out of this section.

Before you start this tutorial, make sure to have:

An OpenAI API key. Even though we already have vector embeddings generated by OpenAI, we'll need an OpenAI key to vectorize search queries, and to recalculate vector embeddings for updated object contents.
Your preferred Weaviate client library installed.

See how to delete data from previous tutorials (or previous runs of this tutorial).

You can delete any unwanted collection(s), along with the data that they contain.

Deleting a collection also deletes its objects

When you delete a collection, you delete all associated objects!

Be very careful with deletes on a production database and anywhere else that you have important data.

This code deletes a collection and its objects.

# collection_name can be a string ("Article") or a list of strings (["Article", "Category"])
client.collections.delete(collection_name)  # THIS WILL DELETE THE SPECIFIED COLLECTION(S) AND THEIR OBJECTS

# Note: you can also delete all collections in the Weaviate instance with:
# client.collections.delete_all()


# delete class "Article" - THIS WILL DELETE ALL DATA IN THIS CLASS
client.schema.delete_class("Article")  # Replace with your class name

// delete collection "Article" - THIS WILL DELETE THE COLLECTION AND ALL ITS DATA
await client.collections.delete('Article')

// you can also delete all collections of a cluster
// await client.collections.deleteAll()

// delete collection "Article" - THIS WILL DELETE THE COLLECTION AND ALL ITS DATA
await client.schema
  .classDeleter()
  .withClassName('Article')
  .do();

className := "YourClassName"

// delete the class
if err := client.Schema().ClassDeleter().WithClassName(className).Do(context.Background()); err != nil {
  // Weaviate will return a 400 if the class does not exist, so this is allowed, only return an error if it's not a 400
  if status, ok := err.(*fault.WeaviateClientError); ok && status.StatusCode != http.StatusBadRequest {
    panic(err)
  }
}

Result<Boolean> result = client.schema().classDeleter()
    .withClassName(collectionName)
    .run();


curl \
  -X DELETE \
  https://WEAVIATE_INSTANCE_URL/v1/schema/YourClassName  # Replace WEAVIATE_INSTANCE_URL with your instance URL

Download the dataset

We will use this Simple English Wikipedia dataset hosted by OpenAI (~700MB zipped, 1.7GB CSV file) that includes vector embeddings. These are the columns of interest, where content_vector is a vector embedding with 1536 elements (dimensions), generated using OpenAI's text-embedding-ada-002 model:

id	url	title	text	content_vector
1	https://simple.wikipedia.org/wiki/April	April	"April is the fourth month of the year..."	[-0.011034, -0.013401, ..., -0.009095]

If you haven't already, make sure to download the dataset and unzip the file. You should end up with vector_database_wikipedia_articles_embedded.csv in your working directory. The records are mostly (but not strictly) sorted by title.

Download Wikipedia dataset ZIP
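If you prefer to script the unzip step, the following is a minimal sketch using only the Python standard library. The archive name in `ZIP_NAME` is an assumption (adjust it to whatever file you downloaded), and `extract_csv` is a hypothetical helper, not part of any Weaviate client.

```python
import shutil
import zipfile
from pathlib import Path

# Assumed archive name; adjust it to match the file you actually downloaded.
ZIP_NAME = "vector_database_wikipedia_articles_embedded.zip"
CSV_NAME = "vector_database_wikipedia_articles_embedded.csv"


def extract_csv(zip_path, dest_dir=".", csv_name=CSV_NAME):
    """Extract csv_name from the ZIP archive into dest_dir and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / csv_name
    if target.exists():
        return target  # already unzipped, nothing to do

    with zipfile.ZipFile(zip_path) as archive:
        # Match on the file name only, in case the archive nests the CSV in a folder
        matches = [n for n in archive.namelist() if Path(n).name == csv_name]
        if not matches:
            raise FileNotFoundError(f"{csv_name} not found in {zip_path}")
        # Stream the member to disk instead of reading 1.7GB into memory
        with archive.open(matches[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Calling `extract_csv(ZIP_NAME)` from your working directory should leave the CSV next to your import script, and calling it again is a no-op once the file exists.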

Create a Weaviate instance

We can create a Weaviate instance locally using the embedded option on Linux (transparent and fastest), Docker on any OS (fastest import and search), or in the cloud using the Weaviate Cloud (easiest setup, but importing may be slower due to the network speed). Each option is explained on its Installation page.

text2vec-openai

If using the Docker option, make sure to select "With Modules" (instead of standalone), and the text2vec-openai module when using the Docker configurator, at the "Vectorizer & Retriever Text Module" step. At the "OpenAI Requires an API Key" step, you can choose to "provide the key with each request", as we'll do in the next section.

Connect to the instance and OpenAI

Add the OpenAI API key to the client so you can use the OpenAI vectorizer API when you send queries to Weaviate.

The API key can be provided to Weaviate as an environment variable, or in the HTTP header with every request. This example adds the key to the client. The client sends the key with every request as a part of the HTTP request header.

Python
JS/TS Client v2
Go
Java
Curl

import weaviate

# Instantiate the client with the auth config
client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL",  # Replace with your Weaviate endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace with your Weaviate instance API key
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
)

import weaviate, { ApiKey } from 'weaviate-ts-client';

// Instantiate the client with the auth config
const client = weaviate.client({
  scheme: 'https',
  host: 'WEAVIATE_INSTANCE_URL',  // Replace WEAVIATE_INSTANCE_URL with your instance URL
  apiKey: new ApiKey('YOUR-WEAVIATE-API-KEY'),  // Replace with your Weaviate instance API key
  headers: {
    'X-OpenAI-Api-Key': process.env['OPENAI_API_KEY'],  // Reads your OpenAI API key from the OPENAI_API_KEY environment variable
  },
});

package main

import (
  "fmt"

  "github.com/weaviate/weaviate-go-client/v5/weaviate"
  "github.com/weaviate/weaviate-go-client/v5/weaviate/auth"
)

func main() {
  // Instantiate the client with the auth config
  cfg := weaviate.Config{
    Host:       "WEAVIATE_INSTANCE_URL", // Replace WEAVIATE_INSTANCE_URL with your instance URL
    Scheme:     "https",
    AuthConfig: auth.ApiKey{Value: "YOUR-WEAVIATE-API-KEY"}, // Replace with your Weaviate instance API key
    Headers: map[string]string{
      "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
  }

  client, err := weaviate.NewClient(cfg)
  if err != nil {
    fmt.Println(err)
    return
  }
  _ = client // the client is now ready to use
}

import io.weaviate.client.Config;
import io.weaviate.client.WeaviateAuthClient;
import io.weaviate.client.WeaviateClient;
import java.util.HashMap;
import java.util.Map;

Map<String, String> headers = new HashMap<String, String>() { {
  put("X-OpenAI-Api-Key", "YOUR-OPENAI-API-KEY");
} };
// Replace WEAVIATE_INSTANCE_URL with your instance URL
Config config = new Config("https", "WEAVIATE_INSTANCE_URL", headers);
// Replace with your Weaviate instance API key
WeaviateClient client = WeaviateAuthClient.apiKey(config, "YOUR-WEAVIATE-API-KEY");

# Replace WEAVIATE_INSTANCE_URL with your instance URL

curl https://WEAVIATE_INSTANCE_URL/v1/meta \
-H 'Content-Type: application/json' \
-H "X-OpenAI-Api-Key: YOUR-OPENAI-API-KEY" \
-H "Authorization: Bearer YOUR-WEAVIATE-API-KEY" | jq
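To avoid hard-coding the OpenAI key in a script, one option is to read it from the environment and build the request header from it. This is a minimal sketch: `build_headers` is a hypothetical helper, and it assumes the key lives in an `OPENAI_API_KEY` environment variable, as in the JavaScript and curl examples.

```python
import os


def build_headers(env=os.environ):
    """Return the extra HTTP headers that carry the OpenAI key to Weaviate."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first")
    return {"X-OpenAI-Api-Key": key}
```

The returned dictionary can then be passed as `additional_headers` when instantiating the Python client shown above, so a missing key fails early instead of at query time.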

Create the schema

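The schema defines the data structure for objects in a given Weaviate class. We'll create a schema for a Wikipedia Article class mapping the CSV columns, and using the text2vec-openai vectorizer. The schema will declare two properties (the import code further down also sets a url property, which Weaviate's auto-schema feature, enabled by default, creates on import):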

title - article title, not vectorized
content - article content, corresponding to the text column from the CSV

As of Weaviate 1.18, the text2vec-openai vectorizer defaults to the same model as the OpenAI dataset, text-embedding-ada-002. To make sure the tutorial keeps working the same way if this default changes (for example, if OpenAI releases a better-performing model and Weaviate switches to it as the default), we'll configure the schema vectorizer explicitly to use the same model:

{
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text"
    }
  }
}

Another detail to be careful about is how exactly we store the content_vector embedding. Weaviate vectorizes entire objects (not properties), and it includes by default the class name in the string serialization of the object it will vectorize. Since OpenAI has provided embeddings only for the text (content) field, we need to make sure Weaviate vectorizes an Article object the same way. That means we need to disable including the class name in the vectorization, so we must set vectorizeClassName: false in the text2vec-openai section of the moduleConfig. Together, these schema settings will look like this:

Python
JS/TS Client v2

# client.schema.delete_all()  # ⚠️ uncomment to start from scratch by deleting ALL data

# ===== Create Article class for the schema =====
article_class = {
    "class": "Article",
    "description": "An article from the Simple English Wikipedia data set",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        # Match how OpenAI created the embeddings for the `content` (`text`) field
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": False
        }
    },
    "properties": [
        {
            "name": "title",
            "description": "The title of the article",
            "dataType": ["text"],
            # Don't vectorize the title
            "moduleConfig": {"text2vec-openai": {"skip": True}}
        },
        {
            "name": "content",
            "description": "The content of the article",
            "dataType": ["text"],
        }
    ]
}

# Add the Article class to the schema
client.schema.create_class(article_class)
print('Created schema')

try {
  await client.schema.classDeleter().withClassName('Article').do();
  console.log('Deleted existing Articles');
} catch (e) {
  // The thrown value may be an Error or a string, so normalize it before matching
  if (!String(e).match(/could not find class/))
    throw e;
}

// ===== Create Article class for the schema =====
const articleClass = {
  class: 'Article',
  description: 'An article from the Simple English Wikipedia data set',
  vectorizer: 'text2vec-openai',
  moduleConfig: {
    // Match how OpenAI created the embeddings for the `content` (`text`) field
    'text2vec-openai': {
      model: 'ada',
      modelVersion: '002',
      type: 'text',
      vectorizeClassName: false,
    },
  },
  properties: [
    {
      name: 'title',
      description: 'The title of the article',
      dataType: ['text'],
      // Don't vectorize the title
      moduleConfig: { 'text2vec-openai': { skip: true } },
    },
    {
      name: 'content',
      description: 'The content of the article',
      dataType: ['text'],
    },
  ],
};

// Add the Article class to the schema
await client.schema.classCreator().withClass(articleClass).do();
console.log('Created schema');

To quickly check that the schema was created correctly, you can navigate to <weaviate-endpoint>/v1/schema. For example, in the Docker installation scenario, go to http://localhost:8080/v1/schema or run:

curl -s http://localhost:8080/v1/schema | jq

The jq command used after curl is a handy command-line JSON processor. When you simply pipe JSON text through it, jq returns the text pretty-printed and syntax-highlighted.
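Rather than reading the schema output by eye, you could check the settings that matter for this tutorial programmatically. The sketch below is a hypothetical helper, `check_article_schema`, that assumes the `/v1/schema` response has already been parsed into a Python dictionary with a top-level `classes` list.

```python
def check_article_schema(schema):
    """Return a list of problems found in a parsed /v1/schema response (empty if none)."""
    classes = {c["class"]: c for c in schema.get("classes", [])}
    article = classes.get("Article")
    if article is None:
        return ["class 'Article' is missing"]

    problems = []
    if article.get("vectorizer") != "text2vec-openai":
        problems.append("vectorizer is not text2vec-openai")

    # The class name must be excluded so vectors match the OpenAI-provided ones
    module = article.get("moduleConfig", {}).get("text2vec-openai", {})
    if module.get("vectorizeClassName") is not False:
        problems.append("vectorizeClassName should be false")

    names = {p["name"] for p in article.get("properties", [])}
    for required in ("title", "content"):
        if required not in names:
            problems.append(f"property '{required}' is missing")
    return problems
```

An empty list means the class, vectorizer, `vectorizeClassName` setting, and both declared properties are all in place.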

Import the articles

We're now ready to import the articles. For maximum performance, we'll load the articles into Weaviate via batch import.

Python
JS/TS Client v2

# ===== Import data =====
import ast

import pandas as pd

# Settings for displaying the import progress
counter = 0
interval = 100  # print progress every this many records

# Create a pandas dataframe iterator with lazy-loading,
# so we don't load all records in RAM at once.
csv_iterator = pd.read_csv(
    'vector_database_wikipedia_articles_embedded.csv',
    usecols=['id', 'url', 'title', 'text', 'content_vector'],
    chunksize=100,  # number of rows per chunk
    # nrows=350  # optionally limit the number of rows to import
)

# Iterate through the dataframe chunks and add each CSV record to the batch
client.batch.configure(batch_size=100)  # Configure batch
with client.batch as batch:
    for chunk in csv_iterator:
        for index, row in chunk.iterrows():

            properties = {
                "title": row.title,
                "content": row.text,
                "url": row.url
            }

            # Convert the vector from CSV string back to array of floats
            vector = ast.literal_eval(row.content_vector)

            # Add the object to the batch, and set its vector embedding
            batch.add_data_object(properties, "Article", vector=vector)

            # Calculate and display progress
            counter += 1
            if counter % interval == 0:
                print(f"Imported {counter} articles...")
print(f"Finished importing {counter} articles.")

// ===== Import data =====
import fs from 'fs';
import csv from 'csv-parser';

async function importCSV(filePath) {
  let batcher = client.batch.objectsBatcher();
  let counter = 0;
  const batchSize = 100;

  return new Promise((resolve, reject) => {
    const stream = fs.createReadStream(filePath).pipe(csv());
    stream
      .on('data', async (row) => {
        // Import each record
        const obj = {
          class: 'Article',
          properties: {
            title: row.title,
            content: row.text,
            url: row.url,
          },
          vector: JSON.parse(row['content_vector']),
        };
        // Add the object to the batch queue
        batcher = batcher.withObject(obj);
        counter++;

        // When the batch counter reaches batchSize, push the objects to Weaviate
        if (counter % batchSize === 0) {
          // Pause the stream and swap in a fresh batcher before awaiting,
          // so rows that arrive during the flush are not lost
          stream.pause();
          const fullBatch = batcher;
          batcher = client.batch.objectsBatcher();
          try {
            await fullBatch.do();
            console.log(`Imported ${counter} articles...`);
            stream.resume();
          } catch (e) {
            reject(e);
          }
        }
      })
      .on('error', reject)
      .on('end', async () => {
        try {
          // Flush the remaining objects, if any are left in the queue
          if (counter % batchSize !== 0) await batcher.do();
          console.log(`Finished importing ${counter} articles.`);
          resolve();
        } catch (e) {
          reject(e);
        }
      });
  });
}

await importCSV('vector_database_wikipedia_articles_embedded.csv');
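If you would rather not depend on pandas, the same lazy-loading idea can be sketched with the standard library alone. In this sketch, `iter_batches` is a hypothetical helper that yields lists of `(properties, vector)` pairs. It assumes the `content_vector` column is a JSON-style list of floats (as the JavaScript example's `JSON.parse` call suggests) and checks each vector against the 1536 dimensions mentioned earlier.

```python
import csv
import json
from itertools import islice

# Article text and embedding strings can exceed csv's default 128 KiB field limit
csv.field_size_limit(10**9)

EXPECTED_DIMENSIONS = 1536  # length of each text-embedding-ada-002 vector


def iter_batches(csv_path, batch_size=100, dimensions=EXPECTED_DIMENSIONS):
    """Lazily yield lists of (properties, vector) tuples, batch_size rows at a time."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            # islice pulls at most batch_size rows, so only one batch is in memory
            rows = list(islice(reader, batch_size))
            if not rows:
                return
            batch = []
            for row in rows:
                vector = json.loads(row["content_vector"])
                if len(vector) != dimensions:
                    raise ValueError(
                        f"row {row['id']}: expected {dimensions} dimensions, got {len(vector)}"
                    )
                properties = {
                    "title": row["title"],
                    "content": row["text"],
                    "url": row["url"],
                }
                batch.append((properties, vector))
            yield batch
```

Each yielded pair maps directly onto the `batch.add_data_object(properties, "Article", vector=vector)` call from the Python import code above.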

Checking the import went correctly

Two quick sanity checks that the import went as expected:

Get the number of articles
Get 5 articles

Open the Weaviate Query app
Connect to your Weaviate endpoint, either http://localhost:8080 or https://WEAVIATE_INSTANCE_URL. (Replace WEAVIATE_INSTANCE_URL with your instance URL.)
Run this GraphQL query:

query {
  Aggregate { Article { meta { count } } }

  Get {
    Article(limit: 5) {
      title
      url
    }
  }
}

You should see the Aggregate.Article.meta.count field equal to the number of articles you've imported (e.g. 25,000), as well as five articles with their title and url fields.
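If you run the same query from a script, the check can be automated. The sketch below assumes the GraphQL response has been parsed into a dictionary with the usual `data` envelope, where `Aggregate.Article` is a one-element list. `summarize_import` is a hypothetical helper name.

```python
def summarize_import(response, expected_count):
    """Pull the object count and sample titles out of the sanity-check response."""
    data = response["data"]
    # Aggregate returns a list with one entry when no grouping is requested
    count = data["Aggregate"]["Article"][0]["meta"]["count"]
    titles = [article["title"] for article in data["Get"]["Article"]]
    return {
        "count": count,
        "complete": count == expected_count,
        "sample_titles": titles,
    }
```

For a full import you would pass `expected_count=25000`. A `complete` value of `False` suggests that some batches failed or that the import was cut short.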

Queries

Now that we have the articles imported, let's run some queries!

nearText

The nearText operator lets us search for objects close (in vector space) to the vector embedding of one or more concepts. For example, the vector for the query "modern art in Europe" would be close to the vector for the article Documenta, which describes

"one of the most important exhibitions of modern art in the world... [taking] place in Kassel, Germany".

GraphQL
Python
JS/TS Client v2
Curl

{
  Get {
    Article(
      nearText: {concepts: ["modern art in Europe"]},
      limit: 1
    ) {
      title
      content
    }
  }
}

import weaviate
import json

client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL/",  # replace with your Weaviate endpoint
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"  # Replace with your API key
    }
)

nearText = {"concepts": ["modern art in Europe"]}

result = (
    client.query
    .get("Article", ["title", "content"])
    .with_near_text(nearText)
    .with_limit(1)
    .do()
)

print(json.dumps(result, indent=4))

const nearTextResult = await client.graphql
  .get()
  .withClassName('Article')
  .withFields('title content')
  .withNearText({ concepts: ['modern art in Europe'] })
  .withLimit(1)
  .do();

console.log('nearText: modern art in Europe = ', JSON.stringify(nearTextResult.data['Get']['Article'], null, 2));

echo '{
  "query": "{
    Get {
      Article(
        nearText: {
          concepts: [\"modern art in Europe\"],
        },
        limit: 1
      ) {
        title
      }
    }
  }"
}' | curl \
    -X POST \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer learn-weaviate' \
    -H "X-OpenAI-Api-Key: $OPENAI_API_KEY" \
    -d @- \
    https://edu-demo.weaviate.network/v1/graphql
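To make "close in vector space" concrete, here is a brute-force sketch of what a nearText search does conceptually: the query text is embedded into a vector, and stored vectors are ranked by distance to it. It uses cosine distance, which is Weaviate's default metric. Weaviate itself relies on an approximate nearest-neighbor index rather than a full scan, so treat this purely as an illustration. `cosine_distance` and `nearest` are hypothetical names.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vector, candidates, limit=1):
    """Rank a {title: vector} mapping by distance to query_vector (brute force)."""
    ranked = sorted(candidates, key=lambda t: cosine_distance(query_vector, candidates[t]))
    return ranked[:limit]
```

With real data, `query_vector` would be the 1536-dimensional embedding of "modern art in Europe" and `candidates` would map article titles to their `content_vector` values.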

hybrid

While nearText uses dense vectors to find objects similar in meaning to the search query, it does not perform very well on keyword searches. For example, a nearText search for "jackfruit" in this Simple English Wikipedia dataset will find "cherry tomato" as the top result. For these (and indeed, most) situations, we can obtain better search results by using the hybrid operator, which combines dense vector search with keyword search:

GraphQL
Python
JS/TS Client v2
Curl

{
  Get {
    Article (
      hybrid: {
        query: "jackfruit"
        alpha: 0.5  # default 0.75
      }
      limit: 3
    ) {
      title
      content
      _additional {score}
    }
  }
}

result = (
    client.query
    .get("Article", ["title", "content"])
    .with_hybrid("jackfruit", alpha=0.5)  # default 0.75
    .with_limit(3)
    .do()
)

print(json.dumps(result, indent=4))

const hybridResult = await client.graphql
  .get()
  .withClassName('Article')
  .withFields('title content _additional{score}')
  .withHybrid({
    query: 'jackfruit',
    alpha: 0.5,  // optional, defaults to 0.75
  })
  .withLimit(3)
  .do();

console.log('hybrid: jackfruit = ', JSON.stringify(hybridResult.data['Get']['Article'], null, 2));

echo '{
  "query": "{
      Get {
        Article (
          hybrid: {
            query: \"jackfruit\"
            alpha: 0.5
          }
          limit: 3
        ) {
          title
          _additional {score}
      }
    }
  }"
}' | curl \
    -X POST \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer learn-weaviate' \
    -H "X-OpenAI-Api-Key: $OPENAI_API_KEY" \
    -d @- \
    https://edu-demo.weaviate.network/v1/graphql
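The role of alpha can be illustrated with a simplified score-fusion sketch. This is not Weaviate's exact fusion algorithm; it only assumes that each search produces a score per object, scales both sets to the 0..1 range, and blends them so that alpha = 1 is a pure vector search and alpha = 0 is a pure keyword search. The function names are hypothetical.

```python
def normalize(scores):
    """Min-max scale a {title: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {title: 1.0 for title in scores}
    return {title: (value - lo) / (hi - lo) for title, value in scores.items()}


def hybrid_rank(keyword_scores, vector_scores, alpha=0.75):
    """Blend keyword and vector scores; higher alpha favors the vector search."""
    kw = normalize(keyword_scores)
    vec = normalize(vector_scores)
    combined = {
        # An object missing from one result list contributes 0 for that search
        title: alpha * vec.get(title, 0.0) + (1 - alpha) * kw.get(title, 0.0)
        for title in set(kw) | set(vec)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

In this sketch, an article that matches the keyword "jackfruit" exactly but ranks second in the vector search can overtake the vector-only top hit once alpha drops to 0.5, which mirrors the behavior described above.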

Recap

In this tutorial, we've learned

how to efficiently import large datasets using Weaviate batching and CSV lazy loading with pandas / csv-parser
how to import existing vectors ("Bring Your Own Vectors")
how to quickly check that all records were imported
how to use nearText and hybrid searches

Questions and feedback

If you have any questions or feedback, let us know in the user forum.
