Bring your own vectors
Weaviate is a vector database. Vector databases store data objects and vectors that represent those objects. When you import a data object, you normally include a vector representation of that object as well. The vector representation is also called an "embedding."
Later, when you work with the stored vectors, Weaviate uses a vectorized version of your query to search the vector space.
This guide discusses importing data that already has vectors. An alternative approach is to use a vectorizer model that generates vectors at import time. Queries have to be vectorized using the same vectorizer as the data objects. When you define your collection schema, be sure to define the same vectorizer that you use to vectorize the data.
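To make the search step concrete, the following is a minimal sketch of how a query vector can be compared with stored vectors. It uses plain Python, cosine similarity, and made-up three-dimensional vectors. The function names `cosine_similarity` and `nearest` are illustrative and are not part of the Weaviate API, and a linear scan like this is only a stand-in for the index that a vector database actually uses.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vector, stored):
    """Return the key of the stored vector that is most similar to the query."""
    return max(stored, key=lambda name: cosine_similarity(query_vector, stored[name]))

# Toy vectors (assumed values, for illustration only)
stored_vectors = {
    "elephant": [0.9, 0.1, 0.0],
    "liver": [0.1, 0.9, 0.1],
    "wire": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
best_match = nearest(query, stored_vectors)
```

The comparison is only meaningful when the query vector and the stored vectors come from the same model, which is why the query and the data must share a vectorizer.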
Weaviate supports embeddings generated by popular third-party services and custom vector embeddings that you create. For details on how to connect your client application to a vectorizer, see the vectorizer integration pages in the Weaviate documentation.
Guide outline
This guide uses a free sandbox in Weaviate Cloud. If you don't have a Weaviate Cloud account, you can create one or else follow along on a Weaviate instance of your own.
Follow these steps to import data that includes vectors.
Create a sandbox instance
Follow these steps to create a sandbox instance.
Login to Weaviate Cloud
- Open the WCD login page in a browser.
- Click Login to Weaviate Cloud Services.
- Enter your email address and password to authenticate.
Create a sandbox cluster
To create a cluster, click the 'Create cluster' button on the WCD Dashboard page.
- Select the "Free sandbox" tab.
- Give your cluster a name. WCD adds a random suffix to sandbox cluster names to ensure uniqueness.
- Verify that "Enable Authentication?" is set to "Yes".
- Click create.
It takes a minute or two to create the new cluster. When the cluster is ready, WCD displays a check mark (✔️) next to the cluster name.
Connect to your sandbox
Follow these steps to connect to your sandbox instance.
Get the connection details
To connect to your sandbox instance, you need the following information:
- The Weaviate sandbox URL
- The Weaviate sandbox API key
To get the cluster URL and authentication details, follow these steps:
- Click the Details button to open the Details panel.
- Copy the REST endpoint.
- To get the API keys, click the API keys button. Copy the Weaviate API key.
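One way to keep these two values out of your source code is to store them in environment variables and check them before you connect. The sketch below assumes the variable names `WCS_URL` and `WCS_API_KEY`, which match the names used by the Python examples in this guide. The helper `load_connection_settings` is hypothetical and is not part of any Weaviate client.

```python
import os

def load_connection_settings(env=os.environ):
    """Read the sandbox URL and API key from a mapping of environment variables."""
    settings = {}
    for name in ("WCS_URL", "WCS_API_KEY"):
        value = env.get(name, "").strip()
        if not value:
            raise KeyError(f"Set the {name} environment variable before connecting")
        settings[name] = value
    return settings

# Example with placeholder values (not real credentials)
example_env = {
    "WCS_URL": "https://WEAVIATE_INSTANCE_URL",
    "WCS_API_KEY": "YOUR-WEAVIATE-API-KEY",
}
settings = load_connection_settings(example_env)
```

Calling `load_connection_settings()` with no argument reads from the real process environment, so a missing variable fails early with a clear message instead of a connection error.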
Connect to a Weaviate instance
To connect to your sandbox, edit this sample code to use your Weaviate sandbox URL and your Weaviate API key. You will use this client later when you create a schema and when you upload your data.
This guide demonstrates how to import data objects that already have vectors. If you import data that has to be vectorized, your client connection code should also include the API key for the vectorizer API.
- Python Client v4
- Python Client v3
- JS/TS Client v3
- JS/TS Client v2
- Go
- Curl
import weaviate, os
import weaviate.classes as wvc
# Set these environment variables
URL = os.getenv("WCS_URL")
APIKEY = os.getenv("WCS_API_KEY")
# Connect to Weaviate Cloud
client = weaviate.connect_to_wcs(
cluster_url=URL,
auth_credentials=weaviate.auth.AuthApiKey(APIKEY),
)
# Check connection
client.is_ready()
import weaviate
import json
client = weaviate.Client(
url = "https://WEAVIATE_INSTANCE_URL", # Replace with your Weaviate endpoint
auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"), # Replace with your Weaviate instance API key
)
import weaviate, { WeaviateClient } from 'weaviate-client'
const client: WeaviateClient = await weaviate.connectToWCS(
'https://WEAVIATE_INSTANCE_URL', // Replace with your Weaviate endpoint
{
authCredentials: new weaviate.ApiKey('YOUR-WEAVIATE-API-KEY'), // Replace with your Weaviate instance API key
headers: {
'X-OpenAI-Api-Key': process.env.OPENAI_API_KEY || '', // Replace with your inference API key
}
}
)
import weaviate, { WeaviateClient, ObjectsBatcher, ApiKey } from 'weaviate-ts-client';
import fetch from 'node-fetch';
const client: WeaviateClient = weaviate.client({
scheme: 'https',
host: 'WEAVIATE_INSTANCE_URL', // Replace with your Weaviate endpoint
apiKey: new ApiKey('YOUR-WEAVIATE-API-KEY'), // Replace with your Weaviate instance API key
});
package main

import (
	"github.com/weaviate/weaviate-go-client/v4/weaviate"
	"github.com/weaviate/weaviate-go-client/v4/weaviate/auth"
)

func main() {
	cfg := weaviate.Config{
		Host:       "WEAVIATE_INSTANCE_URL/", // Replace with your Weaviate endpoint
		Scheme:     "https",
		AuthConfig: auth.ApiKey{Value: "YOUR-WEAVIATE-API-KEY"}, // Replace with your Weaviate instance API key
	}
	client, err := weaviate.NewClient(cfg)
	if err != nil {
		panic(err)
	}
	// Go rejects unused variables, so mark the client as used until later steps need it
	_ = client
}
With curl, add the API key to the header as shown below:
echo '{
"query": "<QUERY>"
}' | curl \
-X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR-WEAVIATE-API-KEY" \
-d @- \
https://WEAVIATE_INSTANCE_URL/v1/graphql # Replace WEAVIATE_INSTANCE_URL with your instance URL
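For comparison, the same request can be assembled with the Python standard library alone. This is a sketch rather than a recommended client: it builds the request object without sending it, the helper name `build_graphql_request` is hypothetical, and the URL, key, and query are the same placeholders that the curl example uses.

```python
import json
import urllib.request

def build_graphql_request(instance_url, api_key, query):
    """Build (but do not send) a POST request for the /v1/graphql endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{instance_url.rstrip('/')}/v1/graphql",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # Same header as the curl example
        },
    )

# Placeholder values; replace them with your instance URL, API key, and query
request = build_graphql_request(
    "https://WEAVIATE_INSTANCE_URL", "YOUR-WEAVIATE-API-KEY", "<QUERY>"
)
# To send the request: urllib.request.urlopen(request)
```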
Prepare the collection
Weaviate stores data in collections. Each object has a set of properties and a vector representation. Before you import data, you should create a collection schema to define collection properties and the vectorizer. Vectorizers are defined at the collection level. You can also define specific details at the property level.
When you import data objects that don't have vectors, Weaviate uses the vectorizer details to create vectors at import time. If the imported data objects do have vectors, Weaviate uses those vectors instead of generating new ones.
The vectorizer that you specify in the collection schema must be the same vectorizer that you use to vectorize the data objects. If the vectorizers are different, vector search results are meaningless.
Create a collection schema
Collection schemas are highly configurable. If you don't provide values for all of the parameters, the auto-schema feature attempts to provide the missing values during import.
In this example, the vectors are generated by an OpenAI model, ada-002. Weaviate provides an OpenAI integration, text2vec-openai, that can access ada-002. The schema definition specifies the text2vec-openai vectorizer so that queries and additional inserts use the same vectorizer.
Objects have vectors and properties. Be careful not to specify your vectors as properties. Weaviate processes vector embeddings and properties differently.
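A small helper can keep the two apart before you build the objects to import. The sketch below assumes that each raw record stores its embedding under a `Vector` key and that the remaining keys map to lower-case property names. The function `split_record` and the shortened sample vector are illustrative only.

```python
def split_record(record, vector_key="Vector"):
    """Separate a raw record into (properties, vector)."""
    if vector_key not in record:
        raise KeyError(f"record has no {vector_key!r} field")
    vector = [float(x) for x in record[vector_key]]
    # Every other field becomes a property; lower-case the keys to match the schema
    properties = {
        key.lower(): value for key, value in record.items() if key != vector_key
    }
    return properties, vector

# Sample record with a shortened, made-up vector
raw = {
    "Category": "SCIENCE",
    "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
    "Answer": "Liver",
    "Vector": [-0.0066, -0.0042, -0.0202],
}
properties, vector = split_record(raw)
```

The `properties` dictionary is what you pass as the object's properties, and `vector` is what you pass as the object's vector, so the embedding never ends up stored as an ordinary property.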
- Python Client v4
- Python Client v3
- JS/TS Client v3
- JS/TS Client v2
# Set these environment variables
# WCS_URL - The URL for your Weaviate instance
# WCS_API_KEY - The API key for your Weaviate instance
# OPENAI_API_KEY - The API key for your OpenAI account
import weaviate, os
import weaviate.classes as wvc
# The with-as context manager closes the connection when your code exits
with weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCS_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCS_API_KEY")),
) as client:
    questions = client.collections.create(
        name="Question",
        properties=[
            wvc.config.Property(
                name="question",
                description="What to ask",
                data_type=wvc.config.DataType.TEXT,
                tokenization=wvc.config.Tokenization.WORD,
                index_searchable=True,
                index_filterable=True,
            ),
            wvc.config.Property(
                name="answer",
                description="The clue",
                data_type=wvc.config.DataType.TEXT,
                tokenization=wvc.config.Tokenization.WORD,
                index_searchable=True,
                index_filterable=True,
            ),
            wvc.config.Property(
                name="category",
                description="The subject",
                data_type=wvc.config.DataType.TEXT,
                tokenization=wvc.config.Tokenization.WORD,
                index_searchable=True,
                index_filterable=True,
            ),
        ],
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
        generative_config=wvc.config.Configure.Generative.openai(),
    )
# Set these environment variables
# WCS_URL - The URL for your Weaviate instance
# WCS_API_KEY - The API key for your Weaviate instance
# OPENAI_API_KEY - The API key for your OpenAI account
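import os
import json
import weaviate
# Read the connection details from the environment
WCS_URL = os.getenv("WCS_URL")
WCS_API_KEY = os.getenv("WCS_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = weaviate.Client(
    url=WCS_URL,
    auth_client_secret=weaviate.auth.AuthApiKey(api_key=WCS_API_KEY),
    additional_headers={
        "X-OpenAI-Api-Key": OPENAI_API_KEY
    }
)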
class_obj = {
"class": "Question",
"description": "Information from a Jeopardy! question",
"vectorizer": "text2vec-openai",
"moduleConfig": {
"generative-openai": {}
},
"properties": [
{
"name": "question",
"dataType": ["text"],
"description": "What to ask",
"moduleConfig": {
"text2vec-openai": {
"vectorizePropertyName": True,
"tokenization": "word"
}
}
},
{
"name": "answer",
"dataType": ["text"],
"description": "The clue",
"moduleConfig": {
"text2vec-openai": {
"vectorizePropertyName": False,
"tokenization": "word"
}
}
},
{
"name": "category",
"dataType": ["text"],
"description": "The subject",
"moduleConfig": {
"text2vec-openai": {
"vectorizePropertyName": False,
"tokenization": "word"
}
}
},
],
}
# add the schema
client.schema.create_class(class_obj)
// Set these environment variables
// WCS_URL - The URL for your Weaviate instance
// WCS_API_KEY - The API key for your Weaviate instance
// OPENAI_API_KEY - The API key for your OpenAI account
import weaviate, { WeaviateClient } from 'weaviate-client';
const WCS_URL=process.env["WCS_URL"];
const WCS_API_KEY=process.env["WCS_API_KEY"];
const OPENAI_API_KEY=process.env["OPENAI_API_KEY"];
const client: WeaviateClient = await weaviate.connectToWCS(
WCS_URL,
{
authCredentials: new weaviate.ApiKey(WCS_API_KEY),
headers: {
'X-OpenAI-Api-Key': OPENAI_API_KEY,
}
}
)
const newCollection = await client.collections.create({
name: 'Question',
properties: [
{
name: 'question',
description: 'What to ask',
dataType: weaviate.configure.dataType.TEXT,
vectorizer: 'text2vec-openai',
vectorizePropertyName: true,
tokenization: 'word',
},
{
name: 'answer',
description: 'The clue',
dataType: weaviate.configure.dataType.TEXT,
vectorizer: 'text2vec-openai',
tokenization: 'word',
skipVectorization: true
},
{
name: 'category',
description: 'The subject',
dataType: weaviate.configure.dataType.TEXT,
vectorizer: 'text2vec-openai',
tokenization: 'word',
skipVectorization: true
},
],
})
// Display schema as verification
const collectionDefinition = await client.collections.get('Question')
console.log(await collectionDefinition.config.get())
// Set these environment variables
// WCS_URL - The URL for your Weaviate instance
// WCS_API_KEY - The API key for your Weaviate instance
// OPENAI_API_KEY - The API key for your OpenAI account
Import the data
The example data is based on a set of ten questions from the "Jeopardy!" television program. Each of the data objects contains the following elements:
- A vector embedding
- A question
- A category
- An answer
You can use any vectorizer. In this example, OpenAI is the vectorizer. The question, answer, and category are the raw data. The vector embedding is the vector OpenAI returns after processing the raw data.
Create a JSON formatted data file to use in your import. In this example, the data import file is already prepared. The JSON file encodes this data:
View the dataset
# | Category | Question | Answer | Vector |
---|---|---|---|---|
0 | SCIENCE | This organ removes excess glucose from the blood & stores it as glycogen | Liver | [ -0.006632288, -0.0042016874, ..., -0.020163147 ] |
1 | ANIMALS | It's the only living mammal in the order Proboseidea | Elephant | [ -0.0166891, -0.00092290324, ..., -0.032253385 ] |
2 | ANIMALS | The gavial looks very much like a crocodile except for this bodily feature | the nose or snout | [ -0.015592773, 0.019883318, ..., 0.0033349802 ] |
3 | ANIMALS | Weighing around a ton, the eland is the largest species of this animal in Africa | Antelope | [ 0.014535263, -0.016103541, ..., -0.025882969 ] |
4 | ANIMALS | Heaviest of all poisonous snakes is this North American rattlesnake | the diamondback rattler | [ -0.0030859283, 0.015239313, ..., -0.021798335 ] |
5 | SCIENCE | 2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification | species | [ -0.0090561025, 0.011155112, ..., -0.023036297 ] |
6 | SCIENCE | A metal that is "ductile" can be pulled into this while cold & under pressure | wire | [ -0.02735741, 0.01199829, ..., 0.010396339 ] |
7 | SCIENCE | In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance | DNA | [ -0.014227471, 0.020493254, ..., -0.0027445166 ] |
8 | SCIENCE | Changes in the tropospheric layer of this are what gives us weather | the atmosphere | [ 0.009625228, 0.027518686, ..., -0.0068922946 ] |
9 | SCIENCE | In 70-degree air, a plane traveling at about 1,130 feet per second breaks it | Sound barrier | [ -0.0013459147, 0.0018580769, ..., -0.033439033 ] |
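Before you import a file like this, it can help to confirm that every record has the expected fields and that all of the vectors have the same number of dimensions. The following sketch assumes the field names shown in the table (`Category`, `Question`, `Answer`, `Vector`). The function `validate_records` is hypothetical, and the sample vectors are shortened, made-up values.

```python
REQUIRED_KEYS = {"Category", "Question", "Answer", "Vector"}

def validate_records(records):
    """Return the shared vector dimension, or raise ValueError on a bad record."""
    dimension = None
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record {index} is missing {sorted(missing)}")
        size = len(record["Vector"])
        if size == 0:
            raise ValueError(f"record {index} has an empty vector")
        if dimension is None:
            dimension = size  # The first record sets the expected dimension
        elif size != dimension:
            raise ValueError(
                f"record {index} has {size} dimensions, expected {dimension}"
            )
    return dimension

# Two sample records with shortened vectors (illustration only)
sample = [
    {"Category": "SCIENCE", "Question": "q0", "Answer": "Liver", "Vector": [0.1, 0.2, 0.3]},
    {"Category": "ANIMALS", "Question": "q1", "Answer": "Elephant", "Vector": [0.4, 0.5, 0.6]},
]
dimension = validate_records(sample)
```

A mismatch in dimensions usually means that some vectors came from a different model, which is the situation the vectorizer warning in this guide is meant to prevent.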
This batch import code imports the question objects, including their vectors. Batch import is more efficient than importing individual objects. You should use batch imports when you import large amounts of data.
- Python Client v4
- Python Client v3
- JS/TS Client v3
- JS/TS Client v2
import json
import requests
fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"  # This file includes pre-generated vectors
url = f"https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}"
resp = requests.get(url)
data = json.loads(resp.text)  # Load data
question_objs = list()
for i, d in enumerate(data):
    question_objs.append(wvc.data.DataObject(
        properties={
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        },
        vector=d["Vector"]
    ))
questions = client.collections.get("Question")
questions.data.insert_many(question_objs)  # This uses batching under the hood
import requests
fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"
url = f"https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}"
resp = requests.get(url)
data = json.loads(resp.text)
# Configure a batch process
client.batch.configure(batch_size=100) # Configure batch
with client.batch as batch:
    # Batch import all Questions
    for i, d in enumerate(data):
        print(f"importing question: {i+1}")
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        batch.add_data_object(properties, "Question", vector=d["Vector"])
npm install node-fetch
type JeopardyItem = {
Answer: string;
Question: string;
Category: string;
Vector: number[],
}
async function getJsonData(): Promise<JeopardyItem[]> {
const file = await fetch('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json');
return file.json() as unknown as JeopardyItem[];
}
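async function importQuestionsWithVectors() {
  // Get the questions directly from the URL
  const data = await getJsonData();
  // Map each item to the collection's property names and attach its pre-computed vector
  const objects = data.map((item) => ({
    properties: {
      answer: item.Answer,
      question: item.Question,
      category: item.Category,
    },
    vectors: item.Vector,
  }));
  const jeopardyCollection = client.collections.get('Question');
  const res = await jeopardyCollection.data.insertMany(objects)
  console.log(`Finished importing ${res.allResponses.length} objects.`);
}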
await importQuestionsWithVectors();
npm install node-fetch
type JeopardyItem = {
Answer: string;
Question: string;
Category: string;
Vector: number[],
}
async function getJsonData(): Promise<JeopardyItem[]> {
const file = await fetch('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json');
return file.json() as unknown as JeopardyItem[];
}
async function importQuestionsWithVectors() {
// Get the questions directly from the URL
const data = await getJsonData();
// Prepare a batcher. Even though this dataset is tiny, this is the best practice for import.
let batcher: ObjectsBatcher = client.batch.objectsBatcher();
let counter: number = 0;
let batchSize: number = 100;
for (const item of data) {
// Construct the object to add to the batch
const obj = {
class: "Question",
properties: {
answer: item.Answer,
question: item.Question,
category: item.Category,
},
vector: item.Vector,
}
// add the object to the batch queue
batcher = batcher.withObject(obj);
// When the batch counter reaches batchSize, push the objects to Weaviate
if (++counter % batchSize === 0) {
// Flush the batch queue and restart it
await batcher.do();
batcher = client.batch.objectsBatcher();
}
}
// Flush the remaining objects
await batcher.do();
console.log(`Finished importing ${counter} objects.`);
}
await importQuestionsWithVectors();
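The batching pattern that the import examples rely on can be sketched independently of any client: split the objects into fixed-size groups, send each group, and send whatever is left over at the end. The helper below is a generic illustration in plain Python. The name `chunked` is an assumption for this sketch and is not a Weaviate function.

```python
from itertools import islice

def chunked(items, batch_size):
    """Yield lists of at most batch_size items, including the final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Ten objects in batches of four: two full batches and one partial batch
batches = list(chunked(range(10), 4))
```

Each yielded list corresponds to one flush of the batch queue, so the last, smaller list plays the role of the "flush the remaining objects" step.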
Summary
Weaviate provides API integrations with many model providers. These providers can vectorize your data at import time. However, if you already have vectorized data, you can import the vectors when you import the underlying data objects.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.