
Bring your own vectors

Weaviate is a vector database. Vector databases store data objects and vectors that represent those objects. When you import a data object, you normally include a vector representation of that object as well. The vector representation is also called an "embedding."

Later, when you work with the stored vectors, Weaviate uses a vectorized version of your query to search the vector space.

This guide discusses importing data that already has vectors. An alternative approach is to use a vectorizer model that generates vectors at import time. Queries have to be vectorized using the same vectorizer as the data objects. When you define your collection schema, be sure to define the same vectorizer that you use to vectorize the data.
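
To make the idea of searching a vector space concrete, here is a minimal sketch in plain Python. It is not Weaviate's implementation: a vector database would normally use an index instead of scanning every vector. The helper names (`cosine_similarity`, `nearest`) and the three-dimensional toy vectors are assumptions made for illustration only. Real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, stored, k=1):
    """Return the keys of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda name: cosine_similarity(query, stored[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (hypothetical values, not produced by any real model)
stored_vectors = {
    "liver": [0.9, 0.1, 0.0],
    "elephant": [0.1, 0.9, 0.1],
    "wire": [0.0, 0.2, 0.9],
}
query_vector = [0.2, 0.8, 0.0]

top = nearest(query_vector, stored_vectors, k=2)
print(top)
```

The ranking is only meaningful when the query vector and the stored vectors come from the same model, which is why the query has to use the same vectorizer as the data.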

Weaviate supports embeddings generated by popular third-party services and custom vector embeddings that you create. For details on how to connect your client application to a vectorizer, see the following pages:

Guide outline

This guide uses a free sandbox in Weaviate Cloud. If you don't have a Weaviate Cloud account, you can create one or else follow along on a Weaviate instance of your own.

Follow these steps to import data that includes vectors.

  1. Create a sandbox instance
  2. Connect to your sandbox
  3. Prepare the collection
  4. Import the data

Create a sandbox instance

Follow these steps to create a sandbox instance.

Log in to Weaviate Cloud

  1. Open the WCD login page in a browser.
  2. Click Login to Weaviate Cloud Services.
  3. Enter your email address and password to authenticate.

Create a sandbox cluster

To create a cluster, click the 'Create cluster' button on the WCD Dashboard page.

  1. Select the "Free sandbox" tab.
  2. Give your cluster a name. WCD adds a random suffix to sandbox cluster names to ensure uniqueness.
  3. Verify that "Enable Authentication?" is set to "Yes".
  4. Click Create.

It takes a minute or two to create the new cluster. When the cluster is ready, WCD displays a check mark (✔️) next to the cluster name.

Connect to your sandbox

Follow these steps to connect to your sandbox instance.

Get the connection details

To connect to your sandbox instance, you need the following information:

  • The Weaviate sandbox URL
  • The Weaviate sandbox API key

To get the cluster URL and authentication details, follow these steps:

  1. Click the Details button to open the Details panel.
  2. Copy the REST endpoint.
  3. To get the API keys, click the API keys button. Copy the Weaviate API key.

Connect to a Weaviate instance

To connect to your sandbox, edit this sample code to use your Weaviate sandbox URL and your Weaviate API key. You will use this client later when you create a schema and when you upload your data.

This guide demonstrates how to import data objects that already have vectors. If you import data that has to be vectorized, your client connection code should also include the API key for the vectorizer API.

import weaviate, os
import weaviate.classes as wvc

# Set these environment variables
URL = os.getenv("WCS_URL")
APIKEY = os.getenv("WCS_API_KEY")

# Connect to Weaviate Cloud
client = weaviate.connect_to_wcs(
    cluster_url=URL,
    auth_credentials=weaviate.auth.AuthApiKey(APIKEY),
)

# Check connection: prints True when the instance is ready
print(client.is_ready())
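
`os.getenv` returns `None` when a variable is not set, so a missing `WCS_URL` or `WCS_API_KEY` only shows up later as a connection error. As a minimal sketch, a check like the following could run before you connect. The helper name `missing_env_vars` and the example values are hypothetical and not part of the Weaviate client.

```python
import os

REQUIRED_VARS = ("WCS_URL", "WCS_API_KEY")

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping so the check is easy to try out.
# In real use, call missing_env_vars(REQUIRED_VARS) to inspect os.environ.
example_env = {"WCS_URL": "https://example.invalid", "WCS_API_KEY": ""}
missing = missing_env_vars(REQUIRED_VARS, example_env)
print(missing)
```

If the returned list is not empty, stop and set the listed variables before you create the client.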

Prepare the collection

Weaviate stores data in collections. Each object has a set of properties and a vector representation. Before you import data, you should create a collection schema to define collection properties and the vectorizer. Vectorizers are defined at the collection level. You can also define specific details at the property level.

When you import data objects that don't have vectors, Weaviate uses the vectorizer details to create vectors at import time. If the imported data objects do have vectors, Weaviate uses those vectors instead of generating new ones.

The vectorizer that you specify in the collection schema must be the same vectorizer that you use to vectorize the data objects. If the vectorizers are different, vector search results are meaningless.

Create a collection schema

Collection schemas are highly configurable. If you don't provide values for all of the parameters, the auto-schema feature attempts to provide the missing values during import.

In this example, the vectors are generated by an OpenAI model, ada-002. Weaviate provides an OpenAI integration, text2vec-openai, that can access ada-002. The schema definition specifies the text2vec-openai vectorizer so that queries and any additional inserts use the same vectorizer.

Objects have vectors and properties. Be careful not to specify your vectors as properties. Weaviate processes vector embeddings and properties differently.
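
One way to keep vectors out of the properties is to split each raw record before you build the import object. The sketch below is a plain-Python illustration. The helper name `split_record` and the toy record are assumptions. The key names `Category`, `Question`, `Answer`, and `vector` follow the sample data file used later in this guide.

```python
def split_record(record, vector_key="vector"):
    """Separate a raw record into (properties, vector) without mutating it."""
    # Lower-case the remaining keys so they match the property names
    # used in the collection schema (question, answer, category).
    properties = {
        key.lower(): value
        for key, value in record.items()
        if key != vector_key
    }
    vector = record[vector_key]
    return properties, vector

# Toy record with a shortened, made-up vector
raw = {
    "Category": "SCIENCE",
    "Question": "toy question",
    "Answer": "Liver",
    "vector": [0.1, 0.2, 0.3],
}

properties, vector = split_record(raw)
print(sorted(properties))
print(vector)
```

The properties dictionary and the vector are then passed to the client separately, so the embedding is never stored as an ordinary property.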

# Set these environment variables
# WCS_URL - The URL for your Weaviate instance
# WCS_API_KEY - The API key for your Weaviate instance
# OPENAI_API_KEY - The API key for your OpenAI account

import weaviate, os
import weaviate.classes as wvc

# The with-as context manager closes the connection when your code exits
with weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCS_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCS_API_KEY")),
) as client:
    questions = client.collections.create(
        name="Question",
        properties=[
            wvc.config.Property(
                name="question",
                description="What to ask",
                data_type=wvc.config.DataType.TEXT,
                tokenization=wvc.config.Tokenization.WORD,
                index_searchable=True,
                index_filterable=True,
            ),
            wvc.config.Property(
                name="answer",
                description="The clue",
                data_type=wvc.config.DataType.TEXT,
                tokenization=wvc.config.Tokenization.WORD,
                index_searchable=True,
                index_filterable=True,
            ),
            wvc.config.Property(
                name="category",
                description="The subject",
                data_type=wvc.config.DataType.TEXT,
                tokenization=wvc.config.Tokenization.WORD,
                index_searchable=True,
                index_filterable=True,
            ),
        ],
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
        generative_config=wvc.config.Configure.Generative.openai(),
    )

Import the data

The example data is based on a set of ten questions from the "Jeopardy!" television program. Each of the data objects contains the following elements:

  • A vector embedding
  • A question
  • A category
  • An answer

You can use any vectorizer. In this example, OpenAI is the vectorizer. The question, answer, and category are the raw data. The vector embedding is the vector OpenAI returns after processing the raw data.

Create a JSON-formatted data file to use in your import. In this example, the data import file is already prepared. The JSON file encodes this data:

# | Category | Question | Answer | Vector
0 | SCIENCE | This organ removes excess glucose from the blood & stores it as glycogen | Liver | [ -0.006632288, -0.0042016874, ..., -0.020163147 ]
1 | ANIMALS | It's the only living mammal in the order Proboseidea | Elephant | [ -0.0166891, -0.00092290324, ..., -0.032253385 ]
2 | ANIMALS | The gavial looks very much like a crocodile except for this bodily feature | the nose or snout | [ -0.015592773, 0.019883318, ..., 0.0033349802 ]
3 | ANIMALS | Weighing around a ton, the eland is the largest species of this animal in Africa | Antelope | [ 0.014535263, -0.016103541, ..., -0.025882969 ]
4 | ANIMALS | Heaviest of all poisonous snakes is this North American rattlesnake | the diamondback rattler | [ -0.0030859283, 0.015239313, ..., -0.021798335 ]
5 | SCIENCE | 2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification | species | [ -0.0090561025, 0.011155112, ..., -0.023036297 ]
6 | SCIENCE | A metal that is "ductile" can be pulled into this while cold & under pressure | wire | [ -0.02735741, 0.01199829, ..., 0.010396339 ]
7 | SCIENCE | In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance | DNA | [ -0.014227471, 0.020493254, ..., -0.0027445166 ]
8 | SCIENCE | Changes in the tropospheric layer of this are what gives us weather | the atmosphere | [ 0.009625228, 0.027518686, ..., -0.0068922946 ]
9 | SCIENCE | In 70-degree air, a plane traveling at about 1,130 feet per second breaks it | Sound barrier | [ -0.0013459147, 0.0018580769, ..., -0.033439033 ]
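
Before you import a file like this, it can help to confirm that every record carries a vector and that all vectors have the same length, because vectors produced by one model share a single dimensionality. The following is a minimal sketch, not part of the Weaviate client. The helper name `check_vectors` and the shortened toy records are assumptions, and the real vectors are much longer than three values.

```python
import json

def check_vectors(records, vector_key="vector"):
    """Return the shared vector length, or raise ValueError on a problem."""
    lengths = set()
    for i, record in enumerate(records):
        vector = record.get(vector_key)
        if not isinstance(vector, list) or not vector:
            raise ValueError(f"record {i} has no usable vector")
        for component in vector:
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(component, bool) or not isinstance(component, (int, float)):
                raise ValueError(f"record {i} has a non-numeric component")
        lengths.add(len(vector))
    if len(lengths) != 1:
        raise ValueError(f"expected one vector length, found: {sorted(lengths)}")
    return lengths.pop()

# Toy stand-in for the data file, with made-up three-dimensional vectors
sample_json = """
[
  {"Category": "SCIENCE", "Question": "toy question 1", "Answer": "Liver", "vector": [0.1, -0.2, 0.3]},
  {"Category": "ANIMALS", "Question": "toy question 2", "Answer": "Elephant", "vector": [0.0, 0.5, -0.1]}
]
"""
records = json.loads(sample_json)
dimension = check_vectors(records)
print(dimension)
```

A mismatch in lengths usually means that the records were embedded with different models, which would make search results unreliable.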

This batch import code imports the question objects, including their vectors. Batch import is more efficient than importing individual objects. You should use batch imports when you import large amounts of data.

import json
import requests

# This code reuses `client` and `wvc` from the connection code above

fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"  # This file includes pre-generated vectors
url = f"https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}"
resp = requests.get(url)
data = json.loads(resp.text)  # Load data

question_objs = []
for d in data:
    # Properties and the vector are passed separately
    question_objs.append(wvc.data.DataObject(
        properties={
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        },
        vector=d["vector"]
    ))

questions = client.collections.get("Question")
questions.data.insert_many(question_objs)  # This uses batching under the hood
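
For a dataset much larger than ten objects, one option is to split the list of objects into fixed-size chunks and run the import once per chunk. The sketch below shows only the chunking step in plain Python. The helper name `chunked` and the chunk size are assumptions, not Weaviate settings.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in for a list of data objects
objects = list(range(10))
batches = list(chunked(objects, 4))
print([len(batch) for batch in batches])
```

Under these assumptions, each chunk could be passed to `questions.data.insert_many` in turn. The right chunk size depends on the size of your objects and vectors.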

Summary

Weaviate provides API integrations with many model providers. These providers can vectorize your data at import time. However, if you already have vectorized data, you can import the vectors when you import the underlying data objects.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.