Schema (collection definitions)

Overview

This tutorial will guide you through the process of defining a schema for your data, including commonly used settings and key considerations.

Prerequisites
  • (Recommended) Complete the Quickstart tutorial.
  • A Weaviate instance with an administrator API key.
  • Install your preferred Weaviate client library.

Schema: An Introduction

The database schema defines how data is stored, organized and retrieved in Weaviate.

Define a collection before importing data. Weaviate cannot import data if the collection is undefined. If auto-schema is enabled, Weaviate can infer missing elements and add them to the collection definition. However, it is a best practice to manually define as much of the schema as possible since manual definition gives you the most control.

Let's begin with a simple example before diving into the details.

Basic schema creation

This example creates a collection called Question. The collection has three properties: question, answer, and category. The definition specifies the text2vec-openai vectorizer and the generative-cohere module for RAG.

import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    questions = client.collections.create(
        name="Question",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Set the vectorizer to "text2vec-openai" to use the OpenAI API for vector-related operations
        generative_config=wvc.config.Configure.Generative.cohere(),  # Set the generative module to "generative-cohere" to use the Cohere API for RAG
        properties=[
            wvc.config.Property(
                name="question",
                data_type=wvc.config.DataType.TEXT,
            ),
            wvc.config.Property(
                name="answer",
                data_type=wvc.config.DataType.TEXT,
            ),
            wvc.config.Property(
                name="category",
                data_type=wvc.config.DataType.TEXT,
            )
        ]
    )

    print(questions.config.get(simple=False))

finally:
    client.close()

The returned configuration looks similar to this:

{
  "classes": [
    {
      "class": "Question",
      "description": "Information from a Jeopardy! question",
      "invertedIndexConfig": {
        "bm25": {
          "b": 0.75,
          "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
          "additions": null,
          "preset": "en",
          "removals": null
        }
      },
      "moduleConfig": {
        "text2vec-openai": {
          "model": "ada",
          "modelVersion": "002",
          "type": "text",
          "vectorizeClassName": true
        }
      },
      "properties": [
        {
          "dataType": [
            "text"
          ],
          "description": "The question",
          "moduleConfig": {
            "text2vec-openai": {
              "skip": false,
              "vectorizePropertyName": false
            }
          },
          "name": "question",
          "tokenization": "word"
        },
        {
          "dataType": [
            "text"
          ],
          "description": "The answer",
          "moduleConfig": {
            "text2vec-openai": {
              "skip": false,
              "vectorizePropertyName": false
            }
          },
          "name": "answer",
          "tokenization": "word"
        },
        {
          "dataType": [
            "text"
          ],
          "description": "The category",
          "moduleConfig": {
            "text2vec-openai": {
              "skip": false,
              "vectorizePropertyName": false
            }
          },
          "name": "category",
          "tokenization": "word"
        }
      ],
      "replicationConfig": {
        "factor": 1
      },
      "shardingConfig": {
        "virtualPerPhysical": 128,
        "desiredCount": 1,
        "actualCount": 1,
        "desiredVirtualCount": 128,
        "actualVirtualCount": 128,
        "key": "_id",
        "strategy": "hash",
        "function": "murmur3"
      },
      "vectorIndexConfig": {
        "skip": false,
        "cleanupIntervalSeconds": 300,
        "maxConnections": 32,
        "efConstruction": 128,
        "ef": -1,
        "dynamicEfMin": 100,
        "dynamicEfMax": 500,
        "dynamicEfFactor": 8,
        "vectorCacheMaxObjects": 1000000000000,
        "flatSearchCutoff": 40000,
        "distance": "cosine"
      },
      "vectorIndexType": "hnsw",
      "vectorizer": "text2vec-openai"
    }
  ]
}

Although we specified only the collection name, the vectorizer, the generative module, and the properties, the returned schema includes much more information.

This is because Weaviate fills in every setting that was not specified with a default value. Each of these options can also be specified manually at collection creation time.

FAQ: Are schemas mutable?

Yes, to an extent. There are no restrictions against adding new collections or properties. However, not all settings are mutable within existing collections. For example, you cannot change the vectorizer or the generative module. You can read more about this in the schema reference.

Schemas in detail

Conceptually, it may be useful to think of each Weaviate instance as consisting of multiple collections, each of which is a set of objects that share a common structure.

For example, you might have a movie database with Movie and Actor collections, each with their own properties. Or you might have a news database with Article, Author and Publication collections.

Available settings

For the most part, each collection should be thought of as isolated from the others (in fact, they are!). Accordingly, they can be configured independently. Each collection has:

  • A set of properties specifying the object data structure.
  • Multi-tenancy settings.
  • Vectorizer and generative modules.
  • Index settings (for vector and inverted indexes).
  • Replication and sharding settings.

And depending on your needs, you might want to change any number of these.

Properties

Each property has a number of settings that can be configured, such as the dataType, tokenization, and vectorizePropertyName. You can read more about these in the schema reference.

So for example, you might specify a schema like the one below, with additional options for the question and answer properties:

import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    questions = client.collections.create(
        name="Question",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Set the vectorizer to "text2vec-openai" to use the OpenAI API for vector-related operations
        generative_config=wvc.config.Configure.Generative.cohere(),  # Set the generative module to "generative-cohere" to use the Cohere API for RAG
        properties=[
            wvc.config.Property(
                name="question",
                data_type=wvc.config.DataType.TEXT,
                vectorize_property_name=True,  # Include the property name ("question") when vectorizing
                tokenization=wvc.config.Tokenization.LOWERCASE  # Use "lowercase" tokenization
            ),
            wvc.config.Property(
                name="answer",
                data_type=wvc.config.DataType.TEXT,
                vectorize_property_name=False,  # Skip the property name ("answer") when vectorizing
                tokenization=wvc.config.Tokenization.WHITESPACE  # Use "whitespace" tokenization
            ),
        ]
    )

finally:
    client.close()
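
To make the tokenization setting more concrete, the following is a rough, pure-Python approximation of how the word, lowercase, whitespace and field options split a text value. It does not call Weaviate. The tokenize helper is hypothetical, and the server's actual behavior may differ in edge cases, so treat it as a sketch only.

```python
import re


def tokenize(text: str, method: str) -> list[str]:
    """Approximate the named tokenization option (illustration only)."""
    if method == "word":
        # Split on any non-alphanumeric character and lowercase the tokens.
        return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    if method == "lowercase":
        # Split on whitespace only and lowercase the tokens.
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only and keep the original case.
        return text.split()
    if method == "field":
        # Treat the whole (trimmed) value as a single token.
        return [text.strip()]
    raise ValueError(f"unknown tokenization method: {method}")


sample = "Hello, (beautiful) World"
for method in ("word", "lowercase", "whitespace", "field"):
    print(method, tokenize(sample, method))
```

The choice matters mostly for keyword search and filtering: with whitespace tokenization a filter for "world" would not match the token "World", while with word or lowercase tokenization case is ignored.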

Cross-references

This is also where you would specify cross-references, which are a special type of property that links to another collection.

Cross-references can be very useful for creating relationships between objects. For example, you might have a Movie collection with a withActor cross-reference property that points to the Actor collection. This will allow you to retrieve relevant actors for each movie.

However, cross-references can be costly in terms of performance. Use them sparingly. Additionally, cross-reference properties do not affect the object's vector. So if you want the related properties to be considered in a vector search, they should be included in the object's vectorized properties.

You can find examples of how to define and use cross-references here.
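
Because a cross-reference does not contribute to the object's vector, one common workaround is to copy the related text into a regular property before import. The sketch below shows the idea with plain Python dictionaries. It does not call Weaviate, and the names build_movie_properties, actor_ids and actorNames are hypothetical, chosen only for illustration.

```python
# Hypothetical source data: actors keyed by an ID, and a movie that refers to them.
actors = {
    "actor-1": {"name": "Example Actor One"},
    "actor-2": {"name": "Example Actor Two"},
}
movie = {"title": "Example Movie", "actor_ids": ["actor-1", "actor-2"]}


def build_movie_properties(movie: dict, actors: dict) -> dict:
    """Return the properties to import, with actor names copied into a text field."""
    names = [actors[actor_id]["name"] for actor_id in movie["actor_ids"]]
    return {
        "title": movie["title"],
        # A plain text property can be vectorized, unlike the cross-reference itself.
        "actorNames": ", ".join(names),
    }


print(build_movie_properties(movie, actors))
```

The trade-off is duplicated data: if an actor's name changes, every movie object that copied it needs to be updated as well.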

Vectorizer and generative modules

Each collection can be configured with a vectorizer and a generative module. The vectorizer generates vectors for each object and for any un-vectorized queries. The generative module performs retrieval-augmented generation (RAG) queries.

These settings are currently immutable once the collection is created. Accordingly, you should choose the vectorizer and generative module carefully.

If you are not sure where to start, modules that integrate with popular API-based model providers such as Cohere or OpenAI are good starting points. You can find a list of available model integrations here.

Multi-tenancy settings

Starting from v1.20.0, each collection can be configured as a multi-tenancy collection. This allows separation of data between tenants, typically end-users, at a much lower overhead than creating separate collections for each tenant.

This is useful if you want to use Weaviate as a backend for a multi-tenant (e.g. SaaS) application, or if data isolation is required for any other reason.

import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    questions = client.collections.create(
        name="Question",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Set the vectorizer to "text2vec-openai" to use the OpenAI API for vector-related operations
        generative_config=wvc.config.Configure.Generative.cohere(),  # Set the generative module to "generative-cohere" to use the Cohere API for RAG
        properties=[
            wvc.config.Property(
                name="question",
                data_type=wvc.config.DataType.TEXT,
            ),
            wvc.config.Property(
                name="answer",
                data_type=wvc.config.DataType.TEXT,
            ),
        ],
        multi_tenancy_config=wvc.config.Configure.multi_tenancy(enabled=True),  # Enable multi-tenancy
    )

finally:
    client.close()

Index settings

Weaviate uses two types of indexes: vector indexes and inverted indexes. Vector indexes are used to store and organize vectors for fast vector similarity-based searches. Inverted indexes are used to store data for fast filtering and keyword searches.

The default vector index type is HNSW. The other options are flat, which is suitable for small collections such as those in a multi-tenancy collection, and dynamic, which starts as a flat index and switches to an HNSW index if its size grows beyond a predetermined threshold.

import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    questions = client.collections.create(
        name="Question",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Set the vectorizer to "text2vec-openai" to use the OpenAI API for vector-related operations
        generative_config=wvc.config.Configure.Generative.cohere(),  # Set the generative module to "generative-cohere" to use the Cohere API for RAG
        properties=[
            wvc.config.Property(
                name="question",
                data_type=wvc.config.DataType.TEXT,
            ),
            wvc.config.Property(
                name="answer",
                data_type=wvc.config.DataType.TEXT,
            ),
        ],
        # Configure the vector index
        vector_index_config=wvc.config.Configure.VectorIndex.hnsw(  # Or `flat` or `dynamic`
            distance_metric=wvc.config.VectorDistances.COSINE,
            quantizer=wvc.config.Configure.VectorIndex.Quantizer.bq(),
        ),
        # Configure the inverted index
        inverted_index_config=wvc.config.Configure.inverted_index(
            index_null_state=True,
            index_property_length=True,
            index_timestamps=True,
        ),
    )

finally:
    client.close()
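
To illustrate what the two kinds of index do conceptually, here is a small pure-Python sketch. It is not how Weaviate implements either index: flat_search is a hypothetical brute-force search that compares the query with every stored vector, which is the idea behind a flat index, and inverted_index is a toy token-to-object mapping for keyword lookups. The objects and vectors are made up for the example.

```python
import math
from collections import defaultdict


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 0 for identical directions, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Made-up objects, each with a text property and a tiny 2-dimensional vector.
objects = {
    "obj1": {"text": "cats purr", "vector": [1.0, 0.0]},
    "obj2": {"text": "dogs bark", "vector": [0.0, 1.0]},
    "obj3": {"text": "cats and dogs", "vector": [0.7, 0.7]},
}


def flat_search(query_vector: list[float], limit: int = 2) -> list[str]:
    """Brute-force vector search: rank every object by distance to the query."""
    ranked = sorted(
        objects,
        key=lambda oid: cosine_distance(query_vector, objects[oid]["vector"]),
    )
    return ranked[:limit]


# Toy inverted index: map each token to the set of objects that contain it.
inverted_index: dict[str, set[str]] = defaultdict(set)
for oid, obj in objects.items():
    for token in obj["text"].lower().split():
        inverted_index[token].add(oid)

print(flat_search([0.9, 0.1]))         # ['obj1', 'obj3']
print(sorted(inverted_index["cats"]))  # ['obj1', 'obj3']
```

A brute-force scan like this touches every vector on each query, which is acceptable for small collections. An approximate index such as HNSW avoids the full scan, at the cost of extra memory and build time.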

Replication and sharding settings

Replication

Replication settings determine how many copies of the data are stored. For example, a replication setting of 3 means that each object is stored on 3 different replicas. This is important for providing redundancy and fault tolerance in production. (The default replication factor is 1.)

This goes hand-in-hand with consistency settings, which determine how many replicas must respond before an operation is considered successful.

We recommend that you read the concepts page on replication for information on how replication works in Weaviate. To specify a replication factor, follow this how-to.
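
As a worked illustration of how the replication factor and the consistency level interact, the sketch below computes how many replicas must acknowledge an operation for the ONE, QUORUM and ALL levels. The replicas_required helper is hypothetical and does not call Weaviate. It assumes the usual majority definition of a quorum (more than half of the replicas).

```python
def replicas_required(replication_factor: int, consistency_level: str) -> int:
    """Return how many replicas must respond for the operation to succeed."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        # A majority: more than half of the replicas.
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {consistency_level}")


for level in ("ONE", "QUORUM", "ALL"):
    print(level, replicas_required(3, level))
```

With a replication factor of 3, QUORUM needs 2 responses, so the collection can tolerate one unavailable replica. With the default factor of 1, every level needs the single replica to respond.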

Sharding

Sharding settings determine how each collection is sharded and distributed across nodes. This is not a setting that is typically changed, but you can use it to control how many shards are created in a cluster, and how many virtual shards are created per physical shard (read more here).
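
The relationship between object IDs, virtual shards and physical shards can be sketched as follows. This is a simplified illustration rather than Weaviate's actual algorithm: it uses SHA-256 from the standard library as a stand-in for the murmur3 hash function named in the returned schema shown earlier, and it assumes virtual shards are assigned to physical shards round-robin. The physical_shard_for helper is hypothetical.

```python
import hashlib
import uuid
from collections import Counter

VIRTUAL_PER_PHYSICAL = 128  # Same value as "virtualPerPhysical" in the example schema


def physical_shard_for(object_id: str, desired_count: int) -> int:
    """Map an object ID to a physical shard through a virtual shard."""
    total_virtual = desired_count * VIRTUAL_PER_PHYSICAL
    # Stand-in hash (assumption): SHA-256 instead of murmur3.
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    virtual_shard = int.from_bytes(digest[:8], "big") % total_virtual
    # Assumed round-robin assignment of virtual shards to physical shards.
    return virtual_shard % desired_count


# Deterministic example IDs so that repeated runs give the same distribution.
ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, f"object-{i}")) for i in range(1000)]
distribution = Counter(physical_shard_for(object_id, 3) for object_id in ids)
print(dict(distribution))
```

Because the shard is derived from a hash of the ID, the same object always lands on the same shard, and a large set of IDs spreads roughly evenly across the physical shards.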

Notes

Collection & property names

Collection names always start with a capital letter. Property names always begin with a lowercase letter. You can use PascalCase for collection names, and property names may contain underscores. Read more here.
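
A quick way to catch naming mistakes before sending a definition to Weaviate is a local check like the one below. The regular expressions are assumptions derived only from the conventions stated in this section, not the server's authoritative validation rules, and both helper functions are hypothetical.

```python
import re

# Assumed patterns, based on the conventions described in this section.
COLLECTION_NAME = re.compile(r"^[A-Z][_0-9A-Za-z]*$")  # Starts with a capital letter
PROPERTY_NAME = re.compile(r"^[a-z][_0-9A-Za-z]*$")    # Starts with a lowercase letter


def is_valid_collection_name(name: str) -> bool:
    """Check that a collection name starts with a capital letter."""
    return COLLECTION_NAME.match(name) is not None


def is_valid_property_name(name: str) -> bool:
    """Check that a property name starts with a lowercase letter."""
    return PROPERTY_NAME.match(name) is not None


for candidate in ("Question", "question", "JeopardyQuestion"):
    print(candidate, is_valid_collection_name(candidate))
for candidate in ("answer", "with_actor", "Answer"):
    print(candidate, is_valid_property_name(candidate))
```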

The following resources include more detailed information on schema settings and how to use them:

Questions and feedback

If you have any questions or feedback, let us know in the user forum.