Schema (collection definitions)
Overview
This tutorial will guide you through the process of defining a schema for your data, including commonly used settings and key considerations.
- (Recommended) Complete the Quickstart tutorial.
- A Weaviate instance with an administrator API key.
- Install your preferred Weaviate client library.
Schema: An Introduction
What is a schema?
The database schema defines how data is stored, organized and retrieved in Weaviate.
A schema must be defined before data can be imported. We generally recommend defining as much of the schema as possible manually, although Weaviate can also infer the schema during import if the auto-schema feature is enabled.
Let's begin with a simple example before diving into the details.
Basic schema creation
This example creates a simple collection called Question with three properties (answer, question, and category), the text2vec-openai vectorizer, and the generative-cohere module for RAG. It then retrieves the schema and displays it.
The returned configuration should look something like this:
See the returned schema
Note: results will vary depending on your client library.
{
"classes": [
{
"class": "Question",
"description": "Information from a Jeopardy! question",
"invertedIndexConfig": {
"bm25": {
"b": 0.75,
"k1": 1.2
},
"cleanupIntervalSeconds": 60,
"stopwords": {
"additions": null,
"preset": "en",
"removals": null
}
},
"moduleConfig": {
"text2vec-openai": {
"model": "ada",
"modelVersion": "002",
"type": "text",
"vectorizeClassName": true
}
},
"properties": [
{
"dataType": [
"text"
],
"description": "The question",
"moduleConfig": {
"text2vec-openai": {
"skip": false,
"vectorizePropertyName": false
}
},
"name": "question",
"tokenization": "word"
},
{
"dataType": [
"text"
],
"description": "The answer",
"moduleConfig": {
"text2vec-openai": {
"skip": false,
"vectorizePropertyName": false
}
},
"name": "answer",
"tokenization": "word"
},
{
"dataType": [
"text"
],
"description": "The category",
"moduleConfig": {
"text2vec-openai": {
"skip": false,
"vectorizePropertyName": false
}
},
"name": "category",
"tokenization": "word"
}
],
"replicationConfig": {
"factor": 1
},
"shardingConfig": {
"virtualPerPhysical": 128,
"desiredCount": 1,
"actualCount": 1,
"desiredVirtualCount": 128,
"actualVirtualCount": 128,
"key": "_id",
"strategy": "hash",
"function": "murmur3"
},
"vectorIndexConfig": {
"skip": false,
"cleanupIntervalSeconds": 300,
"maxConnections": 32,
"efConstruction": 128,
"ef": -1,
"dynamicEfMin": 100,
"dynamicEfMax": 500,
"dynamicEfFactor": 8,
"vectorCacheMaxObjects": 1000000000000,
"flatSearchCutoff": 40000,
"distance": "cosine"
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-openai"
}
]
}
Although we only specified the collection name and properties, the returned schema includes much more information.
This is because Weaviate infers the schema based on the data and defaults. Each of these options can be specified manually at collection creation time.
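To see which settings were filled in by defaults, it can help to compare what was specified with what came back. The following is a minimal sketch in plain Python that needs no Weaviate connection. The `returned` value is an abbreviated stand-in for the response above, and the `specified` set is a hypothetical list of the top-level settings supplied at creation time.

```python
import json

# Abbreviated stand-in for the returned schema above (values copied from it).
returned = json.loads("""
{
  "class": "Question",
  "properties": [
    {"name": "question", "dataType": ["text"], "tokenization": "word"},
    {"name": "answer", "dataType": ["text"], "tokenization": "word"},
    {"name": "category", "dataType": ["text"], "tokenization": "word"}
  ],
  "replicationConfig": {"factor": 1},
  "vectorIndexType": "hnsw",
  "vectorizer": "text2vec-openai"
}
""")

# Hypothetical: the top-level settings supplied when the collection was created.
specified = {"class", "properties", "vectorizer"}

# Any other top-level key was filled in from defaults.
defaulted = sorted(key for key in returned if key not in specified)
property_names = [prop["name"] for prop in returned["properties"]]

print("Filled in by defaults:", defaulted)
print("Properties:", property_names)
```

The same comparison works on a full response, where the list of defaulted keys is considerably longer.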
An existing schema can be changed, but only to an extent. There are no restrictions on adding new collections or properties. However, not all settings are mutable within existing collections. For example, you cannot change the vectorizer or the generative module. You can read more about this in the schema reference.
Schemas in detail
Conceptually, it may be useful to think of each Weaviate instance as comprising multiple collections, each of which is a set of objects that share a common structure.
For example, you might have a movie database with Movie and Actor collections, each with their own properties. Or you might have a news database with Article, Author and Publication collections.
Available settings
For the most part, each collection should be thought of as isolated from the others (in fact, they are!). Accordingly, they can be configured independently. Each collection has:
- A set of properties specifying the object data structure.
- Multi-tenancy settings.
- Vectorizer and generative modules.
- Index settings (for vector and inverted indexes).
- Replication and sharding settings.
And depending on your needs, you might want to change any number of these.
Properties
Each property has a number of settings that can be configured, such as the dataType, tokenization, and vectorizePropertyName. You can read more about these in the schema reference.
For example, you might specify a schema like the one below, with additional options for the question and answer properties:
- Python Client v4
- Python Client v3
- JS/TS Client v3
- JS/TS Client v2
import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    questions = client.collections.create(
        name="Question",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Set the vectorizer to "text2vec-openai" to use the OpenAI API for vector-related operations
        generative_config=wvc.config.Configure.Generative.cohere(),  # Set the generative module to "generative-cohere" to use the Cohere API for RAG
        properties=[
            wvc.config.Property(
                name="question",
                data_type=wvc.config.DataType.TEXT,
                vectorize_property_name=True,  # Include the property name ("question") when vectorizing
                tokenization=wvc.config.Tokenization.LOWERCASE  # Use "lowercase" tokenization
            ),
            wvc.config.Property(
                name="answer",
                data_type=wvc.config.DataType.TEXT,
                vectorize_property_name=False,  # Skip the property name ("answer") when vectorizing
                tokenization=wvc.config.Tokenization.WHITESPACE  # Use "whitespace" tokenization
            ),
        ]
    )
finally:
    client.close()  # Always close the connection, even if creation fails
import weaviate
import json
client = weaviate.Client("https://WEAVIATE_INSTANCE_URL/") # Replace with your Weaviate endpoint
# we will create the class "Question"
class_obj = {
"class": "Question",
"description": "Information from a Jeopardy! question", # description of the class
"vectorizer": "text2vec-openai",
"moduleConfig": {
"generative-openai": {} # Set `generative-openai` as the generative module
},
"properties": [
{
"name": "question",
"dataType": ["text"],
"description": "The question",
"tokenization": "lowercase", # Use "lowercase" tokenization (a property-level setting)
"moduleConfig": {
"text2vec-openai": { # this must match the vectorizer used
"vectorizePropertyName": True
}
}
},
{
"name": "answer",
"dataType": ["text"],
"description": "The answer",
"tokenization": "whitespace", # Use "whitespace" tokenization (a property-level setting)
"moduleConfig": {
"text2vec-openai": { # this must match the vectorizer used
"vectorizePropertyName": False
}
}
},
],
}
# add the schema
client.schema.create_class(class_obj)
import weaviate, { WeaviateClient } from 'weaviate-client';
const client: WeaviateClient = await weaviate.connectToWeaviateCloud(
'WEAVIATE_INSTANCE_URL', { // Replace WEAVIATE_INSTANCE_URL with your instance URL
authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_API_KEY'),
}
)
// Define the 'Question' collection
const collectionObj = {
name: 'Question',
properties: [
{
name: 'question',
dataType: 'text' as const,
description: 'The question',
tokenization: 'lowercase' as const,
vectorizePropertyName: true,
},
{
name: 'answer',
dataType: 'text' as const,
description: 'The answer',
tokenization: 'whitespace' as const,
vectorizePropertyName: false,
}
],
vectorizers: weaviate.configure.vectorizer.text2VecOpenAI(),
generative: weaviate.configure.generative.openAI()
}
// Add the class to the schema
const newCollection = await client.collections.create(collectionObj)
import weaviate from 'weaviate-ts-client';
const client = weaviate.client({
scheme: 'https',
host: 'WEAVIATE_INSTANCE_URL', // Replace with your Weaviate endpoint
});
// Define the 'Question' class
const classObj = {
class: 'Question',
description: 'Information from a Jeopardy! question', // description of the class
vectorizer: 'text2vec-openai',
moduleConfig: {
'generative-openai': {} // Set `generative-openai` as the generative module
},
properties: [
{
name: 'question',
dataType: ['text'],
description: 'The question',
tokenization: 'lowercase', // Use "lowercase" tokenization (a property-level setting)
moduleConfig: {
'text2vec-openai': { // this must match the vectorizer used
vectorizePropertyName: true,
},
}
},
{
name: 'answer',
dataType: ['text'],
description: 'The answer',
tokenization: 'whitespace', // Use "whitespace" tokenization (a property-level setting)
moduleConfig: {
'text2vec-openai': { // this must match the vectorizer used
vectorizePropertyName: false,
},
}
},
]
};
// Add the class to the schema
await client
.schema
.classCreator()
.withClass(classObj)
.do();
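To make the effect of the tokenization setting more concrete, the sketch below approximates how the lowercase and whitespace options used above split a string, alongside the word option seen in the returned schema and a field option that keeps the whole value. This is an illustration in plain Python, not Weaviate's implementation. The `tokenize` helper is hypothetical, and the word branch only handles ASCII letters and digits.

```python
import re


def tokenize(text: str, method: str) -> list[str]:
    """Approximate the named tokenization option (illustration only)."""
    if method == "word":
        # Keep runs of ASCII letters and digits, lowercased.
        return re.findall(r"[a-z0-9]+", text.lower())
    if method == "lowercase":
        # Lowercase everything, then split on whitespace only.
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace and preserve case.
        return text.split()
    if method == "field":
        # Treat the whole trimmed value as a single token.
        return [text.strip()]
    raise ValueError(f"unknown tokenization method: {method}")


sample = "Hello, (beautiful) World"
for method in ("word", "lowercase", "whitespace", "field"):
    print(f"{method:<10} {tokenize(sample, method)}")
```

The choice matters for keyword search and filtering: with whitespace tokenization, a filter for "world" would not match the token "World", whereas it would with lowercase or word tokenization.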
Cross-references
This is also where you would specify cross-references, which are a special type of property that links to another collection.
Cross-references can be very useful for creating relationships between objects. For example, you might have a Movie collection with a withActor cross-reference property that points to the Actor collection. This will allow you to retrieve relevant actors for each movie.
However, cross-references can be costly in terms of performance. Use them sparingly. Additionally, cross-reference properties do not affect the object's vector. So if you want the related properties to be considered in a vector search, they should be included in the object's vectorized properties.
You can find examples of how to define and use cross-references here.
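As a rough illustration of the Movie and Actor example, the sketch below writes both collection definitions as plain Python dictionaries in the same raw format as the returned schema earlier. In that format, a cross-reference is a property whose dataType names the target collection rather than a primitive type such as text. The `title` and `name` properties and the `cross_reference_targets` helper are hypothetical. Only `withActor` comes from the text above.

```python
# Target collection. It would need to exist before Movie can reference it.
actor = {
    "class": "Actor",
    "properties": [
        {"name": "name", "dataType": ["text"]},  # hypothetical property
    ],
}

movie = {
    "class": "Movie",
    "properties": [
        {"name": "title", "dataType": ["text"]},  # hypothetical property
        {"name": "withActor", "dataType": ["Actor"]},  # cross-reference to Actor
    ],
}


def cross_reference_targets(definition: dict) -> dict[str, list[str]]:
    """Map each cross-reference property to the collections it points to.

    Relies on the naming rule that collection names start with a capital
    letter, while primitive data types start with a lowercase letter.
    """
    targets = {}
    for prop in definition.get("properties", []):
        referenced = [t for t in prop["dataType"] if t[:1].isupper()]
        if referenced:
            targets[prop["name"]] = referenced
    return targets


print(cross_reference_targets(movie))
print(cross_reference_targets(actor))
```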
Vectorizer and generative modules
Each collection can be configured with a vectorizer and a generative module. The vectorizer is used to generate vectors for each object and also for any un-vectorized queries, and the generative module is used to perform retrieval augmented generation (RAG) queries.
These settings are currently immutable once the collection is created. Accordingly, you should choose the vectorizer and generative module carefully.
If you are not sure where to start, modules that integrate with popular API-based model providers such as Cohere or OpenAI are good starting points. You can find a list of available vectorizer modules here and generative modules here.
Multi-tenancy settings
Starting from version v1.20.0, each collection can be configured as a multi-tenancy collection. This allows separation of data between tenants, typically end-users, at a much lower overhead than creating separate collections for each tenant.
This is useful if you want to use Weaviate as a backend for a multi-tenant (e.g. SaaS) application, or if data isolation is required for any other reason.
- Python Client v4
- Python Client v3
- JS/TS Client v3
- JS/TS Client v2
import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    questions = client.collections.create(
        name="Question",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Set the vectorizer to "text2vec-openai" to use the OpenAI API for vector-related operations
        generative_config=wvc.config.Configure.Generative.cohere(),  # Set the generative module to "generative-cohere" to use the Cohere API for RAG
        properties=[
            wvc.config.Property(
                name="question",
                data_type=wvc.config.DataType.TEXT,
            ),
            wvc.config.Property(
                name="answer",
                data_type=wvc.config.DataType.TEXT,
            ),
        ],
        multi_tenancy_config=wvc.config.Configure.multi_tenancy(enabled=True),  # Enable multi-tenancy
    )
finally:
    client.close()  # Always close the connection, even if creation fails
import weaviate
import json
client = weaviate.Client("https://WEAVIATE_INSTANCE_URL/") # Replace with your Weaviate endpoint
# we will create the class "Question"
class_obj = {
"class": "Question",
"description": "Information from a Jeopardy! question", # description of the class
"vectorizer": "text2vec-openai",
"moduleConfig": {
"generative-openai": {} # Set `generative-openai` as the generative module
},
"properties": [
{
"name": "question",
"dataType": ["text"],
"description": "The question",
"tokenization": "lowercase", # Use "lowercase" tokenization (a property-level setting)
"moduleConfig": {
"text2vec-openai": { # this must match the vectorizer used
"vectorizePropertyName": True
}
}
},
{
"name": "answer",
"dataType": ["text"],
"description": "The answer",
"tokenization": "whitespace", # Use "whitespace" tokenization (a property-level setting)
"moduleConfig": {
"text2vec-openai": { # this must match the vectorizer used
"vectorizePropertyName": False
}
}
},
],
"multiTenancyConfig": {"enabled": True}, # Enable multi-tenancy
}
# add the schema
client.schema.create_class(class_obj)
import weaviate, { WeaviateClient } from 'weaviate-client';
const client: WeaviateClient = await weaviate.connectToWeaviateCloud(
'WEAVIATE_INSTANCE_URL', { // Replace WEAVIATE_INSTANCE_URL with your instance URL
authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_API_KEY'),
}
)
// Define the 'Question' class
const collectionObj = {
name: 'Question',
properties: [
{
name: 'question',
dataType: 'text' as const,
description: 'The question',
tokenization: 'lowercase' as const,
vectorizePropertyName: true,
},
{
name: 'answer',
dataType: 'text' as const,
description: 'The answer',
tokenization: 'whitespace' as const,
vectorizePropertyName: false,
}
],
vectorizers: weaviate.configure.vectorizer.text2VecOpenAI(),
generative: weaviate.configure.generative.openAI(),
multiTenancy: weaviate.configure.multiTenancy({enabled: true})
}
// Add the class to the schema
const newCollection = await client.collections.create(collectionObj)
import weaviate from 'weaviate-ts-client';
const client = weaviate.client({
scheme: 'https',
host: 'WEAVIATE_INSTANCE_URL', // Replace with your Weaviate endpoint
});
// Define the 'Question' class
const classObj = {
class: 'Question',
description: 'Information from a Jeopardy! question', // description of the class
vectorizer: 'text2vec-openai',
moduleConfig: {
'generative-openai': {} // Set `generative-openai` as the generative module
},
properties: [
{
name: 'question',
dataType: ['text'],
description: 'The question',
tokenization: 'lowercase', // Use "lowercase" tokenization (a property-level setting)
moduleConfig: {
'text2vec-openai': { // this must match the vectorizer used
vectorizePropertyName: true,
},
}
},
{
name: 'answer',
dataType: ['text'],
description: 'The answer',
tokenization: 'whitespace', // Use "whitespace" tokenization (a property-level setting)
moduleConfig: {
'text2vec-openai': { // this must match the vectorizer used
vectorizePropertyName: false,
},
}
},
],
multiTenancyConfig: { enabled: true } // Enable multi-tenancy
};
// Add the class to the schema
await client
.schema
.classCreator()
.withClass(classObj)
.do();
Index settings
Weaviate uses two types of indexes: vector indexes and inverted indexes. Vector indexes are used to store and organize vectors for fast vector similarity-based searches. Inverted indexes are used to store data for fast filtering and keyword searches.
The default vector index type is HNSW. The other options are flat, which suits small collections such as the per-tenant data in a multi-tenancy collection, and dynamic, which starts as a flat index and switches to an HNSW index once its size grows beyond a predetermined threshold.
- Python Client v4
- Python Client v3
- JS/TS Client v3
- JS/TS Client v2
import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    questions = client.collections.create(
        name="Question",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Set the vectorizer to "text2vec-openai" to use the OpenAI API for vector-related operations
        generative_config=wvc.config.Configure.Generative.cohere(),  # Set the generative module to "generative-cohere" to use the Cohere API for RAG
        properties=[
            wvc.config.Property(
                name="question",
                data_type=wvc.config.DataType.TEXT,
            ),
            wvc.config.Property(
                name="answer",
                data_type=wvc.config.DataType.TEXT,
            ),
        ],
        # Configure the vector index
        vector_index_config=wvc.config.Configure.VectorIndex.hnsw(  # Or `flat` or `dynamic`
            distance_metric=wvc.config.VectorDistances.COSINE,
            quantizer=wvc.config.Configure.VectorIndex.Quantizer.bq(),
        ),
        # Configure the inverted index
        inverted_index_config=wvc.config.Configure.inverted_index(
            index_null_state=True,
            index_property_length=True,
            index_timestamps=True,
        ),
    )
finally:
    client.close()  # Always close the connection, even if creation fails
import weaviate
import json
client = weaviate.Client("https://WEAVIATE_INSTANCE_URL/") # Replace with your Weaviate endpoint
# we will create the class "Question"
class_obj = {
"class": "Question",
"description": "Information from a Jeopardy! question", # description of the class
"vectorizer": "text2vec-openai",
"moduleConfig": {
"generative-openai": {} # Set `generative-openai` as the generative module
},
"properties": [
{
"name": "question",
"dataType": ["text"],
"description": "The question",
"tokenization": "lowercase", # Use "lowercase" tokenization (a property-level setting)
"moduleConfig": {
"text2vec-openai": { # this must match the vectorizer used
"vectorizePropertyName": True
}
}
},
{
"name": "answer",
"dataType": ["text"],
"description": "The answer",
"tokenization": "whitespace", # Use "whitespace" tokenization (a property-level setting)
"moduleConfig": {
"text2vec-openai": { # this must match the vectorizer used
"vectorizePropertyName": False
}
}
},
],
# Configure the vector index
"vectorIndexType": "hnsw", # Or "flat" or "dynamic"
"vectorIndexConfig": {
"distance": "cosine",
"bq": {
"enabled": True,
},
},
# Configure the inverted index (these options belong under "invertedIndexConfig")
"invertedIndexConfig": {
"indexTimestamps": True,
"indexNullState": True,
"indexPropertyLength": True,
},
}
# add the schema
client.schema.create_class(class_obj)
import weaviate, { WeaviateClient } from 'weaviate-client';
import { vectorizer, generative, configure, dataType } from 'weaviate-client';
const client: WeaviateClient = await weaviate.connectToWeaviateCloud(
'WEAVIATE_INSTANCE_URL', { // Replace WEAVIATE_INSTANCE_URL with your instance URL
authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_API_KEY'),
}
)
// Define the 'Question' class
const collectionObj = {
name: 'Question',
properties: [
{
name: 'question',
dataType: 'text' as const,
description: 'The question',
tokenization: 'lowercase' as const,
vectorizePropertyName: true,
},
{
name: 'answer',
dataType: 'text' as const,
description: 'The answer',
tokenization: 'whitespace' as const,
vectorizePropertyName: false,
}
],
vectorizers: vectorizer.text2VecOpenAI({
vectorIndexConfig: configure.vectorIndex.hnsw({ // Or `flat` or `dynamic`
distanceMetric: 'cosine',
quantizer: configure.vectorIndex.quantizer.bq(),
})
}),
generative: generative.openAI(),
invertedIndex: configure.invertedIndex({
indexNullState: true,
indexPropertyLength: true,
indexTimestamps: true,
}),
}
// Add the class to the schema
const newCollection = await client.collections.create(collectionObj)
import weaviate from 'weaviate-ts-client';
const client = weaviate.client({
scheme: 'https',
host: 'WEAVIATE_INSTANCE_URL', // Replace with your Weaviate endpoint
});
// Define the 'Question' class
const classObj = {
class: 'Question',
description: 'Information from a Jeopardy! question', // description of the class
vectorizer: 'text2vec-openai',
moduleConfig: {
'generative-openai': {} // Set `generative-openai` as the generative module
},
properties: [
{
name: 'question',
dataType: ['text'],
description: 'The question',
tokenization: 'lowercase', // Use "lowercase" tokenization (a property-level setting)
moduleConfig: {
'text2vec-openai': { // this must match the vectorizer used
vectorizePropertyName: true,
},
}
},
{
name: 'answer',
dataType: ['text'],
description: 'The answer',
tokenization: 'whitespace', // Use "whitespace" tokenization (a property-level setting)
moduleConfig: {
'text2vec-openai': { // this must match the vectorizer used
vectorizePropertyName: false,
},
}
},
],
// Configure the vector index
vectorIndexType: 'hnsw', // Or `flat` or `dynamic`
vectorIndexConfig: {
distance: 'cosine',
bq: {
enabled: true,
},
},
// Configure the inverted index
invertedIndexConfig: {
indexTimestamps: true,
indexNullState: true,
indexPropertyLength: true,
},
};
// Add the class to the schema
await client
.schema
.classCreator()
.withClass(classObj)
.do();
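The behavior of the dynamic index type described at the start of this section can be pictured with a small toy model. The sketch below is plain Python and only mimics the switching rule. The `DynamicIndexSketch` class and the threshold of 1000 are hypothetical, chosen for illustration rather than taken from Weaviate's defaults.

```python
class DynamicIndexSketch:
    """Toy model of a dynamic index: flat at first, HNSW once it outgrows a threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # hypothetical value, not a Weaviate default
        self.object_count = 0
        self.index_type = "flat"

    def add_objects(self, count: int) -> str:
        """Record newly added objects and return the index type now in use."""
        self.object_count += count
        if self.index_type == "flat" and self.object_count > self.threshold:
            self.index_type = "hnsw"  # in this sketch the switch is one-way
        return self.index_type


index = DynamicIndexSketch(threshold=1000)
print(index.add_objects(800))  # 800 objects: at or below the threshold
print(index.add_objects(500))  # 1300 objects: beyond the threshold
```

This is why a dynamic index can be a reasonable choice when collection sizes are hard to predict, for example when tenants vary widely in how much data they hold.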
Replication and sharding settings
Replication
Replication settings determine how many copies of the data are stored. For example, a replication setting of 3 means that each object is stored on 3 different replicas. This is important for providing redundancy and fault tolerance in production. (The default replication factor is 1.)
This goes hand-in-hand with consistency settings, which determine how many replicas must respond before an operation is considered successful.
We recommend that you read the concepts page on replication for information on how replication works in Weaviate. To specify a replication factor, follow this how-to.
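To illustrate how the replication factor and consistency settings interact, the sketch below computes how many replicas must respond for an operation to succeed. It assumes the three consistency levels ONE, QUORUM and ALL, with a quorum defined as a strict majority of replicas. The `required_acks` helper is hypothetical and written only for this worked example.

```python
def required_acks(replication_factor: int, level: str) -> int:
    """Return how many replicas must respond at the given consistency level."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # strict majority
        "ALL": replication_factor,
    }
    return levels[level]


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(3, level))
```

With a replication factor of 3, a QUORUM operation can still succeed while one replica is unavailable, whereas an ALL operation cannot.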
Sharding
Sharding settings determine how each collection is sharded and distributed across nodes. This is not a setting that is typically changed, but you can use it to control how many shards are created in a cluster, and how many virtual shards are created per physical shard (read more here).
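The idea of hash-based sharding with virtual shards can be sketched as follows. This is a deliberately simplified illustration and not Weaviate's actual placement algorithm. It uses SHA-256 from the standard library as a stand-in for murmur3, the `physical_shard` helper is hypothetical, and the modulo mapping from virtual to physical shards is an assumption made for the example. The value 128 mirrors virtualPerPhysical in the returned schema above.

```python
import hashlib


def physical_shard(object_id: str, physical_count: int, virtual_per_physical: int = 128) -> int:
    """Map an object ID to a physical shard via a virtual shard (simplified)."""
    virtual_count = physical_count * virtual_per_physical
    # Stand-in hash: SHA-256 instead of murmur3, which is not in the standard library.
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    virtual_shard = int.from_bytes(digest[:8], "big") % virtual_count
    # Assumed assignment: virtual shards are dealt out to physical shards in turn.
    return virtual_shard % physical_count


# Hypothetical object IDs spread over three physical shards.
for object_id in ("object-a", "object-b", "object-c", "object-d"):
    print(object_id, "->", physical_shard(object_id, physical_count=3))
```

Because the mapping depends only on the object ID, the same object always lands on the same shard, which is what allows any node to route a request without a lookup table.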
Notes
Collection & property names
Collection names always start with a capital letter. Property names always begin with a lowercase letter. You can use PascalCase for collection names, and property names allow underscores. Read more here.
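These naming rules can be checked before sending a definition to Weaviate. The sketch below encodes only what is stated above: a capital first letter for collections and a lowercase first letter for properties. The set of characters allowed after the first one (letters, digits and underscores) is an assumption, and both helper functions are hypothetical.

```python
import re

# First character per the rules above; remaining characters are an assumption.
COLLECTION_NAME = re.compile(r"[A-Z][A-Za-z0-9_]*")
PROPERTY_NAME = re.compile(r"[a-z][A-Za-z0-9_]*")


def is_valid_collection_name(name: str) -> bool:
    """Check that a collection name starts with a capital letter."""
    return COLLECTION_NAME.fullmatch(name) is not None


def is_valid_property_name(name: str) -> bool:
    """Check that a property name starts with a lowercase letter."""
    return PROPERTY_NAME.fullmatch(name) is not None


for name in ("Question", "JeopardyQuestion", "question"):
    print(name, is_valid_collection_name(name))
for name in ("answer", "has_category", "Answer"):
    print(name, is_valid_property_name(name))
```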
Related resources
The following resources include more detailed information on schema settings and how to use them:
- Schema - Reference: Configuration: A reference of all available schema settings.
- Collections - How-to: manage data: Code examples for creating and managing collections, including how to configure various settings using client libraries.
- Schema - Reference: REST: A reference of all available schema settings for the REST API.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.