Schemas in detail

Overview

In this section, we will explore schema construction, including discussing some of the more commonly specified parameters. We will also discuss the auto-schema feature and why you might want to take the time to manually set the schema.

Prerequisites

We recommend you complete the Quickstart tutorial first.

Before you start this tutorial, follow the steps in the earlier tutorials so that you have:

  • A new instance of Weaviate running (e.g. on the Weaviate Cloud Services),
  • An API key for your preferred inference API, such as OpenAI, Cohere, or Hugging Face, and
  • Your preferred Weaviate client library installed.

If you have completed the entire Quickstart tutorial, your Weaviate instance will contain data objects and a schema. We recommend deleting the Question class before starting this section. See below for details on how to do so:

Deleting classes

You can delete any unwanted class(es), along with the data that they contain.

Deleting a class == Deleting its objects

Note that deleting a class also deletes all of its objects!

Do not do this on a production database, or anywhere else you want to keep your data.

Run the code below to delete the relevant class and its objects.

# delete class "YourClassName" - THIS WILL DELETE ALL DATA IN THIS CLASS
client.schema.delete_class("YourClassName") # Replace with your class name - e.g. "Question"
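Before deleting, it can help to confirm that the class is actually present. The sketch below assumes the schema has the shape returned by client.schema.get() (a dict with a "classes" list, as shown later in this section). The class_exists helper and the example_schema stand-in are hypothetical and not part of the client library; they are used here so that the example runs without a live instance.

```python
def class_exists(schema: dict, class_name: str) -> bool:
    """Return True if `class_name` appears in a schema dict.

    Assumes the shape {"classes": [{"class": "Name", ...}, ...]}.
    """
    return any(c.get("class") == class_name for c in schema.get("classes") or [])


# Stand-in for the result of client.schema.get(), so the example runs offline.
example_schema = {"classes": [{"class": "Question", "properties": []}]}

if class_exists(example_schema, "Question"):
    # With a live instance you would now call:
    # client.schema.delete_class("Question")
    print("Question exists and can be deleted")
```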

Introduction

What is a schema?

Weaviate's schema defines its data structure in a formal language. In other words, it is a blueprint of how the data is to be organized and stored.

The schema defines data classes (i.e. collections of objects), the properties within each class (name, type, description, settings), possible graph links between data objects (cross-references), and the vectorizer module (if any) to be used for each class, as well as settings such as module and index configurations.

Quickstart recap

In the Quickstart tutorial, you saw how to specify the name and the vectorizer for a data collection, called a "class" in Weaviate:

class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    "moduleConfig": {
        "text2vec-openai": {},
        "generative-openai": {}  # Ensure the `generative-openai` module is used for generative queries
    }
}

client.schema.create_class(class_obj)

If you then navigated to the schema endpoint at https://some-endpoint.weaviate.network/v1/schema, you will have seen the class name and vectorizer specified above.

But you might have also noticed that the schema included a whole lot of information that you did not specify.

That's because Weaviate inferred it for you, using the "auto-schema" feature.

Auto-schema vs. manual schema

Weaviate requires a complete schema for each class of data objects.

If any required information is missing, Weaviate will use the auto-schema feature to infer the rest from the data being imported as well as the default settings.

While this may be suitable in some circumstances, in many cases you may wish to explicitly define a schema. Manually defining the schema will help you ensure that the schema is suited for your specific data and needs.
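To make the idea of inferring a schema from data concrete, here is a toy illustration. It is not Weaviate's implementation: the guess_data_type and guess_properties helpers are hypothetical, and the mapping from Python values to data type names is an assumption chosen for illustration only.

```python
from datetime import datetime


def guess_data_type(value) -> str:
    """Guess a Weaviate-style dataType name for a single JSON-like value."""
    # bool must be checked before int/float because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        try:
            # Treat strings that parse as ISO 8601 timestamps as dates.
            datetime.fromisoformat(value.replace("Z", "+00:00"))
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def guess_properties(obj: dict) -> list:
    """Build a list of property definitions from one example object."""
    return [{"name": k, "dataType": [guess_data_type(v)]} for k, v in obj.items()]


sample = {"question": "What is DNA?", "points": 200, "airDate": "1984-09-10T00:00:00Z"}
print(guess_properties(sample))
```

A guess like this depends entirely on the first values it sees, which is one reason an explicit schema can be the safer choice.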

Create a class

A collection of data in Weaviate is called a "class". We will be adding a class to store our quiz data.

About classes

Here are some key considerations about classes:

Each Weaviate class:

  • Always has a name that starts with a capital letter. This is to distinguish class names from generic names for cross-referencing.
  • Constitutes a distinct vector space. A search in Weaviate is always restricted to a class.
  • Can have its own vectorizer (e.g. one class can have a text2vec-openai vectorizer, another might have a multi2vec-clip vectorizer, and a third might have none if you do not intend to use a vectorizer).
  • Has properties, each of which specifies the data type it stores.

Can I specify my own vectors?

Yes! You can bring your own vectors and pass them to Weaviate directly. See this reference for more information.

Create a basic class

Let's create a class called Question for our data.

Our Question class will:

  • Contain three properties:
    • name answer: type text
    • name question: type text
    • name category: type text
  • Use a text2vec-openai vectorizer

Run the code below with your client to define the schema for the Question class and display the created schema information.

import weaviate
import json

client = weaviate.Client("https://some-endpoint.weaviate.network/") # Replace with your endpoint

# we will create the class "Question"
class_obj = {
    "class": "Question",
    "description": "Information from a Jeopardy! question",  # description of the class
    "properties": [
        {
            "dataType": ["text"],
            "description": "The question",
            "name": "question",
        },
        {
            "dataType": ["text"],
            "description": "The answer",
            "name": "answer",
        },
        {
            "dataType": ["text"],
            "description": "The category",
            "name": "category",
        },
    ],
    "vectorizer": "text2vec-openai",
}

# add the schema
client.schema.create_class(class_obj)

# get the schema
schema = client.schema.get()

# print the schema
print(json.dumps(schema, indent=4))

Classes and Properties - best practice

Class names always start with a capital letter. Property names always begin with a lowercase letter. Class names can use CamelCase, and property names can contain underscores. Read more about schema classes, properties and data types here.
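These naming conventions can be checked before a class definition is sent to Weaviate. The following is a minimal sketch that only inspects the first character of each name, as described above; check_names is a hypothetical helper, and Weaviate's own validation may apply further rules.

```python
def check_names(class_obj: dict) -> list:
    """Return a list of human-readable problems; empty if the names look fine."""
    problems = []
    class_name = class_obj.get("class", "")
    # Class names are expected to start with a capital letter.
    if not class_name[:1].isupper():
        problems.append(f"class name {class_name!r} should start with a capital letter")
    # Property names are expected to start with a lowercase letter.
    for prop in class_obj.get("properties", []):
        name = prop.get("name", "")
        if not name[:1].islower():
            problems.append(f"property name {name!r} should start with a lowercase letter")
    return problems


good = {"class": "Question", "properties": [{"name": "answer"}, {"name": "category"}]}
bad = {"class": "question", "properties": [{"name": "Answer"}]}

print(check_names(good))
print(check_names(bad))
```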

The schema printed after creating the Question class should look something like this:

See the returned schema
{
    "classes": [
        {
            "class": "Question",
            "description": "Information from a Jeopardy! question",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "moduleConfig": {
                "text2vec-openai": {
                    "model": "ada",
                    "modelVersion": "002",
                    "type": "text",
                    "vectorizeClassName": true
                }
            },
            "properties": [
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "The question",
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": false,
                            "vectorizePropertyName": false
                        }
                    },
                    "name": "question",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "The answer",
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": false,
                            "vectorizePropertyName": false
                        }
                    },
                    "name": "answer",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "The category",
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": false,
                            "vectorizePropertyName": false
                        }
                    },
                    "name": "category",
                    "tokenization": "word"
                }
            ],
            "replicationConfig": {
                "factor": 1
            },
            "shardingConfig": {
                "virtualPerPhysical": 128,
                "desiredCount": 1,
                "actualCount": 1,
                "desiredVirtualCount": 128,
                "actualVirtualCount": 128,
                "key": "_id",
                "strategy": "hash",
                "function": "murmur3"
            },
            "vectorIndexConfig": {
                "skip": false,
                "cleanupIntervalSeconds": 300,
                "maxConnections": 64,
                "efConstruction": 128,
                "ef": -1,
                "dynamicEfMin": 100,
                "dynamicEfMax": 500,
                "dynamicEfFactor": 8,
                "vectorCacheMaxObjects": 1000000000000,
                "flatSearchCutoff": 40000,
                "distance": "cosine"
            },
            "vectorIndexType": "hnsw",
            "vectorizer": "text2vec-openai"
        }
    ]
}

We get back a lot of information here.

Some of it is what we specified, such as the class name (class) and the properties, including their dataType and name. The rest was filled in by Weaviate based on the defaults and the data provided.
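One way to see which settings were filled in for you is to compare the definition you submitted with the one Weaviate returns. The sketch below does this for top-level keys only; defaulted_keys is a hypothetical helper, and the returned dict is an abbreviated stand-in for one entry of the "classes" list rather than real server output.

```python
def defaulted_keys(specified: dict, returned: dict) -> list:
    """List top-level keys present in `returned` but absent from `specified`."""
    return sorted(k for k in returned if k not in specified)


specified = {
    "class": "Question",
    "description": "Information from a Jeopardy! question",
    "properties": [],
    "vectorizer": "text2vec-openai",
}

# Abbreviated stand-in for one entry of client.schema.get()["classes"].
returned = {
    **specified,
    "invertedIndexConfig": {"bm25": {"b": 0.75, "k1": 1.2}},
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {"distance": "cosine"},
}

print(defaulted_keys(specified, returned))
```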

Class property specification examples

Depending on your needs, you might want to change any number of these. For example, you might change:

  • dataType to modify the type of data being saved. For example, properties with dataType text will be tokenized differently to those with string dataType (read more).
  • moduleConfig to modify how each module behaves. In this case, you could change the model and/or version for the OpenAI inference API, and the vectorization behavior such as whether the class name is used for vectorization.
  • properties / moduleConfig to further modify module behavior at the level of an individual property. For example, you might exclude a particular property from vectorization.
  • invertedIndexConfig to add or remove particular stopwords, or change BM25 indexing constants.
  • vectorIndexConfig to change vector index (e.g. HNSW) parameters, such as for speed / recall tradeoffs.

So for example, you might specify a schema like the one below:

{
    "class": "Question",
    "description": "Information from a Jeopardy! question",
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": false  // Default: true
        }
    },
    "invertedIndexConfig": {
        "bm25": {
            "k1": 1.5,  // Default: 1.2
            "b": 0.75
        }
    },
    "properties": [
        {
            "dataType": ["text"],
            "description": "The question",
            "moduleConfig": {
                "text2vec-openai": {
                    "vectorizePropertyName": true  // Default: false
                }
            },
            "name": "question"
        },
        ...
    ]
}

This changes the specified settings from their defaults. Note that the rest of the tutorials assume that you have not done this.
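When working with the Python client, the same definition is written as a dict, using True and False instead of true and false and Python comments instead of // comments. The following is a sketch of that translation. It covers only the settings shown above, and the commented-out create_class call assumes the same client object used earlier in this section.

```python
import json

class_obj = {
    "class": "Question",
    "description": "Information from a Jeopardy! question",
    "moduleConfig": {
        "text2vec-openai": {"vectorizeClassName": False},  # default: True
    },
    "invertedIndexConfig": {
        "bm25": {"k1": 1.5, "b": 0.75},  # default k1: 1.2
    },
    "properties": [
        {
            "dataType": ["text"],
            "description": "The question",
            "moduleConfig": {
                "text2vec-openai": {"vectorizePropertyName": True},  # default: False
            },
            "name": "question",
        },
        # Remaining properties are omitted here, as in the example above.
    ],
}

# Serializing shows the JSON that corresponds to the dict.
payload = json.dumps(class_obj, indent=4)
print(payload)

# With a live instance you would then call:
# client.schema.create_class(class_obj)
```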

You can read more about various schema, data types, modules, and index configuration options in the pages below.

Recap

  • The schema is where you define the structure of the information to be saved.
  • A schema consists of classes and properties, which define concepts.
  • Any unspecified setting is inferred by the auto-schema feature based on the data and defaults.
  • The schema can be modified through the RESTful API.
  • A class or property in Weaviate is immutable, but can always be extended.
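As an illustration of the last point, extending a class usually means adding a new property rather than changing an existing one. The sketch below builds a new property definition and refuses to reuse an existing name. The new_property helper and the "round" property are hypothetical, and the commented-out client.schema.property.create call is assumed to be the relevant method in the same Python client version used earlier in this section.

```python
def new_property(class_def: dict, name: str, data_type: str, description: str = "") -> dict:
    """Build a property definition, rejecting names the class already uses."""
    existing = {p["name"] for p in class_def.get("properties", [])}
    if name in existing:
        raise ValueError(f"property {name!r} already exists and cannot be redefined")
    return {"name": name, "dataType": [data_type], "description": description}


# Stand-in for the existing class definition, so the example runs offline.
question_class = {
    "class": "Question",
    "properties": [{"name": "question"}, {"name": "answer"}, {"name": "category"}],
}

round_prop = new_property(question_class, "round", "text", "The round of the game")
print(round_prop)

# With a live instance you would then call (assumed client method):
# client.schema.property.create("Question", round_prop)
```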

Suggested reading
