Data structure

Data object concepts

Each data object in Weaviate belongs to a collection and has one or more properties.

Weaviate stores data objects in class-based collections. Data objects are represented as JSON documents. Objects normally include a vector that is derived from a machine learning model. The vector is also called an embedding or a vector embedding.

Each collection contains objects of the same class. The objects are defined by a common schema.

Capitalization

Weaviate follows GraphQL naming conventions.

Start collection names with an upper case letter.
Start property names with a lower case letter.

If you use an initial upper case letter to define a property name, Weaviate changes it to a lower case letter internally.
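The naming convention above can be sketched as two small client-side helpers. The function names are hypothetical and are not part of any Weaviate client; they only illustrate the rule of adjusting the first character.

```python
def normalize_collection_name(name: str) -> str:
    """Return the name with its first character upper-cased."""
    return name[:1].upper() + name[1:]


def normalize_property_name(name: str) -> str:
    """Return the name with its first character lower-cased."""
    return name[:1].lower() + name[1:]


print(normalize_collection_name("author"))       # Author
print(normalize_property_name("WonNobelPrize"))  # wonNobelPrize
```

Only the first character changes, so the rest of a camel-case name such as wonNobelPrize is preserved.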

JSON documents as objects

Imagine we need to store information about an author named Alice Munro. In JSON format the data looks like this:

{
    "name": "Alice Munro",
    "age": 91,
    "born": "1931-07-10T00:00:00.0Z",
    "wonNobelPrize": true,
    "description": "Alice Ann Munro is a Canadian short story writer who won the Nobel Prize in Literature in 2013. Munro's work has been described as revolutionizing the architecture of short stories, especially in its tendency to move forward and backward in time."
}

Vectors

You can also attach vector representations to your data objects. Vectors are arrays of numbers that are stored under the "vector" property.

In this example, the Alice Munro data object has a small vector. The vector represents some information about Alice, such as a story or an image, that a machine learning model has transformed into an array of numerical values.

{
    "id": "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
    "class": "Author",
    "properties": {
        "name": "Alice Munro",
        (...)
    },
    "vector": [
        -0.16147631,
        -0.065765485,
        -0.06546908
    ]
}

To generate vectors for your data, use one of Weaviate's vectorizer modules. You can also use your own vectorizer.
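If you supply your own vector, the object layout shown above can be assembled in a few lines. This is a minimal sketch using only the Python standard library; make_object is a hypothetical helper, and the vector values are copied from the example rather than produced by a model.

```python
import json
import uuid


def make_object(collection, properties, vector, object_id=None):
    """Assemble a dict with the same top-level keys as the JSON example."""
    return {
        "id": object_id or str(uuid.uuid4()),  # random UUID when none is given
        "class": collection,
        "properties": properties,
        "vector": list(vector),
    }


alice = make_object(
    "Author",
    {"name": "Alice Munro", "wonNobelPrize": True},
    [-0.16147631, -0.065765485, -0.06546908],
)
print(json.dumps(alice, indent=4))
```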

Collections

Collections are groups of objects that share a schema definition.

In this example, the Author collection holds objects that represent different authors.

The collection looks like this:

[{
    "id": "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
    "class": "Author",
    "properties": {
        "name": "Alice Munro",
        "age": 91,
        "born": "1931-07-10T00:00:00.0Z",
        "wonNobelPrize": true,
        "description": "Alice Ann Munro is a Canadian short story writer who won the Nobel Prize in Literature in 2013. Munro's work has been described as revolutionizing the architecture of short stories, especially in its tendency to move forward and backward in time."
    },
    "vector": [
        -0.16147631,
        -0.065765485,
        -0.06546908
    ]
}, {
    "id": "779c8970-0594-301c-bff5-d12907414002",
    "class": "Author",
    "properties": {
        "name": "Paul Krugman",
        "age": 69,
        "born": "1953-02-28T00:00:00.0Z",
        "wonNobelPrize": true,
        "description": "Paul Robin Krugman is an American economist and public intellectual, who is Distinguished Professor of Economics at the Graduate Center of the City University of New York, and a columnist for The New York Times. In 2008, Krugman was the winner of the Nobel Memorial Prize in Economic Sciences for his contributions to New Trade Theory and New Economic Geography."
    },
    "vector": [
        -0.93070928,
        -0.03782172,
        -0.56288009
    ]
}]

Every collection has its own vector space. This means that different collections can have different embeddings of the same object.

UUIDs

Every object stored in Weaviate has a UUID. The UUID guarantees uniqueness across all collections.

You can use a deterministic UUID to ensure that the same object always has the same UUID. This is useful when you want to update an object without changing its UUID.
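One way to get a deterministic UUID is a name-based (version 5) UUID from the Python standard library. The sketch below is an illustration under stated assumptions: the namespace and the deterministic_uuid helper are made up for this example, and Weaviate client libraries may provide their own helper for the same purpose.

```python
import json
import uuid

# Hypothetical namespace for this example. Any fixed UUID works,
# as long as it never changes between runs.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def deterministic_uuid(collection: str, key: dict) -> str:
    """Derive a UUID from the collection name and an identifying key."""
    # sort_keys makes the serialized form independent of key order.
    payload = json.dumps({"class": collection, "key": key}, sort_keys=True)
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, payload))


first = deterministic_uuid("Author", {"name": "Alice Munro", "born": "1931-07-10"})
second = deterministic_uuid("Author", {"born": "1931-07-10", "name": "Alice Munro"})
print(first == second)  # True
```

Derive the UUID from values that do not change over the object's lifetime. If a mutable property such as age were part of the key, an update would produce a different UUID.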

If you don't specify an ID, Weaviate generates a random UUID for you.

When a request does not specify any other ordering, Weaviate processes objects in ascending UUID order. This applies to requests that list objects, use the cursor API, or delete objects.
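To see what ascending UUID order looks like, the sketch below sorts the three example IDs used on this page. It assumes that the order can be modeled by comparing uuid.UUID values; the exact comparison Weaviate uses internally is not described here.

```python
import uuid

ids = [
    "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
    "779c8970-0594-301c-bff5-d12907414002",
    "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
]

# Assumption: ascending order is modeled by comparing parsed UUID values.
ordered = sorted(ids, key=uuid.UUID)
for object_id in ordered:
    print(object_id)
```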

Cross-references

Cross-references and query performance

Queries involving cross-references can be slower than queries that do not involve cross-references, especially at scale, for example when a query touches many objects or is complex.

As a first step, we strongly encourage you to consider whether you can avoid using cross-references in your data schema. As a scalable AI-native database, Weaviate is well-placed to perform complex queries with vector, keyword and hybrid searches involving filters. You may benefit from rethinking your data schema to avoid cross-references where possible.

For example, instead of creating separate "Author" and "Book" collections with cross-references, consider embedding author information directly in Book objects and using searches and filters to find books by author characteristics.
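The denormalized layout can be sketched with plain Python data. The property names (authorName, authorWonNobelPrize), the second record, and the books_where helper are all hypothetical; in Weaviate the same lookup would be a filter on the Book collection rather than a list comprehension.

```python
# Each Book object carries the author fields it needs for filtering.
books = [
    {"title": "Dear Life", "authorName": "Alice Munro", "authorWonNobelPrize": True},
    {"title": "Example Title", "authorName": "Example Author", "authorWonNobelPrize": False},
]


def books_where(items, **conditions):
    """Return the items whose properties equal every given condition."""
    return [
        item for item in items
        if all(item.get(key) == value for key, value in conditions.items())
    ]


nobel_books = books_where(books, authorWonNobelPrize=True)
print([book["title"] for book in nobel_books])  # ['Dear Life']
```

The trade-off is duplication: author details are repeated on every book, so a change to an author has to be written to each of that author's Book objects.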

If data objects are related, you can use cross-references to represent the relationships. Cross-references in Weaviate are like links that help you retrieve related information. Cross-references capture relationships, but they do not change the vectors of the underlying objects.

To create a reference, add a property to one collection whose value points to an object in the other collection.

Cross-reference example

For example, "Paul Krugman writes for the New York Times" describes a relationship between Paul Krugman and the New York Times. To capture that relationship, create a cross-reference between the Publication object that represents the New York Times and the Author object that represents Paul Krugman.

The New York Times Publication object looks like this. Note the UUID in the "id" field:

{
    "id": "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
    "class": "Publication",
    "properties": {
        "name": "The New York Times"
    },
    "vector": [...]
}

The Paul Krugman Author object adds a new property, writesFor, to capture the relationship.

{
    "id": "779c8970-0594-301c-bff5-d12907414002",
    "class": "Author",
    "properties": {
        "name": "Paul Krugman",
        ...
        "writesFor": [
            {
                "beacon": "weaviate://localhost/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
                "href": "/v1/objects/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6"
            }
        ]
    },
    "vector": [...]
}

The beacon sub-property contains the id value of the New York Times Publication object.
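The beacon and href strings in the example follow a simple pattern that can be built from the target object's id. The helper names below are hypothetical, and the localhost host part is taken from the example above.

```python
def make_reference(target_id: str, host: str = "localhost") -> dict:
    """Build a cross-reference entry in the shape shown in the example."""
    return {
        "beacon": f"weaviate://{host}/{target_id}",
        "href": f"/v1/objects/{target_id}",
    }


def beacon_target_id(beacon: str) -> str:
    """Return the last path segment of a beacon, which holds the object id."""
    return beacon.rstrip("/").rsplit("/", 1)[-1]


nyt_id = "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6"
reference = make_reference(nyt_id)
print(reference["beacon"])  # weaviate://localhost/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6
print(beacon_target_id(reference["beacon"]) == nyt_id)  # True
```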

Cross-reference relationships are directional. To make the link bi-directional, update the Publication collection to add a hasAuthors property that points back to the Author collection.

Multiple vector embeddings (named vectors)

Added in v1.24.0

Weaviate collections support multiple named vectors.


The vectors in a collection can have their own configurations. Each vector space can set its own index, its own compression algorithm, and its own vectorizer. This means you can use different vectorization models, and apply different distance metrics, to the same object.

To work with named vectors, adjust your queries to specify a target vector for vector search or hybrid search queries.
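The idea of a target vector can be illustrated with a toy in-memory search. Everything here is an assumption made for the sketch: the vector names (description, portrait), the two-dimensional values, the "vectors" layout, and the brute-force nearest function. A real query goes through the vector index of the chosen vector space.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def squared_l2_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical per-vector configuration: each named vector has its own metric.
VECTOR_CONFIG = {"description": cosine_distance, "portrait": squared_l2_distance}

# Toy objects with two named vectors each (made-up values).
authors = [
    {"name": "Alice Munro", "vectors": {"description": [1.0, 0.0], "portrait": [0.0, 2.0]}},
    {"name": "Paul Krugman", "vectors": {"description": [0.0, 1.0], "portrait": [0.0, 1.0]}},
]


def nearest(objects, query, target_vector):
    """Brute-force nearest neighbor within one named vector space."""
    distance = VECTOR_CONFIG[target_vector]
    return min(objects, key=lambda obj: distance(obj["vectors"][target_vector], query))


print(nearest(authors, [0.9, 0.1], "description")["name"])  # Alice Munro
print(nearest(authors, [0.0, 0.9], "portrait")["name"])     # Paul Krugman
```

The same objects give different nearest neighbors depending on the target vector, which is why the query has to name one.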

Data Schema

Weaviate requires a data schema before you add data. However, you don't have to create a data schema manually. If you don't provide one, Weaviate generates a schema based on the incoming data.

Weaviate's schema defines its data structure in a formal language. In other words, it is a blueprint of how to organize and store the data.

The schema defines data classes (i.e. collections of objects), the properties within each class (name, type, description, settings), possible graph links between data objects (cross-references), and class-level settings such as the vectorizer module (if any) and index configurations.
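As an illustration, a definition for the Author collection might look like the dictionary below. This is a sketch, not a complete or authoritative definition: the description text and the specific settings are chosen for the example, and the available keys and values should be checked against the schema configuration reference for your Weaviate version.

```python
import json

author_collection = {
    "class": "Author",
    "description": "A writer (example description)",
    "vectorizer": "none",        # assumption: vectors are supplied by the user
    "vectorIndexType": "hnsw",   # assumption: example index setting
    "properties": [
        {"name": "name", "dataType": ["text"]},
        {"name": "age", "dataType": ["int"]},
        {"name": "born", "dataType": ["date"]},
        {"name": "wonNobelPrize", "dataType": ["boolean"]},
        {"name": "description", "dataType": ["text"]},
        # A cross-reference property names the target collection as its type.
        {"name": "writesFor", "dataType": ["Publication"]},
    ],
}

print(json.dumps(author_collection, indent=4))
```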

Schema vs. Taxonomy

A Weaviate data schema is slightly different from a taxonomy. A taxonomy has a hierarchy. Read more about how taxonomies, ontologies and schemas are related in this Weaviate blog post.

Schemas fulfill several roles:

Schemas define collections and properties.
Schemas define cross-references that link collections, even collections that use different embeddings.
Schemas let you configure module behavior, ANN index settings, reverse indexes, and other features on a collection level.

For details on configuring your schema, see the schema tutorial or schema configuration.

Multi-tenancy

Multi-tenancy availability

Multi-tenancy added in v1.20

To separate data within a cluster, use multi-tenancy. Weaviate partitions the cluster into shards. Each shard holds data for a single tenant.

Sharding has several benefits:

Data isolation
Fast, efficient querying
Easy and robust setup and clean up

Tenant shards are lightweight. You can easily have 50,000 or more active shards per node. This means that you can support 1M concurrently active tenants with just 20 or so nodes.
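The node estimate follows directly from the per-node figure. A quick check, using the numbers from the paragraph above:

```python
import math

SHARDS_PER_NODE = 50_000    # active tenant shards per node
TARGET_TENANTS = 1_000_000  # concurrently active tenants, one shard each

nodes_needed = math.ceil(TARGET_TENANTS / SHARDS_PER_NODE)
print(nodes_needed)  # 20
```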

Multi-tenancy is especially useful when you want to store data for multiple customers, or when you want to store data for multiple projects.

Tenant deletion == Tenant data deletion

Deleting a tenant deletes the associated shard. As a result, deleting a tenant also deletes all of its objects.

Tenant states

Multi-tenancy availability

Tenant activity status setting added in v1.21
OFFLOADED status added in v1.26

Tenants have an activity status (also called a tenant state) that reflects their availability and storage location. A tenant can be ACTIVE, INACTIVE, OFFLOADED, OFFLOADING, or ONLOADING.

ACTIVE tenants are loaded and available for read and write operations.
In all other states, the tenant is not available for read or write access. Access attempts return an error message.
- INACTIVE tenants are stored on local disk storage for quick activation.
- OFFLOADED tenants are stored on cloud storage. This status is useful for long-term storage for tenants that are not frequently accessed.
- OFFLOADING tenants are being moved to cloud storage. This is a transient status, and therefore not user-specifiable.
- ONLOADING tenants are being loaded from cloud storage. This is a transient status, and therefore not user-specifiable. An ONLOADING tenant may be in the process of being warmed to an ACTIVE status or an INACTIVE status.

For more details on managing tenants, see Multi-tenancy operations.

Status	Available	Description	User-specifiable
`ACTIVE`	Yes	Loaded and available for read/write operations.	Yes
`INACTIVE`	No	On local disk storage, no read / write access. Access attempts return an error message.	Yes
`OFFLOADED`	No	On cloud storage, no read / write access. Access attempts return an error message.	Yes
`OFFLOADING`	No	Being moved to cloud storage, no read / write access. Access attempts return an error message.	No
`ONLOADING`	No	Being loaded from cloud storage, no read / write access. Access attempts return an error message.	No
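The table above can be encoded as a small enumeration. TenantStatus and its two properties are hypothetical names for this sketch and do not come from a Weaviate client library.

```python
from enum import Enum


class TenantStatus(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    OFFLOADED = "OFFLOADED"
    OFFLOADING = "OFFLOADING"
    ONLOADING = "ONLOADING"

    @property
    def available(self) -> bool:
        """Only ACTIVE tenants accept read and write operations."""
        return self is TenantStatus.ACTIVE

    @property
    def user_specifiable(self) -> bool:
        """Transient statuses cannot be set by the user."""
        return self in (
            TenantStatus.ACTIVE,
            TenantStatus.INACTIVE,
            TenantStatus.OFFLOADED,
        )


for status in TenantStatus:
    print(status.value, status.available, status.user_specifiable)
```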

Tenant status renamed in v1.26

In v1.26, the HOT status was renamed to ACTIVE and the COLD status was renamed to INACTIVE.

Tenant state propagation

A tenant state change may take some time to propagate across a cluster, especially a multi-node cluster.

For example, data may not be immediately available after reactivating an offloaded tenant. Similarly, data may not be immediately unavailable after offloading a tenant. This is because the tenant states are eventually consistent, and the change must be propagated to all nodes in the cluster.

Offloaded tenants

Added in v1.26.0

Offloading: AWS S3 only

As of Weaviate v1.26.0, tenants can only be offloaded to cold storage in AWS S3. Additional storage options may be added in future releases.

To offload a tenant, use the offload-s3 module.

Offloading tenants requires the relevant offload-<storage> module to be enabled in the Weaviate cluster.

When a tenant is offloaded, the entire tenant shard is moved to cloud storage. This is useful for long-term storage of tenants that are not frequently accessed. Offloaded tenants are not available for read or write operations until they are loaded back into the cluster.

Backups

Backups do not include inactive or offloaded tenants

Backups of multi-tenant collections will only include active tenants, and not inactive or offloaded tenants. Activate tenants before creating a backup to ensure all data is included.

Tenancy and IDs

Each tenancy is like a namespace, so different tenants could, in theory, have objects with the same IDs. To avoid naming problems, object IDs in multi-tenant clusters combine the tenant ID and the object ID to create an ID that is unique across tenants.
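The namespace idea can be sketched with a dictionary keyed by the pair of tenant ID and object ID. This only illustrates the concept; how Weaviate combines the two IDs internally is not described here, and the tenant names are made up.

```python
store = {}


def put(tenant: str, object_id: str, properties: dict) -> None:
    """Store an object under a key that combines tenant ID and object ID."""
    store[(tenant, object_id)] = properties


shared_id = "779c8970-0594-301c-bff5-d12907414002"
put("tenantA", shared_id, {"name": "Object for tenant A"})
put("tenantB", shared_id, {"name": "Object for tenant B"})

print(len(store))  # 2
print(store[("tenantA", shared_id)]["name"])  # Object for tenant A
```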

Tenancy and cross-references

Multi-tenancy supports some cross-references.

Cross-references like these are supported:

From a multi-tenancy object to a non-multi-tenancy object.
From a multi-tenancy object to another multi-tenancy object, as long as they belong to the same tenant.

Cross-references like these are not supported:

From a non-multi-tenancy object to a multi-tenancy object.
From a multi-tenancy object to another multi-tenancy object if they belong to different tenants.
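These four rules reduce to a short predicate. In this sketch, cross_reference_supported is a hypothetical helper, and a tenant value of None stands for an object in a collection without multi-tenancy.

```python
from typing import Optional


def cross_reference_supported(
    source_tenant: Optional[str], target_tenant: Optional[str]
) -> bool:
    """Apply the rules listed above; None means a non-multi-tenancy object."""
    if source_tenant is None:
        # A non-multi-tenancy object cannot point to a multi-tenancy object.
        return target_tenant is None
    # A multi-tenancy object can point to a non-multi-tenancy object
    # or to an object that belongs to the same tenant.
    return target_tenant is None or target_tenant == source_tenant


print(cross_reference_supported("tenantA", None))       # True
print(cross_reference_supported("tenantA", "tenantA"))  # True
print(cross_reference_supported(None, "tenantA"))       # False
print(cross_reference_supported("tenantA", "tenantB"))  # False
```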

Key features

Each tenant has a dedicated, high-performance vector index. Dedicated indexes mean faster query speeds. Instead of searching a shared index space, each tenant responds as if it were the only user on the cluster.
Each tenant's data is isolated on a dedicated shard. This means that deletes are fast and do not affect other tenants.
To scale out, add a new node to your cluster. Weaviate does not redistribute existing tenants; however, it adds new tenants to the node with the least resource usage.
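The scale-out behavior in the last item can be sketched as a least-loaded placement rule. This is a simplified model, not Weaviate's scheduler: resource usage is approximated by a tenant count per node, and the node names and place_tenant helper are hypothetical.

```python
def place_tenant(node_usage: dict) -> str:
    """Assign a new tenant to the node with the lowest usage and record it."""
    node = min(node_usage, key=node_usage.get)
    node_usage[node] += 1
    return node


# Two existing nodes, each already holding three tenants.
usage = {"node-1": 3, "node-2": 3}

# Scale out: the new node starts empty and existing tenants stay where they are.
usage["node-3"] = 0

placements = [place_tenant(usage) for _ in range(4)]
print(placements)  # ['node-3', 'node-3', 'node-3', 'node-1']
```

New tenants go to the empty node until its usage matches the others, after which placement spreads across all nodes again.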

Monitoring metrics

To group tenants together for monitoring, set PROMETHEUS_MONITORING_GROUP = true in your system configuration file.

Number of tenants per node

The number of tenants per node is limited by operating system constraints. The number of tenants cannot exceed the Linux open file limit per process.

For example, a 9-node test cluster built on n1-standard-8 machines holds around 170k active tenants. There are 18,000 to 19,000 tenants per node.

Note that these numbers relate to active tenants only. If you set unused tenants as inactive, the per-process open file limit does not apply to them.

Summary

The schema defines collections and properties.
Collections contain data objects that are described in JSON documents.
Data objects can contain a vector and properties.
Vectors come from machine learning models.
Different collections represent different vector spaces.
Cross-references link objects between collections.
Multi-tenancy isolates data for each tenant.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.
