Data structure

Data object nomenclature

Each data object in Weaviate belongs to a collection and has one or more properties.

Weaviate stores data objects in class-based collections. Data objects are represented as JSON documents. Objects normally include a vector that is derived from a machine learning model. The vector is also called an embedding or a vector embedding.

Each collection contains objects of the same class. The objects are defined by a common schema.

Capitalization

Weaviate follows GraphQL naming conventions.

  • Start collection names with an upper case letter.
  • Start property names with a lower case letter.

If you use an initial upper case letter to define a property name, Weaviate changes it to a lower case letter internally.
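The naming convention can be sketched with two small helpers. This is a plain Python illustration only; the function names `collection_name` and `property_name` are hypothetical and are not part of any Weaviate client.

```python
def collection_name(name: str) -> str:
    """Return the name with an upper case first letter (collection convention)."""
    return name[:1].upper() + name[1:]


def property_name(name: str) -> str:
    """Return the name with a lower case first letter (property convention)."""
    return name[:1].lower() + name[1:]


print(collection_name("author"))       # Author
print(property_name("WonNobelPrize"))  # wonNobelPrize
```

Only the first character changes, so the rest of a camel-case name such as wonNobelPrize is preserved.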

JSON documents as objects

Imagine we need to store information about an author named Alice Munro. In JSON format the data looks like this:

{
"name": "Alice Munro",
"age": 91,
"born": "1931-07-10T00:00:00.0Z",
"wonNobelPrize": true,
"description": "Alice Ann Munro is a Canadian short story writer who won the Nobel Prize in Literature in 2013. Munro's work has been described as revolutionizing the architecture of short stories, especially in its tendency to move forward and backward in time."
}

Vectors

You can also attach vector representations to your data objects. Vectors are arrays of numbers that are stored under the "vector" property.

In this example, the Alice Munro data object has a small vector. The vector is some information about Alice, maybe a story or an image, that a machine learning model has transformed into an array of numerical values.

{
"id": "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
"class": "Author",
"properties": {
"name": "Alice Munro",
(...)
},
"vector": [
-0.16147631,
-0.065765485,
-0.06546908
]
}

To generate vectors for your data, use one of Weaviate's vectorizer modules. You can also use your own vectorizer.

Collections

Collections are groups of objects that share a schema definition.

In this example, the Author collection holds objects that represent different authors.

The collection looks like this:

[{
"id": "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
"class": "Author",
"properties": {
"name": "Alice Munro",
"age": 91,
"born": "1931-07-10T00:00:00.0Z",
"wonNobelPrize": true,
"description": "Alice Ann Munro is a Canadian short story writer who won the Nobel Prize in Literature in 2013. Munro's work has been described as revolutionizing the architecture of short stories, especially in its tendency to move forward and backward in time."
},
"vector": [
-0.16147631,
-0.065765485,
-0.06546908
]
}, {
"id": "779c8970-0594-301c-bff5-d12907414002",
"class": "Author",
"properties": {
"name": "Paul Krugman",
"age": 69,
"born": "1953-02-28T00:00:00.0Z",
"wonNobelPrize": true,
"description": "Paul Robin Krugman is an American economist and public intellectual, who is Distinguished Professor of Economics at the Graduate Center of the City University of New York, and a columnist for The New York Times. In 2008, Krugman was the winner of the Nobel Memorial Prize in Economic Sciences for his contributions to New Trade Theory and New Economic Geography."
},
"vector": [
-0.93070928,
-0.03782172,
-0.56288009
]
}]

Every collection has its own vector space. This means that different collections can have different embeddings of the same object.
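Within a single vector space, vectors can be compared with a distance or similarity metric. The sketch below computes cosine similarity for the two example Author vectors. Cosine similarity is an assumption here: it is one common metric, and a collection may be configured with a different one. Vectors from different collections live in different spaces, so a comparison like this is only meaningful inside one collection.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The example vectors from the Author collection above.
alice = [-0.16147631, -0.065765485, -0.06546908]
paul = [-0.93070928, -0.03782172, -0.56288009]

print(cosine_similarity(alice, paul))
```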

tip

Every object stored in Weaviate has a UUID. The UUID guarantees uniqueness across all collections.
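If you supply your own IDs, Python's standard `uuid` module is one way to produce them. This is a sketch under assumptions: the namespace and the "Author/Alice Munro" key are chosen for illustration, and the result does not reproduce the IDs shown in the examples above.

```python
import uuid

# A random UUID: different on every call.
random_id = uuid.uuid4()

# A deterministic UUID: the same key always yields the same ID.
# Including the collection name in the key is an illustrative choice that
# keeps IDs distinct across collections.
deterministic_id = uuid.uuid5(uuid.NAMESPACE_DNS, "Author/Alice Munro")

print(random_id)
print(deterministic_id)
```

A deterministic ID can be convenient when the same source record may be imported more than once, because the repeated import maps to the same ID.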

Cross-references

If data objects are related, use cross-references to represent the relationships. Cross-references in Weaviate are like links that help you retrieve related information. Cross-references capture relationships, but they do not change the vectors of the underlying objects.

To create a reference, use a property from one collection to specify the value of a related property in the other collection.

Cross-reference example

For example, "Paul Krugman writes for the New York Times" describes a relationship between Paul Krugman and the New York Times. To capture that relationship, create a cross-reference between the Publication object that represents the New York Times and the Author object that represents Paul Krugman.

The New York Times Publication object looks like this. Note the UUID in the "id" field:

{
"id": "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
"class": "Publication",
"properties": {
"name": "The New York Times"
},
"vector": [...]
}

The Paul Krugman Author object adds a new property, writesFor, to capture the relationship.

{
"id": "779c8970-0594-301c-bff5-d12907414002",
"class": "Author",
"properties": {
"name": "Paul Krugman",
...
"writesFor": [
{
"beacon": "weaviate://localhost/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
"href": "/v1/objects/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6"
}
]
},
"vector": [...]
}

The value of the beacon sub-property is the id value from the New York Times Publication object.

Cross-reference relationships are directional. To make the link bi-directional, update the Publication collection to add a hasAuthors property that points back to the Author collection.
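The beacon structure can be sketched in plain Python. The helpers `make_beacon` and `beacon_target_id` are hypothetical names, and the in-memory `publications` dictionary stands in for a real collection. The beacon and href formats are copied from the example above.

```python
def make_beacon(target_id: str, host: str = "localhost") -> dict:
    """Build a cross-reference entry that points at the object with target_id."""
    return {
        "beacon": f"weaviate://{host}/{target_id}",
        "href": f"/v1/objects/{target_id}",
    }


def beacon_target_id(reference: dict) -> str:
    """Extract the target object's id from the end of the beacon URI."""
    return reference["beacon"].rsplit("/", 1)[-1]


nyt_id = "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6"
publications = {nyt_id: {"name": "The New York Times"}}

author = {"name": "Paul Krugman", "writesFor": [make_beacon(nyt_id)]}

# Follow the reference from the author to the publication.
for reference in author["writesFor"]:
    print(publications[beacon_target_id(reference)]["name"])
```

The lookup only works from author to publication. Nothing on the publication side points back, which is what "directional" means here.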

Multiple vectors

Added in v1.24.0

Weaviate collections support multiple, named vectors.

Each vector is independent. Each vector space has its own index, its own compression, and its own vectorizer. This means you can create vectors for individual properties, use different vectorization models, and apply different metrics to the same object.

You do not have to use multiple vectors in your collections, but if you do, you need to adjust your queries to specify a target vector for vector or hybrid queries.
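The need for a target vector can be illustrated without a running instance. In this sketch the "vectors" key, the vector names, and the numbers are all made up for illustration. The point is only that a query must say which named vector it compares against.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical object with two named vectors (values are illustrative).
obj = {
    "properties": {"name": "Alice Munro"},
    "vectors": {
        "name_vector": [0.1, 0.9],
        "description_vector": [0.8, 0.2],
    },
}


def similarity(obj, query_vector, target_vector):
    """Compare the query against one named vector; the name is required."""
    if target_vector not in obj["vectors"]:
        raise KeyError(f"unknown target vector: {target_vector}")
    return cosine_similarity(obj["vectors"][target_vector], query_vector)


query = [0.1, 0.9]
print(similarity(obj, query, "name_vector"))
print(similarity(obj, query, "description_vector"))
```

The same query scores differently against each named vector, so a query without a target would be ambiguous.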

Weaviate Schema

Weaviate requires a data schema before you add data. However, you don't have to create a data schema manually. If you don't provide one, Weaviate generates a schema based on the incoming data.
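To give a feel for what generating a schema from incoming data involves, here is a simplified type-inference sketch. It is not Weaviate's implementation. The rules below (every JSON number becomes "number", and only two timestamp layouts count as "date") are assumptions made for illustration.

```python
from datetime import datetime

# Timestamp layouts this sketch recognizes (an illustrative choice).
DATE_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z")


def infer_data_type(value):
    """Guess a property data type from a JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(value, fmt)
                return "date"
            except ValueError:
                continue
        return "text"
    raise TypeError(f"unsupported value: {value!r}")


author = {
    "name": "Alice Munro",
    "age": 91,
    "born": "1931-07-10T00:00:00.0Z",
    "wonNobelPrize": True,
}

inferred = {key: infer_data_type(value) for key, value in author.items()}
print(inferred)
```

Inference like this depends on the first data it sees, which is one reason to define the schema explicitly when you need precise control over types.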

Weaviate's schema defines its data structure in a formal language. In other words, it is a blueprint of how to organize and store the data.

The schema defines data classes (i.e. collections of objects), the properties within each class (name, type, description, settings), possible graph links between data objects (cross-references), the vectorizer module (if any) to be used for the class, and settings such as index configurations.

Schema vs. Taxonomy

A Weaviate data schema is slightly different from a taxonomy. A taxonomy has a hierarchy. Read more about how taxonomies, ontologies and schemas are related in this Weaviate blog post.

Schemas fulfill several roles:

  1. Schemas define collections and properties.
  2. Schemas define cross-references that link collections, even collections that use different embeddings.
  3. Schemas let you configure module behavior, ANN index settings, inverted indexes, and other features on a collection level.

For details on configuring your schema, see the schema tutorial or schema configuration.

Multi-tenancy

Multi-tenancy availability
  • Multi-tenancy added in v1.20
  • The tenant activity status setting added in v1.21

To separate data within a cluster, use multi-tenancy. Weaviate partitions the cluster into shards. Each shard holds data for a single tenant.

Sharding has several benefits:

  • Data isolation
  • Fast, efficient querying
  • Easy and robust setup and clean up

Starting in v1.20, shards are more lightweight. You can easily have 50,000, or more, active shards per node. This means that you can support 1M concurrently active tenants with just 20 or so nodes.
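The node estimate is a simple division, sketched below. The helper name `nodes_needed` is hypothetical, and the calculation assumes tenants spread evenly across nodes.

```python
import math


def nodes_needed(active_tenants: int, shards_per_node: int) -> int:
    """Minimum node count if each active tenant occupies one shard."""
    return math.ceil(active_tenants / shards_per_node)


print(nodes_needed(1_000_000, 50_000))  # 20
```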

Starting in v1.21, you can specify tenants as active (HOT) or inactive (COLD). For more details on managing tenants, see Multi-tenancy operations.

Multi-tenancy is especially useful when you want to store data for multiple customers, or when you want to store data for multiple projects.

Tenancy and IDs

Each tenancy is like a namespace, so different tenants could, in theory, have objects with the same IDs. To avoid naming problems, object IDs in multi-tenant clusters combine the tenant ID and the object ID to create an ID that is unique across tenants.
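The namespace idea can be sketched with a dictionary keyed by a (tenant, object ID) pair. This only illustrates the concept. How Weaviate combines the two values internally is not described here, and the tenant names and properties are made up.

```python
store = {}


def put(tenant: str, object_id: str, properties: dict) -> None:
    """Store an object under a key that combines tenant and object ID."""
    store[(tenant, object_id)] = properties


def get(tenant: str, object_id: str) -> dict:
    """Fetch an object; the tenant is part of the lookup key."""
    return store[(tenant, object_id)]


shared_id = "779c8970-0594-301c-bff5-d12907414002"
put("tenantA", shared_id, {"name": "Object owned by tenant A"})
put("tenantB", shared_id, {"name": "Object owned by tenant B"})

print(get("tenantA", shared_id)["name"])
print(get("tenantB", shared_id)["name"])
```

Both tenants use the same object ID, yet the combined keys differ, so neither write overwrites the other.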

Tenancy and cross-references

Multi-tenancy supports some cross-references.

Cross-references like these are supported:

  • From a multi-tenancy object to a non-multi-tenancy object.
  • From a multi-tenancy object to another multi-tenancy object, as long as they belong to the same tenant.

Cross-references like these are not supported:

  • From a non-multi-tenancy object to a multi-tenancy object.
  • From a multi-tenancy object to another multi-tenancy object if they belong to different tenants.
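The four rules above can be condensed into one check. In this sketch, a tenant value of `None` stands for an object in a collection without multi-tenancy. The function name is hypothetical.

```python
from typing import Optional


def is_reference_supported(source_tenant: Optional[str],
                           target_tenant: Optional[str]) -> bool:
    """Apply the cross-reference rules; None means 'not a multi-tenancy object'."""
    if source_tenant is None:
        # A non-multi-tenancy source cannot point at a multi-tenancy object.
        # (None to None is an ordinary cross-reference, outside the lists above.)
        return target_tenant is None
    if target_tenant is None:
        # A multi-tenancy object may point at a non-multi-tenancy object.
        return True
    # Both are multi-tenancy objects: they must belong to the same tenant.
    return source_tenant == target_tenant


print(is_reference_supported("tenantA", None))       # True
print(is_reference_supported("tenantA", "tenantA"))  # True
print(is_reference_supported(None, "tenantA"))       # False
print(is_reference_supported("tenantA", "tenantB"))  # False
```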

Key features

  • Each tenant has a dedicated, high-performance vector index. Dedicated indexes mean faster query speeds. Instead of searching a shared index space, each tenant responds as if it were the only user on the cluster.
  • Each tenant's data is isolated on a dedicated shard. This means that deletes are fast and do not affect other tenants.
  • To scale out, add a new node to your cluster. Weaviate does not redistribute existing tenants. However, Weaviate adds new tenants to the node with the least resource usage.
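The placement behavior in the last point can be sketched as follows. The single usage number per node is a stand-in: the actual measure of resource usage is not specified here, and the node and tenant names are made up.

```python
def pick_node(usage_by_node: dict) -> str:
    """Return the node with the lowest usage figure."""
    return min(usage_by_node, key=usage_by_node.get)


# Existing placement and (illustrative) usage before scaling out.
assignments = {"tenantA": "node-1", "tenantB": "node-2"}
usage = {"node-1": 0.72, "node-2": 0.65}

# Scale out: the new node starts empty. Existing tenants stay where they are.
usage["node-3"] = 0.0

# A new tenant goes to the least-used node.
assignments["tenantC"] = pick_node(usage)
print(assignments["tenantC"])  # node-3
```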

Monitoring metrics

To group tenants together for monitoring, set PROMETHEUS_MONITORING_GROUP = true in your system configuration file.

Number of tenants per node

The number of tenants per node is limited by operating system constraints. The number of tenants cannot exceed the Linux open file limit per process.

For example, a 9-node test cluster built on n1-standard-8 machines holds around 170k active tenants. There are 18,000 to 19,000 tenants per node.

Note that these numbers relate to active tenants only. If you set unused tenants as inactive, the per-process open file limit does not apply to them.

Lazy shard loading

Added in v1.23

When Weaviate starts, it loads data from all of the shards in your deployment. This process can take a long time. Prior to v1.23, you had to wait until all of the shards were loaded before you could query your data. Since every tenant is a shard, multi-tenant deployments can have reduced availability after a restart.

Lazy shard loading allows you to start working with your data sooner. After a restart, shards load in the background. If the shard you want to query is already loaded, you can get your results sooner. If the shard is not loaded yet, Weaviate prioritizes loading that shard and returns a response when it is ready.

To enable lazy shard loading, set DISABLE_LAZY_LOAD_SHARDS = false in your system configuration file.
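The loading order can be illustrated with a single-threaded simulation. This is a sketch of the idea only: the class and method names are hypothetical, and real loading happens concurrently in the background rather than in explicit steps.

```python
from collections import deque


class LazyShardLoader:
    def __init__(self, shard_names):
        self.pending = deque(shard_names)  # shards still waiting to load
        self.loaded = set()                # shards ready for queries

    def load_next(self):
        """One background step: load the next pending shard, if any."""
        if self.pending:
            self.loaded.add(self.pending.popleft())

    def query(self, shard):
        """Answer a query, loading the shard ahead of the queue if needed."""
        if shard not in self.loaded:
            self.pending.remove(shard)  # raises ValueError for an unknown shard
            self.loaded.add(shard)
        return f"results from {shard}"


loader = LazyShardLoader(["tenantA", "tenantB", "tenantC"])
loader.load_next()               # background loading reaches tenantA
print(loader.query("tenantC"))   # tenantC is loaded on demand, ahead of tenantB
print(sorted(loader.loaded))
```

After these steps tenantB is still pending, while the queried shard was served without waiting for it.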

Tenant status

Added in v1.21

Tenants are HOT or COLD. Tenant status determines if Weaviate can access the shard.

Status | State    | Description
HOT    | Active   | Weaviate can read and write.
COLD   | Inactive | Weaviate cannot read or write. Access attempts return an error message.
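The effect of tenant status on access can be sketched like this. The names `TenantStatus`, `TenantNotActiveError`, and `read_tenant` are hypothetical, and the error text is illustrative rather than the message Weaviate returns.

```python
from enum import Enum


class TenantStatus(Enum):
    HOT = "HOT"    # active: reads and writes allowed
    COLD = "COLD"  # inactive: access attempts fail


class TenantNotActiveError(Exception):
    """Raised when a COLD tenant is accessed (illustrative error type)."""


def read_tenant(statuses: dict, tenant: str) -> str:
    """Return data for a HOT tenant; raise for a COLD one."""
    if statuses[tenant] is TenantStatus.COLD:
        raise TenantNotActiveError(f"tenant {tenant!r} is COLD (inactive)")
    return f"data for {tenant}"


statuses = {"tenantA": TenantStatus.HOT, "tenantB": TenantStatus.COLD}

print(read_tenant(statuses, "tenantA"))
try:
    read_tenant(statuses, "tenantB")
except TenantNotActiveError as err:
    print(err)
```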

Summary

  • The schema defines collections and properties.
  • Collections contain data objects that are described in JSON documents.
  • Data objects can contain a vector and properties.
  • Vectors come from machine learning models.
  • Different collections represent different vector spaces.
  • Cross-references link objects between collections.
  • Multi-tenancy isolates data for each tenant.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.