Data structure
Data object concepts
Each data object in Weaviate belongs to a collection and has one or more properties.
Weaviate stores data objects in class-based collections. Data objects are represented as JSON documents. Objects normally include a vector that is derived from a machine learning model. The vector is also called an embedding or a vector embedding.
Each collection contains objects of the same class. The objects are defined by a common schema.
Weaviate follows GraphQL naming conventions.
- Start collection names with an upper case letter.
- Start property names with a lower case letter.
If you use an initial upper case letter to define a property name, Weaviate changes it to a lower case letter internally.
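The naming rule can be sketched in a few lines of Python. The helper names below are hypothetical and only illustrate the convention described above; this is not Weaviate's internal code.

```python
def normalize_property_name(name: str) -> str:
    """Lower-case the first character, mirroring the rule described above."""
    return name[:1].lower() + name[1:]


def is_valid_collection_name(name: str) -> bool:
    """Check only the convention above: the first character is upper case."""
    return name[:1].isupper()


print(normalize_property_name("WonNobelPrize"))  # wonNobelPrize
print(is_valid_collection_name("Author"))        # True
```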
JSON documents as objects
Imagine we need to store information about an author named Alice Munro. In JSON format the data looks like this:
{
"name": "Alice Munro",
"age": 91,
"born": "1931-07-10T00:00:00.0Z",
"wonNobelPrize": true,
"description": "Alice Ann Munro is a Canadian short story writer who won the Nobel Prize in Literature in 2013. Munro's work has been described as revolutionizing the architecture of short stories, especially in its tendency to move forward and backward in time."
}
Vectors
You can also attach vector representations to your data objects. Vectors are arrays of numbers that are stored under the "vector" property.
In this example, the Alice Munro data object has a small vector. The vector represents some information about Alice, maybe a story or an image, that a machine learning model has transformed into an array of numerical values.
{
"id": "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
"class": "Author",
"properties": {
"name": "Alice Munro",
(...)
},
"vector": [
-0.16147631,
-0.065765485,
-0.06546908
]
}
To generate vectors for your data, use one of Weaviate's vectorizer modules. You can also use your own vectorizer.
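If you bring your own vector, the object can be assembled before it is sent to Weaviate. The following is a minimal sketch that uses only the Python standard library; `build_object` is a hypothetical helper, and the output simply mirrors the JSON shape shown above rather than any particular client API.

```python
import json


def build_object(collection: str, object_id: str, properties: dict, vector: list[float]) -> dict:
    """Assemble a data object in the JSON shape shown above."""
    return {
        "id": object_id,
        "class": collection,
        "properties": properties,
        # Vectors are plain arrays of numbers.
        "vector": [float(x) for x in vector],
    }


author = build_object(
    "Author",
    "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
    {"name": "Alice Munro"},
    [-0.16147631, -0.065765485, -0.06546908],
)
print(json.dumps(author, indent=2))
```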
Collections
Collections are groups of objects that share a schema definition.
In this example, the Author collection holds objects that represent different authors.
The collection looks like this:
[{
"id": "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
"class": "Author",
"properties": {
"name": "Alice Munro",
"age": 91,
"born": "1931-07-10T00:00:00.0Z",
"wonNobelPrize": true,
"description": "Alice Ann Munro is a Canadian short story writer who won the Nobel Prize in Literature in 2013. Munro's work has been described as revolutionizing the architecture of short stories, especially in its tendency to move forward and backward in time."
},
"vector": [
-0.16147631,
-0.065765485,
-0.06546908
]
}, {
"id": "779c8970-0594-301c-bff5-d12907414002",
"class": "Author",
"properties": {
"name": "Paul Krugman",
"age": 69,
"born": "1953-02-28T00:00:00.0Z",
"wonNobelPrize": true,
"description": "Paul Robin Krugman is an American economist and public intellectual, who is Distinguished Professor of Economics at the Graduate Center of the City University of New York, and a columnist for The New York Times. In 2008, Krugman was the winner of the Nobel Memorial Prize in Economic Sciences for his contributions to New Trade Theory and New Economic Geography."
},
"vector": [
-0.93070928,
-0.03782172,
-0.56288009
]
}]
Every collection has its own vector space. This means that different collections can have different embeddings of the same object.
UUIDs
Every object stored in Weaviate has a UUID. The UUID guarantees uniqueness across all collections.
You can use a deterministic UUID to ensure that the same object always has the same UUID. This is useful when you want to update an object without changing its UUID.
If you don't specify an ID, Weaviate generates a random UUID for you.
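One way to obtain a deterministic UUID is to derive a version 5 UUID from a stable key. The sketch below uses only Python's standard `uuid` module; the choice of namespace and the `collection/key` string format are assumptions made for illustration, so the values will not necessarily match those produced by other tools.

```python
import uuid


def deterministic_uuid(collection: str, key: str) -> str:
    """Derive a stable version 5 UUID from a collection name and a natural key."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{collection}/{key}"))


first = deterministic_uuid("Author", "Alice Munro")
second = deterministic_uuid("Author", "Alice Munro")
print(first == second)  # True: the same input always yields the same UUID

# For comparison, a random (version 4) UUID differs on every call.
random_id = str(uuid.uuid4())
```

Because the UUID depends only on the key, re-importing the same source record targets the same object instead of creating a duplicate.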
Unless a request specifies another ordering, Weaviate processes objects in ascending UUID order. This applies to requests that list objects, use the cursor API, or delete objects.
Cross-references
If data objects are related, use cross-references to represent the relationships. Cross-references in Weaviate are like links that help you retrieve related information. Cross-references capture relationships, but they do not change the vectors of the underlying objects.
To create a reference, use a property from one collection to specify the value of a related property in the other collection.
Cross-reference example
For example, "Paul Krugman writes for the New York Times" describes a relationship between Paul Krugman and the New York Times. To capture that relationship, create a cross-reference between the Publication object that represents the New York Times and the Author object that represents Paul Krugman.
The New York Times Publication object looks like this. Note the UUID in the "id" field:
{
"id": "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
"class": "Publication",
"properties": {
"name": "The New York Times"
},
"vector": [...]
}
The Paul Krugman Author object adds a new property, writesFor, to capture the relationship.
{
"id": "779c8970-0594-301c-bff5-d12907414002",
"class": "Author",
"properties": {
"name": "Paul Krugman",
...
"writesFor": [
{
"beacon": "weaviate://localhost/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
"href": "/v1/objects/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6"
}
],
},
"vector": [...]
}
The value of the beacon sub-property contains the id value from the New York Times Publication object.
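The relationship between the id and the reference entry can be sketched as follows. The helper names are hypothetical; the string formats are copied from the example above, with `localhost` as the default host.

```python
def make_reference(target_id: str, host: str = "localhost") -> dict:
    """Build a reference entry in the shape shown in the example above."""
    return {
        "beacon": f"weaviate://{host}/{target_id}",
        "href": f"/v1/objects/{target_id}",
    }


def target_id_from_beacon(beacon: str) -> str:
    """Recover the referenced object's id: the last path segment of the beacon."""
    return beacon.rsplit("/", 1)[-1]


nyt_id = "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6"
krugman_properties = {
    "name": "Paul Krugman",
    "writesFor": [make_reference(nyt_id)],
}
print(krugman_properties["writesFor"][0]["beacon"])
```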
Cross-reference relationships are directional. To make the link bi-directional, update the Publication collection to add a hasAuthors property that points back to the Author collection.
Multiple vectors (named vectors)
Weaviate collections support multiple named vectors.
The vectors in a collection can have their own configurations. Each vector space can set its own index, its own compression algorithm, and its own vectorizer. This means you can use different vectorization models, and apply different distance metrics, to the same object.
To work with named vectors, adjust your queries to specify a target vector for vector search or hybrid search queries.
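To illustrate the idea, the sketch below keeps two named vectors per object, each with its own distance metric, and selects one of them through a target vector name. It is a brute-force toy in plain Python, not Weaviate's index or query API; the vector names `title` and `body`, the `vectors` key, and the tiny two-dimensional embeddings are all made up for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def l2_squared(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical per-vector configuration: each named vector has its own metric.
VECTOR_CONFIG = {"title": cosine_distance, "body": l2_squared}

objects = [
    {"name": "A", "vectors": {"title": [1.0, 0.0], "body": [0.0, 1.0]}},
    {"name": "B", "vectors": {"title": [0.0, 1.0], "body": [1.0, 0.0]}},
]


def nearest(query: list[float], target_vector: str) -> dict:
    """Return the object closest to the query in the chosen vector space."""
    metric = VECTOR_CONFIG[target_vector]
    return min(objects, key=lambda o: metric(query, o["vectors"][target_vector]))


print(nearest([0.9, 0.1], "title")["name"])  # A
print(nearest([0.9, 0.1], "body")["name"])   # B
```

The same query returns different objects depending on the target vector, which is why a query against a collection with named vectors has to say which vector space it means.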
Data Schema
Weaviate requires a data schema before you add data. However, you don't have to create a data schema manually. If you don't provide one, Weaviate generates a schema based on the incoming data.
Weaviate's schema defines its data structure in a formal language. In other words, it is a blueprint of how to organize and store the data.
The schema defines data classes (i.e. collections of objects), the properties within each class (name, type, description, settings), possible graph links between data objects (cross-references), and class-level settings such as the vectorizer module (if any) and index configurations.
A Weaviate data schema is slightly different from a taxonomy. A taxonomy has a hierarchy. Read more about how taxonomies, ontologies and schemas are related in this Weaviate blog post.
Schemas fulfill several roles:
- Schemas define collections and properties.
- Schemas define cross-references that link collections, even collections that use different embeddings.
- Schemas let you configure module behavior, ANN index settings, reverse indexes, and other features on a collection level.
For details on configuring your schema, see the schema tutorial or schema configuration.
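As a rough illustration, a collection definition for the Author example might be written as the following Python dictionary. The `class` and `properties` keys follow the JSON samples earlier on this page; the `dataType` key, the type names, and the rule that a capitalized data type names a target collection are assumptions of this sketch, so check them against the schema configuration reference before relying on them.

```python
author_schema = {
    "class": "Author",
    "description": "A person who writes",
    "properties": [
        {"name": "name", "dataType": ["text"]},
        {"name": "age", "dataType": ["int"]},
        {"name": "born", "dataType": ["date"]},
        {"name": "wonNobelPrize", "dataType": ["boolean"]},
        {"name": "description", "dataType": ["text"]},
        # Assumed convention: a capitalized data type refers to another collection.
        {"name": "writesFor", "dataType": ["Publication"]},
    ],
}


def cross_reference_targets(schema: dict) -> dict[str, list[str]]:
    """Map each cross-reference property to the collections it points to."""
    targets = {}
    for prop in schema["properties"]:
        collections = [t for t in prop["dataType"] if t[:1].isupper()]
        if collections:
            targets[prop["name"]] = collections
    return targets


print(cross_reference_targets(author_schema))  # {'writesFor': ['Publication']}
```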
Multi-tenancy
- Multi-tenancy added in v1.20
To separate data within a cluster, use multi-tenancy. Weaviate partitions the cluster into shards. Each shard holds data for a single tenant.
Sharding has several benefits:
- Data isolation
- Fast, efficient querying
- Easy and robust setup and clean up
Tenant shards are lightweight. You can easily have 50,000 or more active shards per node. This means that you can support 1M concurrently active tenants with just 20 or so nodes.
Multi-tenancy is especially useful when you want to store data for multiple customers, or when you want to store data for multiple projects.
Deleting a tenant deletes the associated shard. As a result, deleting a tenant also deletes all of its objects.
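The one-shard-per-tenant model can be pictured with a small in-memory toy. This is only a conceptual sketch in plain Python; the class and method names are hypothetical and do not correspond to a Weaviate client API.

```python
class MultiTenantCollection:
    """Toy in-memory model: one shard (a plain dict) per tenant."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._shards: dict[str, dict[str, dict]] = {}

    def add_tenant(self, tenant: str) -> None:
        # Creating a tenant creates its dedicated shard.
        self._shards.setdefault(tenant, {})

    def add_object(self, tenant: str, object_id: str, properties: dict) -> None:
        # Raises KeyError if the tenant does not exist.
        self._shards[tenant][object_id] = properties

    def get_objects(self, tenant: str) -> dict[str, dict]:
        # A request only ever sees the shard of the tenant it names.
        return dict(self._shards[tenant])

    def delete_tenant(self, tenant: str) -> None:
        # Dropping the shard removes all of the tenant's objects at once.
        self._shards.pop(tenant, None)

    def tenants(self) -> list[str]:
        return sorted(self._shards)


articles = MultiTenantCollection("Article")
articles.add_tenant("customerA")
articles.add_tenant("customerB")
articles.add_object("customerA", "obj-1", {"title": "Hello"})
articles.add_object("customerB", "obj-1", {"title": "World"})

articles.delete_tenant("customerA")
print(articles.tenants())  # ['customerB']
```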
Tenant states
- Tenant activity status setting added in v1.21
- OFFLOADED status added in v1.26
Tenants have an activity status (also called a tenant state) that reflects their availability and storage location. A tenant can be ACTIVE, INACTIVE, OFFLOADED, OFFLOADING, or ONLOADING.
- ACTIVE tenants are loaded and available for read and write operations.
- In all other states, the tenant is not available for read or write access. Access attempts return an error message.
  - INACTIVE tenants are stored on local disk storage for quick activation.
  - OFFLOADED tenants are stored on cloud storage. This status is useful for long-term storage for tenants that are not frequently accessed.
  - OFFLOADING tenants are being moved to cloud storage. This is a transient status, and therefore not user-specifiable.
  - ONLOADING tenants are being loaded from cloud storage. This is a transient status, and therefore not user-specifiable. An ONLOADING tenant may be warming to an ACTIVE status or an INACTIVE status.
For more details on managing tenants, see Multi-tenancy operations.
Status | Available | Description | User-specifiable |
---|---|---|---|
ACTIVE | Yes | Loaded and available for read/write operations. | Yes |
INACTIVE | No | On local disk storage, no read / write access. Access attempts return an error message. | Yes |
OFFLOADED | No | On cloud storage, no read / write access. Access attempts return an error message. | Yes |
OFFLOADING | No | Being moved to cloud storage, no read / write access. Access attempts return an error message. | No |
ONLOADING | No | Being loaded from cloud storage, no read / write access. Access attempts return an error message. | No |
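The table can be encoded directly as data. The following Python sketch is illustrative only; the enum and helper names are hypothetical, and the rules are taken from the table above.

```python
from enum import Enum


class TenantStatus(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    OFFLOADED = "OFFLOADED"
    OFFLOADING = "OFFLOADING"
    ONLOADING = "ONLOADING"


# Statuses a user may request, per the table; the other two are transient.
USER_SPECIFIABLE = {
    TenantStatus.ACTIVE,
    TenantStatus.INACTIVE,
    TenantStatus.OFFLOADED,
}


def is_available(status: TenantStatus) -> bool:
    """Only ACTIVE tenants accept read and write operations."""
    return status is TenantStatus.ACTIVE


def is_user_specifiable(status: TenantStatus) -> bool:
    return status in USER_SPECIFIABLE


for status in TenantStatus:
    print(status.value, is_available(status), is_user_specifiable(status))
```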
In v1.26, the HOT status was renamed to ACTIVE and the COLD status was renamed to INACTIVE.
A tenant state change may take some time to propagate across a cluster, especially a multi-node cluster.
For example, data may not be immediately available after reactivating an offloaded tenant. Similarly, data may not be immediately unavailable after offloading a tenant. This is because the tenant states are eventually consistent, and the change must be propagated to all nodes in the cluster.
Offloaded tenants
As of Weaviate v1.26.0, tenants can only be offloaded to cold storage in AWS S3. Additional storage options may be added in future releases.
To offload a tenant, use the offload-s3 module.
Offloading tenants requires the relevant offload-<storage> module to be enabled in the Weaviate cluster.
When a tenant is offloaded, the entire tenant shard is moved to cloud storage. This is useful for long-term storage of tenants that are not frequently accessed. Offloaded tenants are not available for read or write operations until they are loaded back into the cluster.
Backups
Backups of multi-tenant collections only include active tenants, and not inactive or offloaded tenants. Activate tenants before creating a backup to ensure all data is included.
Tenancy and IDs
Each tenancy is like a namespace, so different tenants could, in theory, have objects with the same IDs. To avoid naming problems, object IDs in multi-tenant clusters combine the tenant ID and the object ID to create an ID that is unique across tenants.
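The namespace idea can be sketched with a composite key. How Weaviate combines the two IDs internally is not described here, so the tuple used below is purely an assumption for illustration, and `storage_key` is a hypothetical helper.

```python
def storage_key(tenant: str, object_id: str) -> tuple[str, str]:
    """Combine tenant and object ID so the pair is unique across tenants."""
    return (tenant, object_id)


same_id = "779c8970-0594-301c-bff5-d12907414002"
store = {}
store[storage_key("tenantA", same_id)] = {"name": "Object owned by tenant A"}
store[storage_key("tenantB", same_id)] = {"name": "Object owned by tenant B"}

# Both objects coexist even though their object IDs are identical.
print(len(store))  # 2
```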
Tenancy and cross-references
Multi-tenancy supports some cross-references.
Cross-references like these are supported:
- From a multi-tenancy object to a non-multi-tenancy object.
- From a multi-tenancy object to another multi-tenancy object, as long as they belong to the same tenant.
Cross-references like these are not supported:
- From a non-multi-tenancy object to a multi-tenancy object.
- From a multi-tenancy object to another multi-tenancy object if they belong to different tenants.
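These rules fit in a small predicate. In this sketch, `None` stands for an object in a collection without multi-tenancy; the function name is hypothetical. The case of two non-multi-tenancy objects is not listed above and is treated here as an ordinary, supported cross-reference.

```python
from typing import Optional


def is_reference_supported(source_tenant: Optional[str], target_tenant: Optional[str]) -> bool:
    """Apply the cross-reference rules above; None means no multi-tenancy."""
    if source_tenant is None:
        # A non-multi-tenancy object cannot point at a multi-tenancy object.
        return target_tenant is None
    if target_tenant is None:
        # Multi-tenancy object to non-multi-tenancy object is supported.
        return True
    # Both are multi-tenancy objects: they must belong to the same tenant.
    return source_tenant == target_tenant


print(is_reference_supported("tenantA", None))       # True
print(is_reference_supported("tenantA", "tenantA"))  # True
print(is_reference_supported(None, "tenantA"))       # False
print(is_reference_supported("tenantA", "tenantB"))  # False
```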
Key features
- Each tenant has a dedicated, high-performance vector index. Dedicated indexes mean faster query speeds. Instead of searching a shared index space, each tenant responds as if it was the only user on the cluster.
- Each tenant's data is isolated on a dedicated shard. This means that deletes are fast and do not affect other tenants.
- To scale out, add a new node to your cluster. Weaviate does not redistribute existing tenants, but it adds new tenants to the node with the least resource usage.
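The scale-out behavior in the last point can be sketched as a simple placement rule. How Weaviate measures resource usage is not specified here, so this example assumes a single usage number per node; the node names, usage values, and `pick_node` helper are all hypothetical.

```python
def pick_node(usage_by_node: dict[str, float]) -> str:
    """Choose the node with the least resource usage for a new tenant."""
    return min(usage_by_node, key=usage_by_node.get)


tenants_by_node = {"node-1": ["t1", "t2"], "node-2": ["t3"]}
usage_by_node = {"node-1": 0.7, "node-2": 0.4}

# Scale out: the new node starts empty, and existing tenants stay where they are.
tenants_by_node["node-3"] = []
usage_by_node["node-3"] = 0.0

target = pick_node(usage_by_node)
tenants_by_node[target].append("t4")
print(target)  # node-3
```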
Monitoring metrics
To group tenants together for monitoring, set PROMETHEUS_MONITORING_GROUP = true in your system configuration file.
Number of tenants per node
The number of tenants per node is limited by operating system constraints. The number of tenants cannot exceed the Linux open file limit per process.
For example, a 9-node test cluster built on n1-standard-8 machines holds around 170k active tenants, which is 18,000 to 19,000 tenants per node.
Note that these numbers relate to active tenants only. If you set unused tenants as inactive, the open file per process limit does not apply.
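To see the limit that applies on a given machine, you can read the per-process open file limit with Python's standard `resource` module (available on Unix-like systems only). The sketch also reproduces the per-node arithmetic from the example cluster. It does not estimate how many tenants a node can hold, because the number of files each tenant keeps open is not stated here.

```python
import resource  # Unix-only standard library module


def open_file_limits() -> tuple[int, int]:
    """Return the (soft, hard) open file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")

# Per-node figure from the example: about 170k active tenants on 9 nodes.
tenants_per_node = 170_000 / 9
print(round(tenants_per_node))  # 18889
```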
Summary
- The schema defines collections and properties.
- Collections contain data objects that are described in JSON documents.
- Data objects can contain a vector and properties.
- Vectors come from machine learning models.
- Different collections represent different vector spaces.
- Cross-references link objects between collections.
- Multi-tenancy isolates data for each tenant.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.