Data structure
Data object nomenclature
Each data object in Weaviate belongs to a collection and has one or more properties.
Weaviate stores data objects in class-based collections. Data objects are represented as JSON documents. Objects normally include a vector that is derived from a machine learning model. The vector is also called an embedding or a vector embedding.
Each collection contains objects of the same class. The objects are defined by a common schema.
Weaviate follows GraphQL naming conventions.
- Start collection names with an upper case letter.
- Start property names with a lower case letter.
If you use an initial upper case letter to define a property name, Weaviate changes it to a lower case letter internally.
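The naming rules above can be sketched in a few lines of Python. The helper names are hypothetical and not part of any Weaviate client. The property helper mirrors the lower-casing described above; applying the matching upper-casing to collection names is an assumption made for symmetry.

```python
def normalize_collection_name(name: str) -> str:
    """Upper-case the first character only (assumed helper, for symmetry)."""
    return name[:1].upper() + name[1:]


def normalize_property_name(name: str) -> str:
    """Lower-case the first character only, leaving the rest unchanged."""
    return name[:1].lower() + name[1:]


print(normalize_collection_name("author"))       # Author
print(normalize_property_name("WonNobelPrize"))  # wonNobelPrize
```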
JSON documents as objects
Imagine we need to store information about an author named Alice Munro. In JSON format the data looks like this:
{
  "name": "Alice Munro",
  "age": 91,
  "born": "1931-07-10T00:00:00.0Z",
  "wonNobelPrize": true,
  "description": "Alice Ann Munro is a Canadian short story writer who won the Nobel Prize in Literature in 2013. Munro's work has been described as revolutionizing the architecture of short stories, especially in its tendency to move forward and backward in time."
}
Vectors
You can also attach vector representations to your data objects. Vectors are arrays of numbers that are stored under the "vector" property.
In this example, the Alice Munro data object has a small vector. The vector is some information about Alice, maybe a story or an image, that a machine learning model has transformed into an array of numerical values.
{
  "id": "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
  "class": "Author",
  "properties": {
    "name": "Alice Munro",
    (...)
  },
  "vector": [
    -0.16147631,
    -0.065765485,
    -0.06546908
  ]
}
To generate vectors for your data, use one of Weaviate's vectorizer modules. You can also use your own vectorizer.
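If you bring your own vectors, assembling an object in the shape shown above might look like the following minimal sketch. It uses only the Python standard library; `build_object` is a hypothetical helper, and the random `uuid4` ID is an assumption (the IDs in the examples on this page were generated differently).

```python
import json
import uuid


def build_object(collection: str, properties: dict, vector: list[float]) -> dict:
    """Assemble a data object with the id/class/properties/vector layout."""
    return {
        "id": str(uuid.uuid4()),  # assumption: any valid UUID works as an ID
        "class": collection,
        "properties": properties,
        "vector": vector,  # produced by your own model, outside this sketch
    }


alice = build_object(
    "Author",
    {"name": "Alice Munro", "wonNobelPrize": True},
    [-0.16147631, -0.065765485, -0.06546908],
)
print(json.dumps(alice, indent=2))
```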
Collections
Collections are groups of objects that share a schema definition.
In this example, the Author collection holds objects that represent different authors.
The collection looks like this:
[{
  "id": "dedd462a-23c8-32d0-9412-6fcf9c1e8149",
  "class": "Author",
  "properties": {
    "name": "Alice Munro",
    "age": 91,
    "born": "1931-07-10T00:00:00.0Z",
    "wonNobelPrize": true,
    "description": "Alice Ann Munro is a Canadian short story writer who won the Nobel Prize in Literature in 2013. Munro's work has been described as revolutionizing the architecture of short stories, especially in its tendency to move forward and backward in time."
  },
  "vector": [
    -0.16147631,
    -0.065765485,
    -0.06546908
  ]
}, {
  "id": "779c8970-0594-301c-bff5-d12907414002",
  "class": "Author",
  "properties": {
    "name": "Paul Krugman",
    "age": 69,
    "born": "1953-02-28T00:00:00.0Z",
    "wonNobelPrize": true,
    "description": "Paul Robin Krugman is an American economist and public intellectual, who is Distinguished Professor of Economics at the Graduate Center of the City University of New York, and a columnist for The New York Times. In 2008, Krugman was the winner of the Nobel Memorial Prize in Economic Sciences for his contributions to New Trade Theory and New Economic Geography."
  },
  "vector": [
    -0.93070928,
    -0.03782172,
    -0.56288009
  ]
}]
Every collection has its own vector space. This means that different collections can have different embeddings of the same object.
Every object stored in Weaviate has a UUID. The UUID guarantees uniqueness across all collections.
Cross-references
If data objects are related, use cross-references to represent the relationships. Cross-references in Weaviate are like links that help you retrieve related information. Cross-references capture relationships, but they do not change the vectors of the underlying objects.
To create a reference, use a property from one collection to specify the value of a related property in the other collection.
Cross-reference example
For example, "Paul Krugman writes for the New York Times" describes a relationship between Paul Krugman and the New York Times. To capture that relationship, create a cross-reference between the Publication object that represents the New York Times and the Author object that represents Paul Krugman.
The New York Times Publication object looks like this. Note the UUID in the "id" field:
{
  "id": "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
  "class": "Publication",
  "properties": {
    "name": "The New York Times"
  },
  "vector": [...]
}
The Paul Krugman Author object adds a new property, writesFor, to capture the relationship.
{
  "id": "779c8970-0594-301c-bff5-d12907414002",
  "class": "Author",
  "properties": {
    "name": "Paul Krugman",
    ...
    "writesFor": [
      {
        "beacon": "weaviate://localhost/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
        "href": "/v1/objects/32d5a368-ace8-3bb7-ade7-9f7ff03eddb6"
      }
    ]
  },
  "vector": [...]
}
The value of the beacon sub-property is the id value from the New York Times Publication object.
Cross-reference relationships are directional. To make the link bi-directional, update the Publication collection to add a hasAuthors property that points back to the Author collection.
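The two directional links can be sketched with plain dictionaries. This is an illustration only: `make_beacon` and `add_reference` are hypothetical helpers, not Weaviate client functions. The beacon and href formats are copied from the example above.

```python
def make_beacon(object_id: str, host: str = "localhost") -> str:
    """Build a beacon string in the weaviate://<host>/<uuid> form shown above."""
    return f"weaviate://{host}/{object_id}"


def add_reference(source: dict, prop: str, target_id: str) -> None:
    """Append a one-way reference from source to the object with target_id."""
    refs = source["properties"].setdefault(prop, [])
    refs.append({
        "beacon": make_beacon(target_id),
        "href": f"/v1/objects/{target_id}",
    })


nyt = {
    "id": "32d5a368-ace8-3bb7-ade7-9f7ff03eddb6",
    "class": "Publication",
    "properties": {"name": "The New York Times"},
}
krugman = {
    "id": "779c8970-0594-301c-bff5-d12907414002",
    "class": "Author",
    "properties": {"name": "Paul Krugman"},
}

add_reference(krugman, "writesFor", nyt["id"])   # Author -> Publication
add_reference(nyt, "hasAuthors", krugman["id"])  # Publication -> Author

print(krugman["properties"]["writesFor"][0]["beacon"])
```

Each call creates a link in one direction only, which is why the bi-directional case needs two properties and two calls.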
Multiple vectors
Weaviate collections support multiple, named vectors. Each vector is independent. Each vector space has its own index, its own compression, and its own vectorizer. This means you can use different vectorization models and apply different distance metrics to the same object.
If you use multiple vectors, adjust your queries to specify a target vector for vector or hybrid queries.
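As a rough illustration of why a query needs a target vector, the sketch below stores two named vectors per object in plain Python and searches one of them at a time. The vector names (`bio`, `photo`), the toy two-dimensional values, and the `nearest` helper are all made up for this example. It does not use the Weaviate client or reflect how Weaviate indexes vectors.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Toy data: each object carries two independent, named vectors.
objects = [
    {"name": "Alice Munro", "vectors": {"bio": [1.0, 0.0], "photo": [0.0, 1.0]}},
    {"name": "Paul Krugman", "vectors": {"bio": [0.0, 1.0], "photo": [1.0, 0.0]}},
]


def nearest(query: list[float], target_vector: str) -> str:
    """Return the name of the object closest to query in the chosen vector space."""
    best = min(objects, key=lambda o: cosine_distance(query, o["vectors"][target_vector]))
    return best["name"]


print(nearest([0.9, 0.1], "bio"))    # Alice Munro
print(nearest([0.9, 0.1], "photo"))  # Paul Krugman
```

The same query returns different objects depending on the target vector, because each named vector lives in its own space.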
Weaviate Schema
Weaviate requires a data schema before you add data. However, you don't have to create a data schema manually. If you don't provide one, Weaviate generates a schema based on the incoming data.
Weaviate's schema defines its data structure in a formal language. In other words, it is a blueprint of how to organize and store the data.
The schema defines data classes (i.e. collections of objects), the properties within each class (name, type, description, settings), possible graph links between data objects (cross-references), and class-level settings such as the vectorizer module (if any) and index configurations.
A Weaviate data schema is slightly different from a taxonomy. A taxonomy has a hierarchy. Read more about how taxonomies, ontologies and schemas are related in this Weaviate blog post.
Schemas fulfill several roles:
- Schemas define collections and properties.
- Schemas define cross-references that link collections, even collections that use different embeddings.
- Schemas let you configure module behavior, ANN index settings, reverse indexes, and other features on a collection level.
For details on configuring your schema, see the schema tutorial or schema configuration.
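To make the blueprint idea concrete, here is a hedged sketch of what a definition for the Author collection might look like, written as a Python dictionary. The key names (`class`, `vectorizer`, `properties`, `dataType`) and the type names are recalled from Weaviate's JSON schema format and should be treated as assumptions to verify against the schema configuration reference. `follows_naming_conventions` is a hypothetical helper.

```python
import json

# Assumed layout of a collection definition; verify keys and types before use.
author_collection = {
    "class": "Author",
    "vectorizer": "none",  # assumption: vectors are supplied by the caller
    "properties": [
        {"name": "name", "dataType": ["text"]},
        {"name": "age", "dataType": ["int"]},
        {"name": "born", "dataType": ["date"]},
        {"name": "wonNobelPrize", "dataType": ["boolean"]},
        {"name": "description", "dataType": ["text"]},
        # A cross-reference names the target collection as its data type.
        {"name": "writesFor", "dataType": ["Publication"]},
    ],
}


def follows_naming_conventions(collection: dict) -> bool:
    """Check upper-case collection initial and lower-case property initials."""
    class_ok = collection["class"][:1].isupper()
    props_ok = all(p["name"][:1].islower() for p in collection["properties"])
    return class_ok and props_ok


print(follows_naming_conventions(author_collection))  # True
print(json.dumps(author_collection, indent=2))
```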
Multi-tenancy
- Multi-tenancy added in v1.20
To separate data within a cluster, use multi-tenancy. Weaviate partitions the cluster into shards. Each shard holds data for a single tenant.
Sharding has several benefits:
- Data isolation
- Fast, efficient querying
- Easy and robust setup and clean up
Tenant shards are lightweight. You can easily have 50,000 or more active shards per node. This means that you can support 1M concurrently active tenants with just 20 or so nodes.
Multi-tenancy is especially useful when you want to store data for multiple customers, or when you want to store data for multiple projects.
Tenant status
- Tenant activity status setting added in v1.21
- OFFLOADED status added in v1.26
Tenants can be ACTIVE, INACTIVE, OFFLOADED, OFFLOADING, or ONLOADING.
- ACTIVE tenants are loaded and available for read and write operations.
- In all other states, the tenant is not available for read or write access. Access attempts return an error message.
  - INACTIVE tenants are stored on local disk storage for quick activation.
  - OFFLOADED tenants are stored on cloud storage. This status is useful for long-term storage for tenants that are not frequently accessed.
  - OFFLOADING tenants are being moved to cloud storage. This is a transient status, and therefore not user-specifiable.
  - ONLOADING tenants are being loaded from cloud storage. This is a transient status, and therefore not user-specifiable. An ONLOADING tenant may be being warmed to an ACTIVE status or an INACTIVE status.
For more details on managing tenants, see Multi-tenancy operations.
Status | Available | Description | User-specifiable |
---|---|---|---|
ACTIVE | Yes | Loaded and available for read/write operations. | Yes |
INACTIVE | No | On local disk storage, no read / write access. Access attempts return an error message. | Yes |
OFFLOADED | No | On cloud storage, no read / write access. Access attempts return an error message. | Yes |
OFFLOADING | No | Being moved to cloud storage, no read / write access. Access attempts return an error message. | No |
ONLOADING | No | Being loaded from cloud storage, no read / write access. Access attempts return an error message. | No |
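The table can be modeled as a small enumeration. This is a sketch for reasoning about the states, not Weaviate client code: `TenantStatus`, `is_available`, and `read_tenant` are hypothetical names, and the `RuntimeError` stands in for whatever error Weaviate actually returns.

```python
from enum import Enum


class TenantStatus(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    OFFLOADED = "OFFLOADED"
    OFFLOADING = "OFFLOADING"
    ONLOADING = "ONLOADING"


# Statuses a user may set directly; the two transient ones are not settable.
USER_SPECIFIABLE = {
    TenantStatus.ACTIVE,
    TenantStatus.INACTIVE,
    TenantStatus.OFFLOADED,
}


def is_available(status: TenantStatus) -> bool:
    """Only ACTIVE tenants accept read and write operations."""
    return status is TenantStatus.ACTIVE


def read_tenant(status: TenantStatus) -> str:
    """Hypothetical access check that fails for every non-ACTIVE status."""
    if not is_available(status):
        raise RuntimeError(f"tenant is {status.value}: no read or write access")
    return "ok"


for status in TenantStatus:
    print(status.value, is_available(status), status in USER_SPECIFIABLE)
```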
In v1.26, the HOT status was renamed to ACTIVE and the COLD status was renamed to INACTIVE.
Offloaded tenants
Frozen tenants, also called "offloaded" tenants, were introduced in Weaviate v1.26.0. Offloading requires the relevant offload-<storage> module to be enabled in the Weaviate cluster.
As of Weaviate v1.26.0, only S3-compatible cloud storage is supported for OFFLOADED tenants through the offload-s3 module. Additional storage options may be added in future releases.
Tenancy and IDs
Each tenancy is like a namespace, so different tenants could, in theory, have objects with the same IDs. To avoid naming problems, object IDs in multi-tenant clusters combine the tenant ID and the object ID to create an ID that is unique across tenants.
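The namespace idea can be illustrated with a dictionary keyed by a (tenant, object ID) pair. This is only a conceptual sketch: the exact way Weaviate combines the two IDs internally is not described here, and `put_object` and `get_object` are hypothetical helpers.

```python
# (tenant name, object ID) -> properties; a stand-in for tenant-scoped storage
store: dict[tuple[str, str], dict] = {}


def put_object(tenant: str, object_id: str, properties: dict) -> None:
    """Store an object under the combined tenant and object ID key."""
    store[(tenant, object_id)] = properties


def get_object(tenant: str, object_id: str) -> dict:
    """Look up an object within one tenant's namespace."""
    return store[(tenant, object_id)]


shared_id = "779c8970-0594-301c-bff5-d12907414002"
put_object("tenantA", shared_id, {"name": "Paul Krugman"})
put_object("tenantB", shared_id, {"name": "Alice Munro"})

print(get_object("tenantA", shared_id)["name"])  # Paul Krugman
print(get_object("tenantB", shared_id)["name"])  # Alice Munro
print(len(store))                                # 2
```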
Tenancy and cross-references
Multi-tenancy supports some cross-references.
Cross-references like these are supported:
- From a multi-tenancy object to a non-multi-tenancy object.
- From a multi-tenancy object to another multi-tenancy object, as long as they belong to the same tenant.
Cross-references like these are not supported:
- From a non-multi-tenancy object to a multi-tenancy object.
- From a multi-tenancy object to another multi-tenancy object if they belong to different tenants.
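The four rules above reduce to a short predicate. In this sketch a tenant of `None` stands for an object in a collection without multi-tenancy. The function name is hypothetical, and treating a reference between two non-multi-tenancy objects as allowed is an assumption based on ordinary cross-references.

```python
from typing import Optional


def cross_reference_allowed(source_tenant: Optional[str],
                            target_tenant: Optional[str]) -> bool:
    """Return True if a reference from source to target is supported."""
    if source_tenant is None:
        # A non-multi-tenancy source may not point at a multi-tenancy object.
        return target_tenant is None
    # A multi-tenancy source may point at a non-multi-tenancy object
    # or at an object that belongs to the same tenant.
    return target_tenant is None or target_tenant == source_tenant


print(cross_reference_allowed("tenantA", None))       # True
print(cross_reference_allowed("tenantA", "tenantA"))  # True
print(cross_reference_allowed(None, "tenantA"))       # False
print(cross_reference_allowed("tenantA", "tenantB"))  # False
```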
Key features
- Each tenant has a dedicated, high-performance vector index. Dedicated indexes mean faster query speeds. Instead of searching a shared index space, each tenant responds as if it were the only user on the cluster.
- Each tenant's data is isolated on a dedicated shard. This means that deletes are fast and do not affect other tenants.
- To scale out, add a new node to your cluster. Weaviate does not redistribute existing tenants; however, it adds new tenants to the node with the least resource usage.
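The scale-out behavior in the last point can be pictured with a toy placement function. This is a simplified sketch: the node names, the single usage number per node, and `pick_node` are all assumptions, and the real resource metric Weaviate uses is not specified here.

```python
def pick_node(node_usage: dict[str, float]) -> str:
    """Return the node with the lowest usage figure."""
    return min(node_usage, key=node_usage.get)


usage = {"node-1": 0.72, "node-2": 0.65}
placement = {"tenantA": "node-1", "tenantB": "node-2"}

# Scale out: the new node starts empty. Existing tenants stay where they are.
usage["node-3"] = 0.0

# A new tenant lands on the least-used node.
placement["tenantC"] = pick_node(usage)
print(placement)
```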
Monitoring metrics
To group tenants together for monitoring, set PROMETHEUS_MONITORING_GROUP = true in your system configuration file.
Number of tenants per node
The number of tenants per node is limited by operating system constraints. The number of tenants cannot exceed the Linux open file limit per process.
For example, a 9-node test cluster built on n1-standard-8 machines holds around 170k active tenants. There are 18,000 to 19,000 tenants per node.
Note that these numbers relate to active tenants only. If you set unused tenants as inactive, the open file per process limit does not apply.
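The capacity figures on this page can be checked with simple arithmetic. The sketch below only restates numbers already given (50,000 active shards per node, 1M tenants, and 170k tenants on 9 nodes); `nodes_needed` is a hypothetical helper, and real limits depend on your operating system settings.

```python
import math


def nodes_needed(active_tenants: int, tenants_per_node: int) -> int:
    """Return the smallest node count that fits all active tenants."""
    return math.ceil(active_tenants / tenants_per_node)


print(nodes_needed(1_000_000, 50_000))  # 20
print(round(170_000 / 9))               # 18889
```

The second figure falls inside the 18,000 to 19,000 tenants-per-node range quoted for the test cluster.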
Summary
- The schema defines collections and properties.
- Collections contain data objects that are described in JSON documents.
- Data objects can contain a vector and properties.
- Vectors come from machine learning models.
- Different collections represent different vector spaces.
- Cross-references link objects between collections.
- Multi-tenancy isolates data for each tenant.