Consistency

Replication factor in Weaviate determines how many copies of shards (also called replicas) will be stored across a Weaviate cluster.

Replication factor

When the replication factor is > 1, consistency models balance the system's reliability, scalability, and/or performance requirements.

Weaviate uses two consistency models: one for its cluster metadata and another for its data objects.

Consistency models in Weaviate

Weaviate uses the Raft consensus algorithm for cluster metadata replication. Cluster metadata in this context includes the collection definitions and tenant activity statuses. This allows cluster metadata updates to occur even when some nodes are down.

Data objects are replicated using a leaderless design with tunable consistency levels. Data operations can therefore be tuned to be more consistent or more available, depending on the desired tradeoff.

These designs reflect the trade-off between consistency and availability that is described in the CAP Theorem.

Rule of thumb on consistency

The strength of consistency can be determined by applying the following conditions:

  • If r + w > n, then the system is strongly consistent.
    • r is the consistency level of read operations
    • w is the consistency level of write operations
    • n is the replication factor (number of replicas)
  • If r + w <= n, then eventual consistency is the best that can be reached in this scenario.
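The rule of thumb above can be written as a small check. This is an illustrative sketch only: `consistency_strength` is a hypothetical helper, not part of Weaviate, and it assumes r, w, and n are given as plain replica counts.

```python
def consistency_strength(r: int, w: int, n: int) -> str:
    """Classify a read/write pairing using the r + w > n rule of thumb.

    r and w are the number of replicas that must respond to a read or
    acknowledge a write; n is the replication factor.
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return "strong" if r + w > n else "eventual"


# With a replication factor of 3:
print(consistency_strength(r=2, w=2, n=3))  # strong   (2 + 2 > 3)
print(consistency_strength(r=1, w=1, n=3))  # eventual (1 + 1 <= 3)
```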

Cluster metadata

From v1.25, Weaviate uses the Raft consensus algorithm for cluster metadata replication. Raft is a consensus algorithm with an elected leader node that coordinates replication across the cluster using a log-based approach.

As a result, each request that changes the cluster metadata will be sent to the leader node. The leader node will apply the change to its logs, then propagate the changes to the follower nodes. Once a quorum of nodes has acknowledged the cluster metadata change, the leader node will commit the change and confirm it to the client.

This architecture ensures that cluster metadata changes are consistent across the cluster, even in the event of (a minority of) node failures.
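The quorum condition described above can be illustrated with a toy model. This sketch covers only the "has a majority acknowledged?" check; it is not an implementation of Raft (no terms, leader election, or log matching), and `can_commit` is a hypothetical name.

```python
def can_commit(acks: list[bool]) -> bool:
    """Return True if the leader may commit a cluster metadata change.

    `acks` holds one entry per node in the cluster (leader included);
    True means that node has acknowledged the change.
    """
    quorum = len(acks) // 2 + 1  # strict majority of the cluster
    return sum(acks) >= quorum


# Five-node cluster, two nodes down: three acknowledgements form a quorum.
print(can_commit([True, True, True, False, False]))   # True
# Three of five nodes down: no quorum, so the change is not committed.
print(can_commit([True, True, False, False, False]))  # False
```

This is why a minority of node failures does not block cluster metadata updates, while a majority of failures does.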

Pre-v1.25 cluster metadata consensus algorithm

Prior to using Raft, a cluster metadata update was done via a distributed transaction algorithm, that is, a set of operations carried out across databases on different nodes in the distributed network. Weaviate used a two-phase commit (2PC) protocol, which replicated the cluster metadata updates in a short period of time (milliseconds).

A clean execution (one without failures) has two phases:

  1. The commit-request phase (or voting phase), in which a coordinator node asks each node whether it is able to receive and process the update.
  2. The commit phase, in which the coordinator commits the changes to the nodes.
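The two phases can be sketched as follows. This is a generic, in-memory illustration of two-phase commit, not Weaviate's pre-v1.25 code; the `Node` class, its `healthy` flag, and the `two_phase_commit` function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical participant holding a copy of the cluster metadata."""
    name: str
    healthy: bool = True
    metadata: dict = field(default_factory=dict)

    def can_commit(self, update: dict) -> bool:
        # Phase 1 vote. A real node would also validate the update itself.
        return self.healthy

    def commit(self, update: dict) -> None:
        # Phase 2: apply the update locally.
        self.metadata.update(update)


def two_phase_commit(nodes: list[Node], update: dict) -> bool:
    # Phase 1 (commit-request / voting): every node must vote yes.
    if not all(node.can_commit(update) for node in nodes):
        return False  # abort: nothing is applied on any node
    # Phase 2 (commit): the coordinator commits the change on every node.
    for node in nodes:
        node.commit(update)
    return True


nodes = [Node("node-1"), Node("node-2"), Node("node-3")]
print(two_phase_commit(nodes, {"Article": "definition-v2"}))  # True

nodes[1].healthy = False
print(two_phase_commit(nodes, {"Article": "definition-v3"}))  # False
```

Note the contrast with the Raft-based approach: in this sketch a single unavailable node is enough to abort the update.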

Data objects

Weaviate uses two-phase commits for objects, adjusted for the consistency level. For example, for a QUORUM write (see below) with 5 nodes, 3 requests will be sent out, each of them using a two-phase commit under the hood.

As a result, data objects in Weaviate are eventually consistent. Eventual consistency provides BASE semantics:

  • Basically available: reading and writing operations are as available as possible
  • Soft-state: there are no consistency guarantees since updates might not yet have converged
  • Eventually consistent: if the system functions long enough, after some writes, all nodes will be consistent.

Weaviate uses eventual consistency to improve availability. Read and write consistency are tunable, so you can trade off availability against consistency to match your application's needs.

The animation below shows how a write or a read is performed in Weaviate with a replication factor of 3 in a cluster of 8 nodes. The blue node acts as the coordinator node. The consistency level is set to QUORUM, so the coordinator node waits for only two of the three responses before sending the result back to the client.

Write consistency QUORUM

Tunable write consistency

Adding or changing data objects are write operations.

Note: Write operations are tunable starting with Weaviate v1.18, to ONE, QUORUM (default) or ALL. In v1.17, write operations are always set to ALL (highest consistency).

Configurable write consistency was introduced in v1.18 because that release also introduced automatic repairs. A write is always sent to all n (replication factor) replica nodes, regardless of the chosen consistency level. The consistency level only determines whether the coordinator node waits for acknowledgements from ONE, QUORUM or ALL nodes before it returns. In v1.17, where repairs on read requests were not available, write consistency was fixed at ALL to guarantee that a write was applied everywhere. Possible settings in v1.18+ are:

  • ONE - a write must receive an acknowledgement from at least one replica node. This is the fastest (most available), but least consistent option.
  • QUORUM - a write must receive an acknowledgement from at least QUORUM replica nodes. QUORUM is calculated as n / 2 + 1 (using integer division, so the result of n / 2 is rounded down), where n is the number of replicas (replication factor). For example, using a replication factor of 6, the quorum is 4, which means the cluster can tolerate 2 replicas down.
  • ALL - a write must receive an acknowledgement from all replica nodes. This is the most consistent, but 'slowest' (least available) option.
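The three settings can be compared with a short calculation. This is a sketch with hypothetical helper names (`required_acks`, `tolerated_failures`); it only reproduces the arithmetic described above.

```python
def required_acks(level: str, n: int) -> int:
    """Acknowledgements the coordinator waits for, given replication factor n."""
    if n < 1:
        raise ValueError("replication factor must be at least 1")
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return n // 2 + 1  # integer division: n / 2 rounded down, plus 1
    if level == "ALL":
        return n
    raise ValueError(f"unknown consistency level: {level}")


def tolerated_failures(level: str, n: int) -> int:
    """Replicas that can be down while the operation still succeeds."""
    return n - required_acks(level, n)


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, 6), tolerated_failures(level, 6))
# ONE 1 5
# QUORUM 4 2
# ALL 6 0
```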

Figure below: a replicated Weaviate setup with write consistency of ONE. There are 8 nodes in total, 3 of which hold replicas.

Write consistency ONE

Figure below: a replicated Weaviate setup with write consistency of QUORUM (n / 2 + 1). There are 8 nodes in total, 3 of which hold replicas.

Write consistency QUORUM

Figure below: a replicated Weaviate setup with write consistency of ALL. There are 8 nodes in total, 3 of which hold replicas.

Write consistency ALL

Tunable read consistency

Read operations are GET requests to data objects in Weaviate. Like write consistency, read consistency is tunable to ONE, QUORUM (default) or ALL.

Note: Prior to v1.18, read consistency was tunable only for requests that obtained an object by id, and all other read requests had a consistency of ALL.

Consistency levels are applicable to REST endpoint operations starting with v1.18, and to GraphQL Get requests starting with v1.19. The following consistency levels are applicable to most read operations:

  • ONE - a read response must be returned by at least one replica. This is the fastest (most available), but least consistent option.
  • QUORUM - a read response must be returned by at least QUORUM replica nodes. QUORUM is calculated as n / 2 + 1 (using integer division, so the result of n / 2 is rounded down), where n is the number of replicas (replication factor). For example, using a replication factor of 6, the quorum is 4, which means the cluster can tolerate 2 replicas down.
  • ALL - a read response must be returned by all replicas. The read operation will fail if at least one replica fails to respond. This is the most consistent, but 'slowest' (least available) option.

Examples:

  • ONE
    In a single datacenter with a replication factor of 3 and a read consistency level of ONE, the coordinator node will wait for a response from one replica node.

    Write consistency ONE

  • QUORUM
    In a single datacenter with a replication factor of 3 and a read consistency level of QUORUM, the coordinator node will wait for n / 2 + 1 = 3 / 2 + 1 = 2 replica nodes to return a response (3 / 2 is rounded down to 1).

    Write consistency QUORUM

  • ALL
    In a single datacenter with a replication factor of 3 and a read consistency level of ALL, the coordinator node will wait for all 3 replica nodes to return a response.

    Write consistency ALL
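The examples above can be simulated with a minimal read coordinator. This is a sketch, not Weaviate's implementation: `coordinate_read` and `InsufficientReplicasError` are hypothetical names, and a replica that does not respond is modeled simply as `None`.

```python
from typing import Optional


class InsufficientReplicasError(RuntimeError):
    """Hypothetical error: too few replicas responded for the chosen level."""


def coordinate_read(responses: list[Optional[dict]], level: str) -> list[dict]:
    """Return the responses the coordinator waited for.

    `responses` has one entry per replica; None means no response.
    """
    n = len(responses)
    needed = {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]
    received = [r for r in responses if r is not None]
    if len(received) < needed:
        raise InsufficientReplicasError(
            f"{level} read needs {needed} responses, got {len(received)}"
        )
    return received[:needed]


# Replication factor 3, one replica not responding.
replicas = [{"id": "a", "title": "hello"}, None, {"id": "a", "title": "hello"}]
print(len(coordinate_read(replicas, "ONE")))     # 1
print(len(coordinate_read(replicas, "QUORUM")))  # 2
try:
    coordinate_read(replicas, "ALL")
except InsufficientReplicasError as err:
    print("ALL failed:", err)
```

With one of three replicas unavailable, ONE and QUORUM reads still succeed, while an ALL read fails.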

Tunable consistency strategies

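Depending on the desired tradeoff between consistency and speed, below are three common consistency level pairings for write / read operations. These are minimum requirements for the data to converge to a consistent state: each pairing satisfies r + w > n from the rule of thumb above, so every read reaches at least one replica that acknowledged the latest write: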

  • QUORUM / QUORUM => balanced write and read latency
  • ONE / ALL => fast write and slow read (optimized for write)
  • ALL / ONE => slow write and fast read (optimized for read)
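To see how the pairings relate to the r + w > n rule of thumb, the sketch below enumerates every write / read combination for a given replication factor. The helper names are hypothetical, and QUORUM is taken as n / 2 + 1 with integer division.

```python
LEVELS = ("ONE", "QUORUM", "ALL")


def acks(level: str, n: int) -> int:
    """Number of replicas that must respond for a consistency level."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]


def reads_overlap_writes(write_level: str, read_level: str, n: int) -> bool:
    """True if w + r > n, so a read always reaches a replica that acknowledged the write."""
    return acks(write_level, n) + acks(read_level, n) > n


n = 3  # replication factor assumed for this example
for w in LEVELS:
    for r in LEVELS:
        verdict = "yes" if reads_overlap_writes(w, r, n) else "no"
        print(f"{w:>6} / {r:<6} -> {verdict}")
```

For a replication factor of 3, only ONE / ONE, ONE / QUORUM and QUORUM / ONE fail the check; the three pairings listed above all pass it.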

Tenant states and data objects

Each tenant in a multi-tenant collection has a configurable tenant state, which determines the availability and location of the tenant's data. The tenant state can be set to active, inactive, or offloaded.

An active tenant's data should be available for queries and updates, while an inactive or offloaded tenant's data should not be.

However, there can be a delay between the time a tenant state is set, and when the tenant's data reflects the (declarative) tenant state.

As a result, a tenant's data may be available for queries for a period of time even if the tenant state is set to inactive or offloaded. Conversely, a tenant's data may not be available for queries and updates for a period of time even if the tenant state is set to active.

Why is this not addressed by repair-on-read?

For speed, data operations on a tenant occur independently of any tenant activity status operations. As a result, tenant states are not updated by repair-on-read operations.

Repairs

When Weaviate detects inconsistent data across replicas, it attempts to repair the out-of-sync data.

Starting in v1.26, Weaviate adds async replication to proactively detect inconsistencies. Earlier versions rely solely on a repair-on-read strategy, which repairs inconsistencies at read time.

Repair-on-read

Added in v1.18

If your read consistency is set to ALL or QUORUM, the read coordinator will receive responses from multiple replicas. If these responses differ, the coordinator can attempt to repair the inconsistency, as shown in the examples below. This process is called "repair-on-read", or "read repairs".

| Problem | Action |
| --- | --- |
| Object never existed on some replicas. | Propagate the object to the missing replicas. |
| Object is out of date. | Update the object on stale replicas. |
| Object was deleted on some replicas. | Returns an error. Deletion may have failed, or the object may have been partially recreated. |

The read repair process also depends on the read and write consistency levels used.

| Write consistency level | Read consistency level |
| --- | --- |
| ONE | ALL |
| QUORUM | QUORUM or ALL |
| ALL | - |

Repairs only happen on read, so they do not create a lot of background overhead. While nodes are in an inconsistent state, read operations with consistency level of ONE may return stale data.
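The first two repair cases (a missing object and a stale object) can be sketched as follows. This is an illustration under stated assumptions, not Weaviate's implementation: `read_repair` is a hypothetical function, replicas are plain dictionaries, and an `updated_at` counter is assumed as the way to decide which copy is newest. The deletion case, which returns an error, is not modeled.

```python
from typing import Optional


def read_repair(replicas: dict[str, Optional[dict]]) -> dict:
    """Bring all replicas that took part in a read up to the newest copy.

    `replicas` maps a replica name to its stored object, or None if the
    object never reached that replica. Returns the newest object.
    """
    present = [obj for obj in replicas.values() if obj is not None]
    if not present:
        raise LookupError("object not found on any replica")
    newest = max(present, key=lambda obj: obj["updated_at"])
    for name, obj in replicas.items():
        if obj is None or obj["updated_at"] < newest["updated_at"]:
            # Propagate to a missing replica, or update a stale one.
            replicas[name] = dict(newest)
    return newest


replicas = {
    "node-1": {"updated_at": 2, "title": "new"},
    "node-2": {"updated_at": 1, "title": "old"},  # stale
    "node-3": None,                               # never received the object
}
result = read_repair(replicas)
print(result["title"])                                # new
print(all(obj == result for obj in replicas.values()))  # True
```

Only replicas that respond to the read are repaired this way, which is why a read with consistency level ONE cannot detect or fix a mismatch.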

Async replication

Added in v1.26

Async replication runs in the background. It uses a Merkle tree algorithm to monitor and compare the state of nodes within a cluster. If the algorithm identifies an inconsistency, it resyncs the data on the inconsistent node.
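The general idea of comparing replicas with a Merkle tree can be sketched as below. This is a generic textbook construction, not Weaviate's actual tree layout or hashing scheme, which the text above does not specify. The function names are hypothetical, and the sketch assumes both nodes split their data into the same power-of-two number of buckets.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return tree levels, from leaf hashes (index 0) up to the root (last)."""
    count = len(leaves)
    if count == 0 or count & (count - 1):
        raise ValueError("this sketch needs a power-of-two number of leaves")
    level = [_hash(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Walk both trees from the root, descending only where hashes differ."""
    suspects = [0]
    for depth in range(len(a) - 1, 0, -1):
        suspects = [
            child
            for i in suspects
            if a[depth][i] != b[depth][i]
            for child in (2 * i, 2 * i + 1)
        ]
    return [i for i in suspects if a[0][i] != b[0][i]]


node_a = [b"obj1:v2", b"obj2:v1", b"obj3:v5", b"obj4:v1"]
node_b = [b"obj1:v2", b"obj2:v1", b"obj3:v4", b"obj4:v1"]  # obj3 is stale
print(differing_leaves(build_tree(node_a), build_tree(node_b)))  # [2]
```

When the root hashes match, the comparison stops immediately, so checking two replicas that are already in sync is cheap.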

Repair-on-read works well with one or two isolated repairs. Async replication is effective in situations where there are many inconsistencies. For example, if an offline node misses a series of updates, async replication quickly restores consistency when the node returns to service.

Async replication supplements the repair-on-read mechanism. If a node becomes inconsistent between sync checks, the repair-on-read mechanism catches the problem at read time.

To activate async replication, set asyncEnabled to true in the replicationConfig section of your collection definition.
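As a rough illustration, a collection definition with async replication enabled might look like the dictionary below. The collection name `Article` and the replication factor of 3 are assumptions made for this example; only `replicationConfig` and `asyncEnabled` come from the text above.

```python
import json

collection_definition = {
    "class": "Article",  # hypothetical collection name
    "replicationConfig": {
        "factor": 3,           # replication factor assumed for this example
        "asyncEnabled": True,  # activates async replication (v1.26+)
    },
}

# Serialized form, as it would appear in a JSON collection definition.
print(json.dumps(collection_definition, indent=2))
```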
