Backups


Introduction

Weaviate’s Backup feature is designed to be easy to use and to work natively with cloud technology. Most notably:

  • Seamless integration with widely used cloud blob storage, such as AWS S3 or GCS
  • Backup and restore between different storage providers
  • Single-command backup and restore from the REST API
  • Support for whole-instance backups, as well as backups of selected classes
  • Zero downtime and minimal impact on your users while backups are running
  • Easy migration to new environments

Note: The backup functionality was introduced in Weaviate v1.15

Configuration

To perform backups, a backup provider module must be activated. Multiple backup providers can be active at the same time. Currently Weaviate supports the backup-s3, backup-gcs, and backup-filesystem providers. Providers are well decoupled, which makes it easy to add new ones in the future.

All service-discovery and authentication-related configuration is set using environment variables.

S3 (AWS or S3-compatible)

Use the backup-s3 module to enable backing up to and restoring from any S3-compatible blob storage, such as AWS S3 or MinIO.

To enable the module, set the following environment variable:

ENABLE_MODULES=backup-s3

Modules are comma-separated. For example, to combine the module with the text2vec-transformers module, set:

ENABLE_MODULES=backup-s3,text2vec-transformers
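
The backend name used later in the API is the module name without the backup- prefix. As a minimal sketch of that relationship, the helper below splits an ENABLE_MODULES value and returns the backend names of the backup providers it contains. The function name `enabled_backup_backends` is hypothetical and not part of Weaviate; the whitespace trimming is a convenience of this sketch, not documented Weaviate behavior.

```python
def enabled_backup_backends(enable_modules: str) -> list[str]:
    """Return backend names (module name minus 'backup-') in listed order."""
    prefix = "backup-"
    # Split the comma-separated value and ignore empty entries.
    modules = [m.strip() for m in enable_modules.split(",") if m.strip()]
    # Keep only backup provider modules and strip the prefix.
    return [m[len(prefix):] for m in modules if m.startswith(prefix)]


print(enabled_backup_backends("backup-s3,text2vec-transformers"))
```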

S3 Configuration (vendor-agnostic)

In addition to activating the module, you need to provide configuration. This configuration applies to any S3-compatible backend.

Environment variable | Required | Description
BACKUP_S3_BUCKET | yes | The name of the S3 bucket for all backups.
BACKUP_S3_PATH | no | The root path inside your bucket that all your backups will be copied into and retrieved from. Optional, defaults to "" which means that the backups will be stored in the bucket root instead of a sub-folder.
BACKUP_S3_ENDPOINT | no | The S3 endpoint to be used. Optional, defaults to "s3.amazonaws.com".
BACKUP_S3_USE_SSL | no | Whether the connection should be secured with SSL/TLS. Optional, defaults to "true".
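
To make the required and optional settings concrete, here is a small pre-flight check that a deployment script could run before starting Weaviate. It is only a sketch: `resolve_s3_config` is a hypothetical helper, the bucket name and endpoint in the example are made up, and the way the USE_SSL string is interpreted is an assumption rather than Weaviate's actual parsing logic. The defaults are the ones listed in the table.

```python
def resolve_s3_config(env: dict[str, str]) -> dict[str, object]:
    """Apply the documented defaults to the vendor-agnostic S3 settings."""
    bucket = env.get("BACKUP_S3_BUCKET", "")
    if not bucket:
        # The bucket is the only required setting.
        raise ValueError("BACKUP_S3_BUCKET is required")
    return {
        "bucket": bucket,
        "path": env.get("BACKUP_S3_PATH", ""),  # "" means the bucket root
        "endpoint": env.get("BACKUP_S3_ENDPOINT", "s3.amazonaws.com"),
        # Assumption: anything other than "true" (case-insensitive) disables SSL.
        "use_ssl": env.get("BACKUP_S3_USE_SSL", "true").lower() == "true",
    }


# Example: a local S3-compatible server without TLS (hypothetical values).
print(resolve_s3_config({
    "BACKUP_S3_BUCKET": "my-backups",
    "BACKUP_S3_ENDPOINT": "localhost:9000",
    "BACKUP_S3_USE_SSL": "false",
}))
```

In a real script you would pass `dict(os.environ)` instead of a literal dictionary.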

S3 Configuration (AWS-specific)

In addition to the vendor-agnostic configuration from above, you can set AWS-specific configuration for authentication. You can choose between access-key or ARN-based authentication:

Option 1: With access key and secret access key

Environment variable | Description
AWS_ACCESS_KEY_ID | The id of the AWS access key for the desired account.
AWS_SECRET_ACCESS_KEY | The secret AWS access key for the desired account.
AWS_REGION | The AWS Region. If not provided, the module will try to parse AWS_DEFAULT_REGION.

Option 2: With IAM and ARN roles

Environment variable | Description
AWS_ROLE_ARN | The unique AWS identifier of the role.
AWS_WEB_IDENTITY_TOKEN_FILE | The path to the web identity token file.
AWS_REGION | The AWS Region. If not provided, the module will try to parse AWS_DEFAULT_REGION.
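
The two authentication options and the region fallback can be summarized in a short check. This is a sketch for deployment scripts, not the module's own logic: `pick_aws_auth` is a hypothetical name, and preferring the access key when both options are set is an assumption made purely for illustration.

```python
def pick_aws_auth(env: dict[str, str]) -> dict[str, object]:
    """Report which documented AWS auth option the environment satisfies."""
    # Documented fallback: AWS_REGION first, then AWS_DEFAULT_REGION.
    region = env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return {"method": "access-key", "region": region}  # Option 1
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return {"method": "role-arn", "region": region}    # Option 2
    raise ValueError("neither access-key nor role-based settings are complete")


# Hypothetical role ARN and token path, for illustration only.
print(pick_aws_auth({
    "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/example",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/token",
    "AWS_DEFAULT_REGION": "eu-west-1",
}))
```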

GCS (Google Cloud Storage)

Use the backup-gcs module to enable backing up to and restoring from any Google Cloud Storage bucket.

To enable the module, set the following environment variable:

ENABLE_MODULES=backup-gcs

Modules are comma-separated. For example, to combine the module with the text2vec-transformers module, set:

ENABLE_MODULES=backup-gcs,text2vec-transformers

In addition to activating the module, you need to provide configuration:

Environment variable | Required | Description
BACKUP_GCS_PATH | no | The root path inside your bucket that all your backups will be copied into and retrieved from. Optional, defaults to "" which means that the backups will be stored in the bucket root instead of a sub-folder.

Google Application Default Credentials

The backup-gcs module follows the Google Application Default Credentials best practices. This means that credentials can be discovered through the environment, through a local Google Cloud CLI setup, or through an attached service account.

This makes it easy to use the same module in different setups. For example, you can use the environment-based approach in production, and the CLI-based approach on your local machine. This way you can easily pull a backup that was created in a remote environment to your local system. This can be helpful in debugging an issue, for example.

Environment-based Configuration

Environment variable | Example value | Description
GOOGLE_APPLICATION_CREDENTIALS | /your/google/credentials.json | The path to the secret GCP service account or workload identity file.
GCP_PROJECT | my-gcp-project | Optional. If you use a service account with GOOGLE_APPLICATION_CREDENTIALS the service account will already contain a Google project. You can use this variable to explicitly set a project if you are using user credentials which may have access to more than one project.

Filesystem

Instead of backing up to a remote backend, you can also back up to the local filesystem. This may be helpful during development, for example to quickly exchange setups or to protect a known state against accidental future changes.

To allow backups to the local filesystem, enable the backup-filesystem module like so:

ENABLE_MODULES=backup-filesystem

Modules are comma-separated. For example, to combine the module with the text2vec-transformers module, set:

ENABLE_MODULES=backup-filesystem,text2vec-transformers

In addition to activating the module, you need to provide configuration:

Environment variable | Required | Description
BACKUP_FILESYSTEM_PATH | yes | The root path that all your backups will be copied into and retrieved from.

Other Backup Backends

Weaviate uses its module system to decouple the backup orchestration from the remote backup storage backends. It is easy to add new providers and use them with the existing backup API. If you are missing your desired backup module, you can open a feature request or contribute it yourself. For either option, join our Slack community to have a quick chat with us on how to get started.

API

Create Backup

Once the modules are enabled and the configuration is provided, you can start a backup on any running instance with a single HTTP request.

Method and URL

POST /v1/backups/{backend}

Parameters

URL Parameters

name | type | required | description
backend | string | yes | The name of the backup provider module without the backup- prefix, for example s3, gcs, or filesystem.

Request Body

The request takes a JSON object with the following properties:

name | type | required | description
id | string (lowercase letters, numbers, underscore, minus) | yes | The id of the backup. This string must be provided on all future requests, such as status checking or restoration.
include | list of strings | no | An optional list of class names to be included in the backup. If not set, all classes are included.
exclude | list of strings | no | An optional list of class names to be excluded from the backup. If not set, no classes are excluded.

Note: You cannot set include and exclude at the same time. Set none or exactly one of those.

Python:

import weaviate

client = weaviate.Client('http://localhost:8080')

result = client.backup.create(
  backup_id='my-very-first-backup',
  backend='filesystem',
  include_classes=["Article", "Publication"],
  wait_for_completion=True,
)

print(result)

JavaScript:

const weaviate = require("weaviate-client");

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

client.backup.creator()
  .withIncludeClassNames("Article", "Publication")
  .withBackend("filesystem")
  .withBackupId("my-very-first-backup")
  .withWaitForCompletion(true)
  .do()
  .then(console.log)
  .catch(console.error)

Go:

package main

import (
  "context"
  "fmt"

  "github.com/semi-technologies/weaviate-go-client/v4/weaviate"
  "github.com/semi-technologies/weaviate-go-client/v4/weaviate/backup"
)

func main() {
  cfg := weaviate.Config{
    Host:   "localhost:8080",
    Scheme: "http",
  }
  client := weaviate.New(cfg)

  result, err := client.Backup().Creator().
    WithIncludeClassNames("Article", "Publication").
    WithBackend(backup.BACKEND_FILESYSTEM).
    WithBackupID("my-very-first-backup").
    WithWaitForCompletion(true).
    Do(context.Background())

  if err != nil {
    panic(err)
  }
  fmt.Printf("%v", result)
}

Java:

package technology.semi.weaviate;

import technology.semi.weaviate.client.Config;
import technology.semi.weaviate.client.WeaviateClient;
import technology.semi.weaviate.client.base.Result;
import technology.semi.weaviate.client.v1.backup.model.Backend;
import technology.semi.weaviate.client.v1.backup.model.BackupCreateResponse;

public class App {
  public static void main(String[] args) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);

    Result<BackupCreateResponse> result = client.backup().creator()
      .withIncludeClassNames("Article", "Publication")
      .withBackend(Backend.FILESYSTEM)
      .withBackupId("my-very-first-backup")
      .withWaitForCompletion(true)
      .run();

    if (result.hasErrors()) {
      System.out.println(result.getError());
      return;
    }
    System.out.println(result.getResult());
  }
}

Curl:

$ curl \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
         "id": "my-very-first-backup",
         "include": ["Article", "Publication"]
        }' \
    http://localhost:8080/v1/backups/filesystem

While you are waiting for a backup to complete, Weaviate stays fully usable.
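
The request-body rules described above (the allowed characters in id, and include and exclude being mutually exclusive) can be checked on the client side before the request is sent. The following is a minimal sketch; `build_backup_request` is a hypothetical helper, and the regular expression is simply a direct reading of "lowercase letters, numbers, underscore, minus".

```python
import json
import re

# Direct reading of the documented id format.
_BACKUP_ID = re.compile(r"[a-z0-9_-]+")


def build_backup_request(backup_id, include=None, exclude=None):
    """Build the JSON body for POST /v1/backups/{backend}."""
    if not _BACKUP_ID.fullmatch(backup_id):
        raise ValueError(f"invalid backup id: {backup_id!r}")
    if include and exclude:
        raise ValueError("set either include or exclude, not both")
    body = {"id": backup_id}
    if include:
        body["include"] = list(include)
    if exclude:
        body["exclude"] = list(exclude)
    return body


print(json.dumps(build_backup_request(
    "my-very-first-backup", include=["Article", "Publication"])))
```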

Asynchronous Status Checking

All client implementations have a “wait for completion” option which will poll the backup status in the background and only return once the backup has completed (successfully or unsuccessfully).

If you set the “wait for completion” option to false, you can also check the status yourself using the Backup Creation Status API.

GET /v1/backups/{backend}/{backup_id}

Parameters

URL Parameters

name | type | required | description
backend | string | yes | The name of the backup provider module without the backup- prefix, for example s3, gcs, or filesystem.
backup_id | string | yes | The user-provided backup identifier that was used when sending the request to create the backup.

The response contains a "status" field. If the status is SUCCESS, the backup is complete. If the status is FAILED, an additional error is provided.

Python:

import weaviate

client = weaviate.Client('http://localhost:8080')

result = client.backup.get_create_status(
  backup_id='my-very-first-backup',
  backend='filesystem',
)

print(result)

JavaScript:

const weaviate = require("weaviate-client");

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

client.backup.createStatusGetter()
  .withBackend("filesystem")
  .withBackupId("my-very-first-backup")
  .do()
  .then(console.log)
  .catch(console.error)

Go:

package main

import (
  "context"
  "fmt"

  "github.com/semi-technologies/weaviate-go-client/v4/weaviate"
  "github.com/semi-technologies/weaviate-go-client/v4/weaviate/backup"
)

func main() {
  cfg := weaviate.Config{
    Host:   "localhost:8080",
    Scheme: "http",
  }
  client := weaviate.New(cfg)

  result, err := client.Backup().CreateStatusGetter().
    WithBackend(backup.BACKEND_FILESYSTEM).
    WithBackupID("my-very-first-backup").
    Do(context.Background())

  if err != nil {
    panic(err)
  }
  fmt.Printf("%v", result)
}

Java:

package technology.semi.weaviate;

import technology.semi.weaviate.client.Config;
import technology.semi.weaviate.client.WeaviateClient;
import technology.semi.weaviate.client.base.Result;
import technology.semi.weaviate.client.v1.backup.model.Backend;
import technology.semi.weaviate.client.v1.backup.model.BackupCreateStatusResponse;

public class App {
  public static void main(String[] args) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);

    Result<BackupCreateStatusResponse> result = client.backup().createStatusGetter()
      .withBackend(Backend.FILESYSTEM)
      .withBackupId("my-very-first-backup")
      .run();

    if (result.hasErrors()) {
      System.out.println(result.getError());
      return;
    }
    System.out.println(result.getResult());
  }
}

Curl:

$ curl http://localhost:8080/v1/backups/filesystem/my-very-first-backup

Restore Backup

You can restore any backup to any machine as long as the number of nodes in the source and the target is identical. The backup does not need to have been created on the same instance. Once a backup backend is configured, you can restore a backup with a single HTTP request.

Note that a restore fails if any of the classes already exist on this instance.

Method and URL

POST /v1/backups/{backend}/{backup_id}/restore

Parameters

URL Parameters

name | type | required | description
backend | string | yes | The name of the backup provider module without the backup- prefix, for example s3, gcs, or filesystem.
backup_id | string | yes | The user-provided backup identifier that was used when sending the request to create the backup.

Request Body

The request takes a JSON object with the following properties:

name | type | required | description
include | list of strings | no | An optional list of class names to be included in the restore. If not set, all classes in the backup are included.
exclude | list of strings | no | An optional list of class names to be excluded from the restore. If not set, no classes are excluded.

Note 1: You cannot set include and exclude at the same time. Set none or exactly one of those.

Note 2: include and exclude are relative to the classes contained in the backup. The restore process does not know which classes existed on the source machine if they were not part of the backup.

Python:

import weaviate

client = weaviate.Client('http://localhost:8080')

result = client.backup.restore(
  backup_id='my-very-first-backup',
  backend='filesystem',
  exclude_classes="Article",
  wait_for_completion=True,
)

print(result)

JavaScript:

const weaviate = require("weaviate-client");

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

client.backup.restorer()
  .withExcludeClassNames("Article")
  .withBackend("filesystem")
  .withBackupId("my-very-first-backup")
  .withWaitForCompletion(true)
  .do()
  .then(console.log)
  .catch(console.error)

Go:

package main

import (
  "context"
  "fmt"

  "github.com/semi-technologies/weaviate-go-client/v4/weaviate"
  "github.com/semi-technologies/weaviate-go-client/v4/weaviate/backup"
)

func main() {
  cfg := weaviate.Config{
    Host:   "localhost:8080",
    Scheme: "http",
  }
  client := weaviate.New(cfg)

  result, err := client.Backup().Restorer().
    WithExcludeClassNames("Article").
    WithBackend(backup.BACKEND_FILESYSTEM).
    WithBackupID("my-very-first-backup").
    WithWaitForCompletion(true).
    Do(context.Background())

  if err != nil {
    panic(err)
  }
  fmt.Printf("%v", result)
}

Java:

package technology.semi.weaviate;

import technology.semi.weaviate.client.Config;
import technology.semi.weaviate.client.WeaviateClient;
import technology.semi.weaviate.client.base.Result;
import technology.semi.weaviate.client.v1.backup.model.Backend;
import technology.semi.weaviate.client.v1.backup.model.BackupRestoreResponse;

public class App {
  public static void main(String[] args) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);

    Result<BackupRestoreResponse> result = client.backup().restorer()
      .withExcludeClassNames("Article")
      .withBackend(Backend.FILESYSTEM)
      .withBackupId("my-very-first-backup")
      .withWaitForCompletion(true)
      .run();

    if (result.hasErrors()) {
      System.out.println(result.getError());
      return;
    }
    System.out.println(result.getResult());
  }
}

Curl:

$ curl \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
         "id": "my-very-first-backup",
         "exclude": ["Article"]
        }' \
    http://localhost:8080/v1/backups/filesystem/my-very-first-backup/restore
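
The restore rules above interact: include and exclude select from the classes in the backup, and the restore fails if a class already exists on the target instance. The sketch below models that selection so you can reason about a restore before running it. `plan_restore` is a hypothetical helper. It assumes the "already exists" rule applies to the classes selected for restore, and it rejects include names that are not in the backup; how Weaviate itself reacts to such names is not covered here.

```python
def plan_restore(backup_classes, existing_classes, include=None, exclude=None):
    """Return the classes a restore would create, or raise on a conflict."""
    if include and exclude:
        raise ValueError("set either include or exclude, not both")
    available = list(backup_classes)
    if include:
        # Assumption of this sketch: unknown names are treated as an error.
        unknown = [c for c in include if c not in available]
        if unknown:
            raise ValueError(f"not part of the backup: {unknown}")
        selected = [c for c in available if c in include]
    elif exclude:
        selected = [c for c in available if c not in exclude]
    else:
        selected = available
    # Documented rule: a restore fails if a class already exists.
    conflicts = [c for c in selected if c in existing_classes]
    if conflicts:
        raise ValueError(f"already exist on target: {conflicts}")
    return selected


print(plan_restore(["Article", "Publication"], existing_classes=[],
                   exclude=["Article"]))
```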

Asynchronous Status Checking

All client implementations have a “wait for completion” option which will poll the restore status in the background and only return once the restore has completed (successfully or unsuccessfully).

If you set the “wait for completion” option to false, you can also check the status yourself using the Backup Restore Status API.

GET /v1/backups/{backend}/{backup_id}/restore

Parameters

URL Parameters

name | type | required | description
backend | string | yes | The name of the backup provider module without the backup- prefix, for example s3, gcs, or filesystem.
backup_id | string | yes | The user-provided backup identifier that was used when sending the requests to create and restore the backup.

The response contains a "status" field. If the status is SUCCESS, the restore is complete. If the status is FAILED, an additional error is provided.

Python:

import weaviate

client = weaviate.Client('http://localhost:8080')

result = client.backup.get_restore_status(
  backup_id='my-very-first-backup',
  backend='filesystem',
)

print(result)

JavaScript:

const weaviate = require("weaviate-client");

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

client.backup.restoreStatusGetter()
  .withBackend("filesystem")
  .withBackupId("my-very-first-backup")
  .do()
  .then(console.log)
  .catch(console.error)

Go:

package main

import (
  "context"
  "fmt"

  "github.com/semi-technologies/weaviate-go-client/v4/weaviate"
  "github.com/semi-technologies/weaviate-go-client/v4/weaviate/backup"
)

func main() {
  cfg := weaviate.Config{
    Host:   "localhost:8080",
    Scheme: "http",
  }
  client := weaviate.New(cfg)

  result, err := client.Backup().RestoreStatusGetter().
    WithBackend(backup.BACKEND_FILESYSTEM).
    WithBackupID("my-very-first-backup").
    Do(context.Background())

  if err != nil {
    panic(err)
  }
  fmt.Printf("%v", result)
}

Java:

package technology.semi.weaviate;

import technology.semi.weaviate.client.Config;
import technology.semi.weaviate.client.WeaviateClient;
import technology.semi.weaviate.client.base.Result;
import technology.semi.weaviate.client.v1.backup.model.Backend;
import technology.semi.weaviate.client.v1.backup.model.BackupRestoreStatusResponse;

public class App {
  public static void main(String[] args) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);

    Result<BackupRestoreStatusResponse> result = client.backup().restoreStatusGetter()
      .withBackend(Backend.FILESYSTEM)
      .withBackupId("my-very-first-backup")
      .run();

    if (result.hasErrors()) {
      System.out.println(result.getError());
      return;
    }
    System.out.println(result.getResult());
  }
}

Curl:

$ curl http://localhost:8080/v1/backups/filesystem/my-very-first-backup/restore
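
For quick reference, the four endpoints described in this section follow one URL pattern. The helper below assembles them from a base URL, a backend name, and a backup id. It is a convenience sketch only; `backup_endpoints` is a hypothetical name and is not part of any Weaviate client.

```python
from urllib.parse import quote


def backup_endpoints(base_url: str, backend: str, backup_id: str) -> dict:
    """Map each backup operation to its (HTTP method, URL) pair."""
    root = f"{base_url.rstrip('/')}/v1/backups/{quote(backend, safe='')}"
    item = f"{root}/{quote(backup_id, safe='')}"
    return {
        "create": ("POST", root),
        "create_status": ("GET", item),
        "restore": ("POST", f"{item}/restore"),
        "restore_status": ("GET", f"{item}/restore"),
    }


endpoints = backup_endpoints(
    "http://localhost:8080", "filesystem", "my-very-first-backup")
for name, (method, url) in endpoints.items():
    print(f"{name}: {method} {url}")
```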

Technical Considerations

Read & Write requests while a backup is running

The backup process is designed to be minimally invasive to a running setup. Even on very large setups, where terabytes of data need to be copied, Weaviate stays fully usable. It even accepts write requests while a backup process is running. This section explains how backups work under the hood and why Weaviate can safely accept writes while a backup is being copied.

Weaviate uses a custom LSM store for its object store and inverted index. LSM stores are a hybrid of immutable disk segments and an in-memory structure called a memtable that accepts all writes (including updates and deletes). Files on disk are immutable most of the time. There are only three situations in which files on disk change:

  1. Anytime a memtable is flushed. This creates a new segment. Existing segments are not changed.
  2. Any write into the memtable is also written into a Write-Ahead-Log (WAL). The WAL is only needed for disaster-recovery. Once a segment has been orderly flushed, the WAL can be discarded.
  3. There is an async background process called Compaction that optimizes existing segments. It can merge two small segments into a single larger segment and remove redundant data as part of the process.

Weaviate’s Backup implementation makes use of the above properties in the following ways:

  1. Weaviate first flushes all active memtables to disk. This takes tens to hundreds of milliseconds. Pending write requests simply wait for a new memtable to be created; no requests fail and there are no substantial delays.
  2. Now that the memtables are flushed, there is a guarantee: All data that should be part of the backup is present in the existing disk segments. Any data that will be imported after the backup request ends up in new disk segments. The backup references a list of immutable files.
  3. To prevent a compaction process from changing the files on disk while they are being copied, compactions are temporarily paused until all files have been copied. They are automatically resumed right after.

This way the backup process can guarantee that the files that are transferred to the remote backend are immutable (and thus safe to copy) even with new writes coming in. Even if it takes minutes or hours to back up a very large setup, Weaviate stays fully usable without any user impact while the backup process is running.
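
The three steps above can be illustrated with a toy model. This is a deliberately simplified sketch and not Weaviate's implementation: `ToyLSMStore` and all of its methods are hypothetical, segments are modeled as immutable tuples, and the write-ahead log is left out entirely.

```python
class ToyLSMStore:
    """Toy model: a mutable memtable plus a list of immutable segments."""

    def __init__(self):
        self.memtable = {}
        self.segments = []  # each segment is an immutable tuple of pairs
        self.compaction_paused = False

    def put(self, key, value):
        self.memtable[key] = value  # all writes go to the memtable

    def flush(self):
        # A flush creates a new segment; existing segments are untouched.
        if self.memtable:
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def compact(self):
        # Merge all segments into one, unless a backup has paused compaction.
        if self.compaction_paused or len(self.segments) < 2:
            return False
        merged = {}
        for segment in self.segments:
            merged.update(segment)  # newer segments override older ones
        self.segments = [tuple(sorted(merged.items()))]
        return True

    def begin_backup(self):
        self.flush()                   # step 1: flush active memtables
        self.compaction_paused = True  # step 3: pause compaction
        return list(self.segments)     # step 2: fixed list of files to copy

    def end_backup(self):
        self.compaction_paused = False


store = ToyLSMStore()
store.put("a", 1)
store.flush()
store.put("b", 2)
files_to_copy = store.begin_backup()
store.put("c", 3)  # a write arriving while the backup is copying
store.flush()      # lands in a new segment, not in the backup's files
compacted_during_backup = store.compact()
unchanged = store.segments[:len(files_to_copy)] == files_to_copy
store.end_backup()
print(compacted_during_backup, unchanged, len(store.segments))  # False True 3
```

The write of "c" succeeds during the backup, yet the segments captured by `begin_backup` stay exactly as they were, which is the property that makes them safe to copy.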

It is not just safe - but even recommended - to create backups on live production instances while they are serving user requests.

Async nature of the Backup API

The backup API is built in a way that no long-running network requests are required. The request to create a new backup does some basic validation, then returns to the user immediately. The backup is now in status STARTED. To get the status of a running backup, you can poll the status endpoint. This makes the backup itself resilient to network or client failures.

If you would like your application to wait for an async process to complete you can use the “wait for completion” feature that is present in all language clients. The clients will poll the status endpoint in the background and block until the status is either SUCCESS or FAILED. This makes it easy to write simple synchronous backup scripts, even with the async nature of the API.
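
A polling loop of the kind the clients run can be sketched in a few lines. This is an illustration under stated assumptions, not the clients' actual code: `wait_for_completion` and its parameters are hypothetical, the status fetch is passed in as a callable so the example runs without a server, and the "error" key on a failed response is an assumed field name.

```python
import time


def wait_for_completion(fetch_status, interval=1.0, timeout=60.0,
                        sleep=time.sleep):
    """Poll fetch_status() until the status is SUCCESS or FAILED."""
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_status()
        status = response.get("status")
        if status == "SUCCESS":
            return response
        if status == "FAILED":
            # Assumption: the failure details are under an "error" key.
            raise RuntimeError(f"backup failed: {response.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout} seconds")
        sleep(interval)  # any other status means "still in progress"


# Simulated status responses instead of real HTTP calls.
responses = iter([{"status": "STARTED"}, {"status": "STARTED"},
                  {"status": "SUCCESS"}])
result = wait_for_completion(lambda: next(responses), interval=0.01)
print(result["status"])
```

Because the backup runs on the server, a script that crashes or loses its connection while polling does not affect the backup; it can simply start polling again.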

Limitations & Outlook

As of Weaviate v1.15, backups are limited to single-node setups. Weaviate v1.16 will introduce support for multi-node setups. You can read the technical proposal and track the progress on the feature here. The same proposal will also make backups more resilient against node restarts. In v1.15, an unexpected node restart during a backup operation leads to a failed backup. You can always create a new backup after the restart.

Other Use cases

Migrating to another environment

The flexibility around backup providers opens up new use cases. Besides using the backup & restore feature for disaster recovery, you can also use it for duplicating environments or migrating between clusters.

For example, consider the following situation: You would like to run a load test on production data. If you ran the load test in production, it might affect users. An easy way to get meaningful results without affecting users is to duplicate your entire environment. Once the new production-like “loadtest” environment is up, simply create a backup from your production environment and restore it into your “loadtest” environment. This even works if the production environment is running on a completely different cloud provider than the new environment.
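
The duplication workflow can be scripted with the Python client calls shown earlier in this document. The sketch below is hedged in several ways: `migrate` is a hypothetical helper, it assumes both environments are configured with the same backup backend and storage location, and it assumes the client calls return a dictionary with a "status" field.

```python
def migrate(source_client, target_client, backup_id, backend):
    """Back up the source environment and restore it into the target."""
    created = source_client.backup.create(
        backup_id=backup_id,
        backend=backend,
        wait_for_completion=True,  # block until the backup has finished
    )
    # Assumption: the result is a dict carrying the final status.
    if created.get("status") != "SUCCESS":
        raise RuntimeError(f"backup did not succeed: {created}")
    restored = target_client.backup.restore(
        backup_id=backup_id,
        backend=backend,
        wait_for_completion=True,  # block until the restore has finished
    )
    if restored.get("status") != "SUCCESS":
        raise RuntimeError(f"restore did not succeed: {restored}")
    return restored
```

A call could look like `migrate(weaviate.Client('http://production:8080'), weaviate.Client('http://loadtest:8080'), 'loadtest-copy', 's3')`, where both host names and the backup id are placeholders.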

More Resources

If you can’t find the answer to your question here, please look at:

  1. The Frequently Asked Questions
  2. The knowledge base of old issues
  3. Stackoverflow, for questions
  4. Github, for issues
  5. The Slack channel, to ask your question directly