Weaviate is designed to be easy to monitor and observe by following a cloud-native approach. To do this, Weaviate supports the following features:
Publishing of Prometheus metrics to the standard
Use of built-in Kubernetes liveness and readiness checks
Configuration of settings via environment variables
Simplified deployment via helm charts
One common question though is: How can I integrate Weaviate with my existing observability stack?
This article describes two approaches using either Grafana agent or Datadog agent to scrape these metrics. It also provides a list of important metrics to monitor.
This article assumes that you have already deployed Weaviate. Prometheus monitoring is disabled by default; you can enable it with this environment setting:
Weaviate will then publish Prometheus metrics on port
If you are using Weaviate 1.17 or lower, you may want to upgrade to 1.18 before enabling Prometheus metrics. Weaviate previously published many histograms, which have since been replaced by summaries for performance reasons. Additionally, be careful when enabling Prometheus metrics if you have many thousands of classes: some metrics are produced per class, so you may end up with high-cardinality labels.
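To gauge the cardinality risk before wiring up an agent, it can help to count how many series each metric exposes. The following is a minimal Python sketch, not part of Weaviate: it parses Prometheus text-format output and tallies series per metric name. The count_series helper and the sample payload (including the object_count series and its labels) are illustrative assumptions; in practice you would feed it the body returned by your own Weaviate metrics endpoint.

```python
from collections import Counter


def count_series(exposition_text: str) -> Counter:
    """Count time series per metric name in Prometheus text-format output."""
    counts: Counter = Counter()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or whitespace (value).
        cut_points = [i for i in (line.find("{"), line.find(" "), line.find("\t")) if i != -1]
        name = line[: min(cut_points)] if cut_points else line
        counts[name] += 1
    return counts


# Made-up sample payload, for illustration only.
SAMPLE = """\
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.2e+07
# TYPE object_count gauge
object_count{class_name="Article",shard_name="s1"} 10
object_count{class_name="Author",shard_name="s2"} 4
"""

series = count_series(SAMPLE)
for metric, n in series.most_common():
    print(f"{metric}: {n} series")
```

Metrics whose series count grows with the number of classes are the ones most likely to cause cardinality problems downstream.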
For the first approach we will use the open-source Grafana agent. In this case, we will show writing to Grafana Cloud for hosted metrics. This is configurable via the remote write section if you alternatively want to write to a self-hosted Mimir or Prometheus instance.
Steps to Install
1. Install Grafana agent in your target environment following the set-up guide.
2. Configure the Grafana agent.yaml to include a scrape job called weaviate. This will autodiscover Weaviate pods in Kubernetes. The app=weaviate label is automatically added by the Weaviate Helm chart, which makes autodiscovery easy.
    metrics:
      configs:
      - name: weaviate
        # reference https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
        scrape_configs:
        - job_name: weaviate
          kubernetes_sd_configs:
          - role: pod
            selectors:
            - role: "pod"
              label: "app=weaviate"
        remote_write:
        - url: <Your Grafana.com prometheus push url>
          basic_auth:
            username: <Your Grafana.com userid>
            password: <Your Grafana.com API Key>
3. Validate that you are receiving data by going to explore and running the following PromQL query in Grafana.
One benefit of this approach is that you can now reuse the existing Weaviate Grafana dashboards.
Steps to import these dashboards:
1. Download and import the preexisting dashboards.
2. If you're using Grafana Cloud hosted Prometheus, you will need to patch the dashboards to change the datasource uid to grafanacloud-prom, as below.
sed 's/"uid": "Prometheus"/"uid": "grafanacloud-prom"/g' querying.json > querying-patched.json
The dashboards should now be visible!
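If you have several dashboards to patch, the same substitution can be scripted. This is a small Python sketch equivalent to the sed command above, assuming the downloaded dashboards are JSON files in a single directory. The patch_dashboards helper name is hypothetical, and the -patched.json suffix simply mirrors the sed example.

```python
from pathlib import Path
from typing import List

OLD_UID = '"uid": "Prometheus"'
NEW_UID = '"uid": "grafanacloud-prom"'


def patch_dashboards(directory: str) -> List[Path]:
    """Write a '-patched.json' copy of every dashboard JSON file in `directory`."""
    written = []
    # sorted() materializes the file list before any new files are written.
    for src in sorted(Path(directory).glob("*.json")):
        if src.stem.endswith("-patched"):
            continue  # skip output from a previous run
        dst = src.with_name(f"{src.stem}-patched.json")
        patched = src.read_text(encoding="utf-8").replace(OLD_UID, NEW_UID)
        dst.write_text(patched, encoding="utf-8")
        written.append(dst)
    return written
```

Calling patch_dashboards("./dashboards") would leave the originals untouched and produce one patched copy per dashboard, ready to import.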
Datadog is another popular solution for observability, and the Datadog agent has support for scraping Prometheus metrics.
Steps to Install
1. Install the Datadog agent. For this example, installation was done using their Helm charts.
2. Provide a datadog-values.yml config including the below. You can also capture Weaviate logs using the method.
# Note DD_KUBELET_TLS_VERIFY only needs to be set if running a local docker kubernetes cluster
# - name: DD_KUBELET_TLS_VERIFY
# value: "false"
- max_returned_metrics: 20000
3. Customize the Weaviate helm chart to have annotations
# Pass any annotations to Weaviate pods
4. Validate that metrics are available. go_memstats_heap_inuse_bytes should always be present, even with an empty schema.
Below are some key Weaviate metrics to monitor. Standard CPU, disk, and network metrics are also useful, as are Kubernetes events. Note that some Weaviate metrics will not appear until an operation has occurred (for instance, batch operations).
For heap usage, the expectation is that memory will follow a standard jagged pattern under load, dropping periodically due to Go garbage collection. If memory is not dropping and stays very close to GOMEMLIMIT, you may need to increase resources.
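The "not dropping and staying close to GOMEMLIMIT" condition can be expressed as a simple check over a window of recent heap samples. The following is a minimal Python sketch, assuming you already have heap-in-use values in bytes (for example, from go_memstats_heap_inuse_bytes) and know the configured GOMEMLIMIT in bytes. The function name and the 0.9 threshold are illustrative choices, not Weaviate recommendations.

```python
from typing import Sequence


def heap_stuck_near_limit(
    samples: Sequence[float],
    gomemlimit_bytes: float,
    threshold: float = 0.9,
) -> bool:
    """Return True if heap usage never dropped below threshold * limit in the window."""
    if not samples:
        return False
    # A healthy jagged pattern has at least one low point after garbage collection,
    # so the minimum of the window is what matters, not the peak.
    return min(samples) >= threshold * gomemlimit_bytes


GIB = 1024 ** 3
healthy = [0.5 * GIB, 0.8 * GIB, 0.4 * GIB, 0.85 * GIB, 0.45 * GIB]
stuck = [0.95 * GIB, 0.97 * GIB, 0.96 * GIB]

print(heap_stuck_near_limit(healthy, gomemlimit_bytes=GIB))
print(heap_stuck_near_limit(stuck, gomemlimit_bytes=GIB))
```

Checking the minimum rather than the latest value avoids alerting on the normal peaks between garbage collection cycles.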
Batch latency is important because batch operations are the most efficient way to write data to Weaviate. Monitoring it can indicate whether there is a problem with indexing data. This metric has a label that allows you to see how long the objects, vectors, and inverted index sub-operations take. If you are using a vectorizer module, you will see additional latency due to the overhead of sending data to the module.
For batch deletes, the corresponding batch_delete_durations_ms metric will also be useful.
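Prometheus histograms and summaries both expose cumulative _sum and _count series, so an average latency over an interval can be derived from two scrapes of a duration metric. The Python sketch below shows only that arithmetic; the Scrape type, the function name, and the sample numbers are hypothetical and not taken from a real Weaviate deployment.

```python
from typing import NamedTuple, Optional


class Scrape(NamedTuple):
    total_ms: float  # value of the <metric>_sum series at scrape time
    count: float     # value of the <metric>_count series at scrape time


def mean_latency_ms(earlier: Scrape, later: Scrape) -> Optional[float]:
    """Average duration of the operations observed between two scrapes."""
    observations = later.count - earlier.count
    if observations <= 0:
        # No new operations in the interval, or the process restarted.
        return None
    return (later.total_ms - earlier.total_ms) / observations


# Hypothetical values: 4 batch operations took 600 ms in total between scrapes.
print(mean_latency_ms(Scrape(1000.0, 10), Scrape(1600.0, 14)))
```

Running the same calculation per label value gives a breakdown by sub-operation rather than a single overall average.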
Generally, batch indexing is recommended, but there are situations where you would perform single-object operations, such as handling live changes from a user in an application. In this case, you will want to monitor the object latency.
Query Latency and Rate
The latency and number of queries per second are also important, particularly for monitoring usage patterns.
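Queries per second are typically derived from a monotonically increasing counter, such as the _count series of a latency metric. The following is a rough Python sketch of that calculation, including the usual handling of a counter reset after a restart. The function name and numbers are illustrative; in a Prometheus-based stack, the PromQL rate() function performs a more careful version of this for you.

```python
def per_second_rate(
    value_earlier: float,
    time_earlier: float,
    value_later: float,
    time_later: float,
) -> float:
    """Per-second increase of a counter between two samples (timestamps in seconds)."""
    elapsed = time_later - time_earlier
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    increase = value_later - value_earlier
    if increase < 0:
        # The counter went backwards, so it was reset (e.g. a pod restart);
        # count only what has accumulated since the reset.
        increase = value_later
    return increase / elapsed


# Hypothetical samples taken 60 seconds apart.
print(per_second_rate(100, 0, 400, 60))
```

Tracking this value over time shows daily and weekly usage patterns, which is useful context when a latency change appears.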
Many other solutions have integrations for Prometheus and can also be used.
Weaviate is open source, and you can follow the project on GitHub. Don’t forget to give us a ⭐️ while you are there!