
ANN


About this benchmark

This benchmark is designed to measure and illustrate Weaviate's ANN performance for a range of real-life use cases.

note

This is not a comparative benchmark that runs Weaviate against competing solutions.

To make the most of this benchmark, you can look at it from different perspectives:

  • The overall performance – Review the benchmark result section below to draw conclusions about what to expect from Weaviate in a production setting.
  • Expectation for your use case – Find the dataset closest to your production use case, and estimate Weaviate's expected performance for your use case.
  • Fine Tuning – If you don't get the results you expect, find the optimal combination of the config parameters (efConstruction, maxConnections and ef) to achieve the best results for your production configuration.

What is being measured?​

For each benchmark test, we picked parameters of:

  • efConstruction - The HNSW build parameter that controls the quality of the search at build time.
  • maxConnections - The HNSW build parameter that controls how many outgoing edges a node can have in the HNSW graph.
  • ef - The HNSW query time parameter that controls the quality of the search.

For each set of parameters we've run 10000 requests and we measured:

  • The Recall@1, Recall@10, Recall@100 - by comparing Weaviate's results to the ground truths specified in each dataset
  • Multi-threaded Queries per Second (QPS) - The overall throughput you can achieve with each configuration
  • Individual Request Latency (mean) - The mean latency over all 10,000 requests
  • P99 Latency - 99% of all requests (9,900 out of 10,000) have a latency that is lower than or equal to this number – this shows how fast
  • Import time - Since varying build parameters has an effect on import time, the import time is also included
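As a rough illustration of how these metrics can be derived from raw measurements, the sketch below computes recall@k, the mean latency, and a nearest-rank percentile latency. The function names and the set-overlap definition of recall are assumptions made for this example; the benchmark scripts may compute these values differently.

```python
def recall_at_k(results, ground_truth, k):
    """Fraction of the true top-k neighbours found in the returned top-k (assumed definition)."""
    hits = len(set(results[:k]) & set(ground_truth[:k]))
    return hits / k


def mean_latency(latencies_ms):
    """Arithmetic mean over all measured request latencies."""
    return sum(latencies_ms) / len(latencies_ms)


def percentile_latency(latencies_ms, pct):
    """Nearest-rank percentile for an integer pct (e.g. 99).

    Returns the smallest measured latency such that at least pct% of
    all samples are lower than or equal to it.
    """
    ordered = sorted(latencies_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]
```

With 10,000 samples, `percentile_latency(samples, 99)` returns the 9,900th smallest value, which matches the description of the P99 latency above.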

By request, we mean: An unfiltered vector search across the entire dataset for the given test. All latency and throughput results represent the end-to-end time that your users would also experience. In particular, this means:

  • Each request time includes the network overhead for sending the results over the wire. In the test setup, the client and server machines were located in the same VPC.
  • Each request includes retrieving all the matched objects from disk. This is a significant difference from ann-benchmarks, where the embedded libraries only return the matched IDs.
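To make the idea of end-to-end, multi-threaded measurement concrete, here is a minimal sketch of a client-side timing loop. It is written in Python for illustration only (the benchmark itself measured concurrent queries using Go), and `search` is a hypothetical placeholder for whatever function sends one query and waits for the full response.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(search, query):
    """Run one request and return its end-to-end latency in milliseconds."""
    start = time.perf_counter()
    search(query)  # hypothetical: sends the query and waits for the full response
    return (time.perf_counter() - start) * 1000.0


def run_benchmark(search, queries, workers):
    """Send all queries from a pool of worker threads and summarize the run."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies_ms = list(pool.map(lambda q: timed_call(search, q), queries))
    wall_seconds = time.perf_counter() - wall_start
    return {
        "qps": len(queries) / wall_seconds,  # overall throughput
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "latencies_ms": latencies_ms,
    }
```

Because the timer wraps the whole call, network transfer and object retrieval are part of every measured latency, as described in the two points above.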

Benchmark Setup​

Scripts​

This benchmark is produced using open-source scripts, so you can reproduce it yourself.

Hardware​

Setup with Weaviate and benchmark machine

For the purpose of this benchmark we've used two GCP instances within the same VPC:

  • Benchmark – a c2-standard-30 instance with 30 vCPU cores and 120 GB memory – to host Weaviate.
  • Script – a smaller instance with 8 vCPU – to run benchmarking scripts.

💡 The c2-standard-30 was chosen for benchmarking for two reasons:

  • It is large enough to show that Weaviate is a highly-concurrent vector search engine and scales well while running thousands of searches across multiple threads.
  • It is small enough to represent a typical production case without inducing high costs.

Based on your throughput requirements, it is very likely that you will run Weaviate on a considerably smaller or larger machine in production.

In the section below, we outline what to expect when altering the configuration or setup parameters.

Experiment Setup​

The selection of datasets is modeled after ann-benchmarks. The same test queries are used to test speed, throughput, and recall. The provided ground truths are used to calculate the recall.

The imports were performed using Weaviate's Python client. The concurrent (multi-threaded) queries were measured using Go. Performance may differ slightly between languages, and you may see different results if you send your queries from another language. For maximum throughput, we recommend using the Go or Java clients.

The complete import and test scripts are available here.

Results

A guide for picking the right dataset

The following results section contains multiple datasets. To get the most out of this benchmark, pick the dataset that most closely reflects your production data, based on the following criteria:

  • SIFT1M - A dataset containing 1 million objects of 128d and using l2 distance metrics. This dataset reflects a common use case with a small number of objects.
  • Glove-25 - While similar in data size to SIFT1M, each vector only has 25 dimensions in this dataset. Because of the smaller vectors, Weaviate can achieve the highest throughput on this dataset. The distance metric used is angular (cosine distance).
  • Deep Image 96 - This dataset contains 10 million objects at 96d, and is therefore about 10 times as large as SIFT1M. The throughput is only slightly lower than that of the SIFT1M. This dataset gives you a good indication of expected speeds and throughputs when datasets grow.
  • GIST 960 - This dataset contains 1 million objects at 960d. It has the lowest throughput of the datasets outlined. It highlights the cost of vector comparisons with a lot of dimensions. Pick this dataset if you run very high-dimensional loads.

For each dataset, there is a highlighted configuration. The highlighted configuration is an opinionated pick about a good recall/latency/throughput trade-off. The highlight sections will give you a good overview of Weaviate's performance with the respective dataset. Below the highlighted configuration, you can find alternative configurations.

SIFT1M (1M 128d vectors, L2 distance)​

Highlighted Configuration​

| Dataset Size | Dimensions | Distance Metric | efConstruction | maxConnections |
|---|---|---|---|---|
| 1.0M | 128 | l2 | 128 | 32 |

| Recall@10 | QPS (Limit 10) | Mean Latency (Limit 10) | p99 Latency (Limit 10) |
|---|---|---|---|
| 98.83% | 8905 | 3.31ms | 4.49ms |

All Results​

QPS vs Recall​

SIFT1M Benchmark results

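| efConstruction | maxConnections | ef | Recall | QPS | QPS/vCore | Mean Latency | p99 Latency | Import time |
|---|---|---|---|---|---|---|---|---|
| 64 | 8 | 64 | 90.91% | 11445 | 381 | 2.59ms | 3.44ms | 186s |
| 512 | 8 | 64 | 95.74% | 11391 | 380 | 2.6ms | 3.4ms | 286s |
| 128 | 16 | 64 | 98.52% | 10443 | 348 | 2.83ms | 3.77ms | 204s |
| 512 | 16 | 64 | 98.69% | 10287 | 343 | 2.87ms | 3.94ms | 314s |
| 128 | 32 | 64 | 98.92% | 9760 | 325 | 3.03ms | 4.15ms | 203s |
| 256 | 32 | 64 | 99.0% | 9462 | 315 | 3.13ms | 4.36ms | 243s |
| 512 | 32 | 64 | 99.22% | 9249 | 308 | 3.2ms | 4.68ms | 351s |
| 512 | 32 | 128 | 99.29% | 7155 | 238 | 4.14ms | 5.84ms | 351s |
| 128 | 32 | 256 | 99.34% | 5694 | 190 | 5.21ms | 6.94ms | 203s |
| 256 | 32 | 512 | 99.37% | 3578 | 119 | 8.27ms | 11.2ms | 243s |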

How to read the results table
  • Choose the desired limit using the tab selector above the table.
    Different use cases require different levels of Queries per Second (QPS) and returned objects per query. For example, at 100 QPS and limit 100, 10,000 objects will be returned per second in total. At 1,000 QPS and limit 10, you will also receive 10,000 objects per second in total: each request contains fewer objects, but you can send more requests in the same timespan. Pick the value that matches your desired limit in production most closely.
  • Pick the desired configuration
    The first three columns represent the different input parameters to configure the HNSW index. These inputs lead to the results shown in columns four through six.
  • Recall/Throughput Trade-Off at a glance
    The highlighted columns (Recall, QPS) reflect the Recall/QPS trade-off. Generally, as the Recall improves, the throughput drops. Pick the row that represents a combination that satisfies your requirements. Since the benchmark is multi-threaded and running on a 30-core machine, the QPS/vCore column shows the throughput per single CPU core. You can use this column to extrapolate what the throughput would be like on a machine of a different size. See also the section below outlining what changes to expect when running on different hardware.
  • Latencies
    Besides the overall throughput, columns seven and eight show the latencies for individual requests. The Mean Latency column shows the mean over all 10,000 test queries. The p99 Latency column shows the latency that 99% of requests stay at or below. In other words, 9,900 out of 10,000 queries will have a latency equal to or lower than the specified number. The difference between mean and p99 helps you get an impression of how stable the request times are in a highly concurrent setup.
  • Import times
    Changing the configuration parameters can also have an effect on the time it takes to import the dataset. This is shown in the last column.
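The reading steps above can be sketched as a small selection routine: filter the rows by your recall and p99 latency requirements, then take the remaining row with the highest QPS. The rows below are copied from the SIFT1M table; the function name `pick_config` and the example thresholds are assumptions made for illustration.

```python
# A subset of rows from the SIFT1M results table above.
SIFT1M_ROWS = [
    {"efConstruction": 64, "maxConnections": 8, "ef": 64, "recall": 90.91, "qps": 11445, "p99_ms": 3.44},
    {"efConstruction": 128, "maxConnections": 32, "ef": 64, "recall": 98.92, "qps": 9760, "p99_ms": 4.15},
    {"efConstruction": 256, "maxConnections": 32, "ef": 64, "recall": 99.0, "qps": 9462, "p99_ms": 4.36},
    {"efConstruction": 512, "maxConnections": 32, "ef": 64, "recall": 99.22, "qps": 9249, "p99_ms": 4.68},
    {"efConstruction": 512, "maxConnections": 32, "ef": 128, "recall": 99.29, "qps": 7155, "p99_ms": 5.84},
]


def pick_config(rows, min_recall, max_p99_ms):
    """Return the highest-throughput row that meets both requirements, or None."""
    candidates = [
        row for row in rows
        if row["recall"] >= min_recall and row["p99_ms"] <= max_p99_ms
    ]
    return max(candidates, key=lambda row: row["qps"]) if candidates else None


def objects_per_second(qps, limit):
    """Total objects returned per second at a given QPS and limit."""
    return qps * limit


# Example requirement (hypothetical): at least 99% recall and a p99 of at most 5ms.
best = pick_config(SIFT1M_ROWS, min_recall=99.0, max_p99_ms=5.0)
print(best)
```

For this example requirement, the routine selects the row with efConstruction 256, maxConnections 32 and ef 64, since it has the highest QPS among the rows that satisfy both thresholds.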

Glove-25 (1.2M 25d vectors, cosine distance)​

Highlighted Configuration​

| Dataset Size | Dimensions | Distance Metric | efConstruction | maxConnections |
|---|---|---|---|---|
| 1.28M | 25 | cosine | 64 | 16 |

| Recall@10 | QPS (Limit 10) | Mean Latency (Limit 10) | p99 Latency (Limit 10) |
|---|---|---|---|
| 95.56% | 15003 | 1.93ms | 2.94ms |

All Results​

QPS vs Recall​

Glove25 Benchmark results

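| efConstruction | maxConnections | ef | Recall | QPS | QPS/vCore | Mean Latency | p99 Latency | Import time |
|---|---|---|---|---|---|---|---|---|
| 128 | 8 | 64 | 94.47% | 19752 | 658 | 1.48ms | 2.49ms | 178s |
| 512 | 8 | 64 | 95.3% | 19704 | 657 | 1.48ms | 2.61ms | 272s |
| 512 | 16 | 64 | 99.26% | 17583 | 586 | 1.65ms | 2.82ms | 308s |
| 256 | 16 | 64 | 99.27% | 17177 | 573 | 1.7ms | 2.67ms | 232s |
| 128 | 32 | 64 | 99.72% | 15443 | 515 | 1.9ms | 2.83ms | 184s |
| 256 | 32 | 64 | 99.84% | 15187 | 506 | 1.93ms | 2.82ms | 261s |
| 512 | 32 | 64 | 99.89% | 14401 | 480 | 2.04ms | 2.93ms | 354s |
| 256 | 64 | 64 | 99.9% | 13490 | 450 | 2.17ms | 3.17ms | 276s |
| 512 | 64 | 64 | 99.96% | 12626 | 421 | 2.32ms | 3.37ms | 388s |
| 512 | 64 | 128 | 99.98% | 8665 | 289 | 3.38ms | 4.82ms | 388s |
| 256 | 32 | 256 | 99.98% | 7191 | 240 | 4.07ms | 5.77ms | 261s |
| 128 | 64 | 256 | 99.99% | 6958 | 232 | 4.18ms | 6.17ms | 195s |
| 512 | 32 | 256 | 100.0% | 6694 | 223 | 4.39ms | 5.99ms | 354s |
| 128 | 32 | 512 | 100.0% | 4568 | 152 | 6.4ms | 9.27ms | 184s |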

Deep Image 96 (9.99M 96d vectors, cosine distance)​

Highlighted Configuration​

| Dataset Size | Dimensions | Distance Metric | efConstruction | maxConnections |
|---|---|---|---|---|
| 9.99M | 96 | cosine | 128 | 32 |

| Recall@10 | QPS (Limit 10) | Mean Latency (Limit 10) | p99 Latency (Limit 10) |
|---|---|---|---|
| 96.43% | 6112 | 4.7ms | 15.87ms |

All Results​

QPS vs Recall​

Deep Image 96 Benchmark results

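| efConstruction | maxConnections | ef | Recall | QPS | QPS/vCore | Mean Latency | p99 Latency | Import time |
|---|---|---|---|---|---|---|---|---|
| 64 | 16 | 64 | 94.44% | 9301 | 310 | 3.14ms | 7.21ms | 3305s |
| 128 | 16 | 64 | 96.06% | 8957 | 299 | 3.28ms | 7.24ms | 3804s |
| 64 | 64 | 64 | 96.84% | 8760 | 292 | 3.36ms | 6.97ms | 3253s |
| 128 | 32 | 64 | 97.88% | 8473 | 282 | 3.48ms | 7.4ms | 3533s |
| 128 | 64 | 64 | 98.27% | 7984 | 266 | 3.66ms | 7.52ms | 3631s |
| 256 | 32 | 64 | 98.78% | 7916 | 264 | 3.71ms | 7.83ms | 4295s |
| 512 | 32 | 64 | 98.95% | 7876 | 263 | 3.73ms | 7.47ms | 5477s |
| 256 | 64 | 64 | 99.06% | 7839 | 261 | 3.75ms | 7.21ms | 4392s |
| 512 | 64 | 64 | 99.32% | 7238 | 241 | 4.05ms | 7.67ms | 6039s |
| 256 | 64 | 128 | 99.42% | 5767 | 192 | 5.1ms | 8.39ms | 4392s |
| 512 | 64 | 128 | 99.52% | 5509 | 184 | 5.34ms | 8.7ms | 6039s |
| 256 | 32 | 256 | 99.66% | 4672 | 156 | 6.32ms | 10.11ms | 4295s |
| 512 | 32 | 256 | 99.82% | 4467 | 149 | 6.62ms | 10.29ms | 5477s |
| 512 | 64 | 256 | 99.9% | 3683 | 123 | 7.97ms | 12.72ms | 6039s |
| 512 | 32 | 512 | 99.94% | 2842 | 95 | 10.37ms | 15.25ms | 5477s |
| 512 | 64 | 512 | 99.95% | 2288 | 76 | 12.84ms | 20.72ms | 6039s |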

GIST 960 (1.0M 960d vectors, cosine distance)​

Highlighted Configuration​

| Dataset Size | Dimensions | Distance Metric | efConstruction | maxConnections |
|---|---|---|---|---|
| 1.00M | 960 | cosine | 512 | 32 |

| Recall@10 | QPS (Limit 10) | Mean Latency (Limit 10) | p99 Latency (Limit 10) |
|---|---|---|---|
| 94.14% | 1935 | 15.05ms | 19.86ms |

All Results​

QPS vs Recall​

GIST 960 Benchmark results

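| efConstruction | maxConnections | ef | Recall | QPS | QPS/vCore | Mean Latency | p99 Latency | Import time |
|---|---|---|---|---|---|---|---|---|
| 64 | 8 | 64 | 66.6% | 2759 | 92 | 10.59ms | 13.77ms | 1832s |
| 128 | 8 | 64 | 70.7% | 2734 | 91 | 10.7ms | 13.97ms | 1861s |
| 512 | 8 | 64 | 75.0% | 2724 | 91 | 10.78ms | 14.87ms | 2065s |
| 64 | 16 | 64 | 79.8% | 2618 | 87 | 11.04ms | 14.69ms | 1838s |
| 128 | 16 | 64 | 83.9% | 2577 | 86 | 11.21ms | 15.55ms | 1904s |
| 256 | 16 | 64 | 87.1% | 2518 | 84 | 11.54ms | 14.49ms | 2016s |
| 128 | 32 | 64 | 89.6% | 2425 | 81 | 11.85ms | 15.37ms | 1931s |
| 256 | 32 | 64 | 92.6% | 2388 | 80 | 12.09ms | 15.99ms | 2074s |
| 256 | 64 | 64 | 94.1% | 2207 | 74 | 13.08ms | 18.56ms | 2130s |
| 512 | 32 | 64 | 94.6% | 2073 | 69 | 14.11ms | 17.37ms | 2361s |
| 512 | 32 | 128 | 96.2% | 1985 | 66 | 14.67ms | 19.32ms | 2361s |
| 512 | 64 | 64 | 96.2% | 1951 | 65 | 14.7ms | 19.61ms | 2457s |
| 512 | 16 | 256 | 96.2% | 1839 | 61 | 15.9ms | 19.84ms | 2217s |
| 512 | 64 | 128 | 96.7% | 1603 | 53 | 18.06ms | 24.44ms | 2457s |
| 512 | 32 | 256 | 98.7% | 1514 | 50 | 19.16ms | 24.43ms | 2361s |
| 512 | 32 | 512 | 99.1% | 999 | 33 | 29.12ms | 38.89ms | 2361s |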

Learn more & FAQ

What is the difference between latency and throughput?​

The latency refers to the time it takes to complete a single request. This is typically measured by taking a mean or percentile distribution of all requests. For example, a mean latency of 5ms means that a single request takes on average 5ms to complete. This does not say anything about how many queries can be answered in a given timeframe.

If Weaviate were single-threaded, the throughput per second would be roughly equal to 1s divided by the mean latency. For example, with a mean latency of 5ms, 200 requests could be answered in a second.

However, in reality, you often don't have a single user sending one query after another. Instead, you have multiple users sending queries. This makes the querying-side concurrent. Similarly, Weaviate can handle concurrent incoming requests. We can identify how many concurrent requests can be served by measuring the throughput.

We can take our single-thread calculation from before and multiply it with the number of server CPU cores. This will give us a rough estimate of what the server can handle concurrently. However, it would be best never to trust this calculation alone and continuously measure the actual throughput. This is because such scaling may not always be linear. For example, there may be synchronization mechanisms used to make concurrent access safe, such as locks. Not only do these mechanisms have a cost themselves, but if implemented incorrectly, they can also lead to congestion which would further decrease the concurrent throughput. As a result, you cannot perform a single-threaded benchmark and extrapolate what the numbers would be like in a multi-threaded setting.
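The back-of-the-envelope calculation described above can be written out as follows. This is only the naive estimate, assuming perfectly linear scaling across cores; the function names are made up for this example, and the result is an upper-bound guess rather than a measurement.

```python
def single_thread_qps(mean_latency_ms):
    """Requests per second if requests were served strictly one after another."""
    return 1000.0 / mean_latency_ms


def naive_multicore_qps(mean_latency_ms, cores):
    """Naive estimate: single-threaded throughput multiplied by the core count.

    Assumes perfectly linear scaling, which locks and other shared
    resources usually prevent in practice.
    """
    return single_thread_qps(mean_latency_ms) * cores


print(single_thread_qps(5))        # 200.0
print(naive_multicore_qps(5, 30))  # 6000.0
```

Treat the second number as a hypothesis to verify with a real multi-threaded measurement, for the reasons given above.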

All throughput numbers ("QPS") outlined in this benchmark are actual multi-threaded measurements on a 30-core machine, not estimations.

What is a p99 latency?​

The mean latency gives you an average value of all requests measured. This is a good indication of how long a user will have to wait on average for their request to be completed. Based on this mean value alone, however, you cannot make any promises to your users about wait times. 90 out of 100 users might see a considerably better time, but the remaining 10 might see a significantly worse time.

To give a more precise indication, percentile-based latencies are used. A 99th-percentile latency - or "p99 latency" for short - indicates the slowest request that 99% of requests experience. In other words, 99% of your users will experience a time equal to or better than the stated value. This is a much better guarantee than a mean value.

In production settings, requirements - as stated in SLAs - are often a combination of throughput and a percentile latency. For example, the statement "3000 QPS at p95 latency of 20ms" conveys the following meaning.

  • 3000 requests need to be successfully completed per second
  • 95% of users must see a latency of 20ms or lower.
  • There is no assumption about the remaining 5% of users, implicitly tolerating that they will experience higher latencies than 20ms.
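A requirement of this shape can be checked against measured data with a few lines of code. The sketch below assumes you have the individual request latencies and the wall-clock duration of the run; the function name `meets_sla` and the nearest-rank percentile method are choices made for this example.

```python
def meets_sla(latencies_ms, duration_s, min_qps, pct, max_latency_ms):
    """Check a combined throughput and percentile-latency requirement.

    latencies_ms: latency of every completed request in the run
    duration_s:   wall-clock duration of the run in seconds
    pct:          integer percentile, e.g. 95 for p95
    """
    achieved_qps = len(latencies_ms) / duration_s
    ordered = sorted(latencies_ms)
    rank = (pct * len(ordered) + 99) // 100  # nearest-rank, integer ceiling
    percentile_value = ordered[max(rank, 1) - 1]
    return achieved_qps >= min_qps and percentile_value <= max_latency_ms


# Synthetic run: 95 fast requests and 5 slow ones, completed in 0.03s overall.
run = [10.0] * 95 + [50.0] * 5
print(meets_sla(run, 0.03, min_qps=3000, pct=95, max_latency_ms=20))  # True
print(meets_sla(run, 0.03, min_qps=3000, pct=99, max_latency_ms=20))  # False
```

The same run passes at p95 and fails at p99, because the five slow requests fall inside the 99th percentile but outside the 95th.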

The higher the percentile (e.g. p99 over p95) the "safer" the quoted latency becomes. We have thus decided to use p99-latencies instead of p95-latencies in our measurements.

What happens if I run with fewer or more CPU cores than on the example test machine?​

The benchmark outlines a QPS per core measurement. This can help you make a rough estimation of how the throughput would vary on smaller or larger machines. If you do not need the stated throughput, you can run with fewer CPU cores. If you need more throughput, you can run with more CPU cores.

Please note that there is a point of diminishing returns with adding more CPUs because of synchronization mechanisms, disk, and memory bottlenecks. Beyond that point, you can scale horizontally instead of vertically. Horizontal scaling with replication will be available in Weaviate soon.
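As a rough sketch of this extrapolation, the snippet below scales a QPS/vCore value from the results tables to a different machine size. It assumes linear scaling, which holds only approximately and only up to the point of diminishing returns mentioned above; the function names are hypothetical.

```python
import math


def estimate_qps(qps_per_vcore, vcores):
    """Rough throughput estimate for a machine with the given number of vCPUs."""
    return qps_per_vcore * vcores


def vcores_needed(target_qps, qps_per_vcore):
    """Smallest whole number of vCPUs whose estimated throughput reaches the target."""
    return math.ceil(target_qps / qps_per_vcore)


# SIFT1M, efConstruction 128 / maxConnections 32 / ef 64: 325 QPS per vCore.
print(estimate_qps(325, 8))      # 2600
print(vcores_needed(5000, 325))  # 16
```

Use these numbers as a starting point for sizing, then confirm them with a measurement on the actual hardware.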

What are ef, efConstruction, and maxConnections?​

These parameters refer to the HNSW build and query parameters. They represent a trade-off between recall, latency & throughput, index size, and memory consumption. This trade-off is highlighted in the benchmark results.
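For orientation, this is roughly where the three parameters appear in a Weaviate class definition, shown here as a Python dictionary. The class name is hypothetical, the values mirror the highlighted SIFT1M configuration with ef set to 64, and the exact field names and accepted values should be checked against the documentation for your Weaviate version.

```python
import json

# Hypothetical class definition; "Sift1M" is a made-up class name.
sift_class = {
    "class": "Sift1M",
    "vectorizer": "none",           # vectors are supplied by the client
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {
        "efConstruction": 128,      # build-time search quality
        "maxConnections": 32,       # max outgoing edges per node in the graph
        "ef": 64,                   # query-time search quality
        "distance": "l2-squared",   # assumed metric setting for an l2 dataset
    },
}

print(json.dumps(sift_class, indent=2))
```

efConstruction and maxConnections shape the index while it is being built, so they need to be chosen before importing, whereas ef only affects queries.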

I can't match the same latencies/throughput in my own setup, how can I debug this?​

If you are seeing different numbers in your own setup, here are a few things to check:

  • What CPU architecture are you using? The benchmarks above were run on a GCP c2 CPU type, which is based on amd64 architecture. Weaviate also supports arm64 architecture, but not all optimizations are present. If your machine shows maximum CPU usage but you cannot achieve the same throughput, consider switching the CPU type to the one used in this benchmark.

  • Are you using an actual dataset or random vectors? HNSW is known to perform considerably worse with random vectors than with real-world datasets. This is due to the distribution of points in real-world datasets compared to randomly generated vectors. If you cannot achieve the performance (or recall) outlined above with random vectors, switch to an actual dataset.

  • Are your disks fast enough? While the ANN search itself is CPU-bound, the objects must be read from disk after the search has been completed. Weaviate uses memory-mapped files to speed this process up. However, if not enough memory is present or the operating system has allocated the cached pages elsewhere, a physical disk read needs to occur. If your disk is slow, it could then be that your benchmark is bottlenecked by those disks.

  • Are you using more than 2 million vectors? If yes, make sure to set the vector cache large enough for maximum performance.
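For the last hint, the sketch below shows one way to reason about the vector cache. It estimates the raw size of the vectors, assuming 4-byte float32 values and ignoring any index or object overhead, and shows a `vectorIndexConfig` fragment with the `vectorCacheMaxObjects` setting. The chosen value is an example for the Deep Image 96 dataset, not a recommendation.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 values (4 bytes each)."""
    return num_vectors * dimensions * bytes_per_value


# Deep Image 96: 9.99M vectors with 96 dimensions.
size_bytes = raw_vector_bytes(9_990_000, 96)
print(f"{size_bytes / 1e9:.2f} GB of raw vector data")

# Example fragment: allow the cache to hold every vector of the dataset.
vector_index_config = {
    "vectorCacheMaxObjects": 10_000_000,  # example value, at least the object count
}
```

If the machine does not have enough memory for a cache of this size, expect some vector reads to go to disk, which ties in with the disk-speed hint above.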

Where can I find the scripts to run this benchmark myself?​

The repository is located here.