With the rising popularity of machine learning models, the demand for vector similarity search solutions has also increased dramatically. Machine learning models typically output vectors and common search queries involve finding the closest set of related vectors. For example, for a text-based vector search the search query "landmarks in Paris" would be encoded to a vector, it is then the job of the vector database to find the documents with the vector closest to this query. This might be a document with the title "Eiffel Tower" whose vector matched the search vector most closely.
Such a vector similarity comparison is computationally trivial if there are very few, say less than 10,000 objects to be searched through. That is not a very realistic situation, however. The amount of unstructured data that companies and users need to search through is massive. The number of objects is often in the high millions or even billions of data points. A primitive brute-force comparison can no longer return results in an acceptable time frame.
This led to the rise of Approximate Nearest Neighbor (ANN) models. You might have heard of Spotify's Annoy, Facebook's faiss, or Google's ScaNN. What they all have in common is that they make a conscious trade-off between accuracy (precision, recall) and retrieval speed. This has enabled users to retrieve the 5, 10, or 100 closest vectors within a few milliseconds — even out of a billion objects.
However, there is yet another trade-off engineers have to make when using ANN models. Many of the common ANN libraries fall short of some of the features we are used to when working with traditional databases and search engines. Imagine you were using a MySQL database, but after you imported your data it would be read-only for the rest of time. That's not how it works, right? So why should it work like this for vector searching? In this article, I'm going to introduce you to Weaviate, a vector database that removes many of the limitations imposed by ANN libraries.
In this article we will cover:
- how ANN models enable fast & large-scale vector searches
- where popular ANN libraries fall short
- what Weaviate is and how it can bring your vector search needs to production
- a glimpse of how Weaviate works under the hood
What is Weaviate?
Before we dive deeper into how Weaviate removes the limitations you might expect from existing ANN solutions, let's quickly take a look at what Weaviate is. Weaviate is a cloud-native, modular, real-time vector database built to scale your machine learning models. Oh, it's also open-source, by the way. Because of its modularity, Weaviate can cover a wide variety of bases. By default, Weaviate is agnostic of how you came up with your vectors. This means teams with experience in data science and machine learning can simply keep using their finely-tuned ML models and import their data objects alongside their existing vector positions. At the same time, Weaviate comes with optional modules for text, images, and other media types. Those modules can do the vectorization for you. Thus, if you are new to the world of machine learning you can import your objects just as you would with a traditional database and let Weaviate handle the vectorization. This brings the benefits of vector searching to the masses — even without an ML background. The ability to combine modules also allows for relations between e.g. a text object and a corresponding image object. As this article will show, Weaviate is also not restricted by some of the limitations of popular ANN libraries.
When do you call a database a database?
If you were to ask one hundred engineers what defines a database you will most likely get as many different answers. We might take the following points for granted in the world of SQL and NoSQL databases, yet they are anything but ubiquitous in the context of vector searches.
- CRUD We are used to databases allowing us to create, read, update and delete objects.
- Real-time (or near real-time) Objects imported into a database are typically queryable within a very short period. Some databases have refresh intervals and some databases make use of eventual consistency. Both can lead to a slight delay, but in general, objects are present within seconds as opposed to minutes or hours.
- Mutability Databases are generally mutable. This does not only refer to individual objects (as is already covered with CRUD abilities), but also to the entire database or a collection therein, such as a table in a SQL database. We are used to adding another row as we please.
- Persistence We generally expect database writes to be persistent. This does not mean that an in-memory database is not a database. But it does mean that when a database uses a disk, we do not expect writes to get lost — for example, on an unexpected restart.
- Consistency. Resiliency and Scaling Some databases have atomic transactions, others don't support consistency at all. Some databases scale really well in the cloud, others work best as big, single machines. The exact characteristics do not make a database a database. But the fact that the engineers who built it have decided on a specific strategy is a key trait of a database.
Popular ANN libraries fall short of many of the above principles
The biggest downside of ANN libraries is with regards to their real-time capabilities and mutability. Libraries, such as Annoy or ScaNN require you to import all of your data objects upfront. This is then followed by a build period to build up the index. After that's done, it's a read-only model. This brings two major disadvantages with regards to the points we outlined before: First up, if you import your first object it is not yet queryable until all objects are imported. If you import a billion objects this can be a very considerable time. They are therefore not real-time. Furthermore, once you have built your index, you cannot alter the index anymore without rebuilding it from scratch. Thus, operations, such as updating or deleting are impossible and the search model cannot be considered mutable — not even for further inserts.
The persistence strategy of the above models is typically snapshotting. So while you can save the index to disk and also reload it from disk, an individual write is typically not persisted. As Annoy, ScaNN and the like are not applications, but libraries, this also means that coming up with resiliency and scaling strategy is left to the developer. It is certainly not difficult to scale a read-only model. However, the feature set of simply duplicating your model does not compare to scaling popular distributed databases, such as Cassandra.
Weaviate overcomes the limitations of ANN libraries
One of the goals in designing Weaviate was to combine the speed and large-scale capabilities of ANN models with all the functionality we enjoy about databases. Thus, any object imported into Weaviate can immediately be queried — whether through a lookup by its id, a keyword search using the inverted index, or a vector search. This makes Weaviate a real-time vector database. And because Weaviate also uses an ANN model under the hood, the vector search will be as fast as with a vector library.
Additionally, objects can be updated or deleted at will and new objects can be added at any time — even while querying. This means Weaviate supports both full CRUD capabilities, as well as mutability of its indices. Last but not least, every single write is persisted, if your machine crashes or Weaviate is otherwise forced to interrupt its operations, it will simply continue where you left off after restarting. There is no scenario where you import for hours on end only to lose all progress due to a crash at the last minute.
How can Weaviate achieve all of this?
Weaviate is built around the idea of modularity. This also boils down to the ANN vector index models that Weaviate supports. At the time of writing this article in early 2021, the first vector index type that's supported is HNSW. By choosing this particular type, one of the limitations is already overcome: HNSW supports querying while inserting. This is a good basis for mutability, but it's not all.
Existing HNSW libraries fall short of full CRUD-support. Updating is not possible at all and deleting is only mimicked by marking an object as deleted without cleaning it up. Furthermore, the most popular library hnswlib only supports snapshotting, but not individual writes to disk.
To get to where Weaviate is today, a custom HNSW implementation was needed. It follows the same principles as outlined in this paper but extends it with more features. Each write is added to a write-ahead log. Additionally, since inserts into HNSW are not mutable by default, Weaviate internally assigns an immutable document ID that allows for updates. If an object is altered, Weaviate deletes the old doc id under the hood, assigns a new one, and reimports the object into the vector index. Last, but not least, a tombstoning approach — inspired by Cassandra — is used to handle deletes. Any incoming delete — whether an explicit delete or an implicit delete through an update — leads to marking a document ID as deleted ("attaching a tombstone"). It is therefore immediately hidden on future query results. Then — and this is where Weaviate's custom HNSW implementation differs from hnswlib — an asynchronous cleanup process rebuilds the affected parts of the index and removes the tombstoned elements for good. This keeps the index fresh at all times while saving a lot of computing resources due to a well-tuned bulk cleanup process that is considerably faster than individual cleanups would be.
It is the combination of the above that makes Weaviate the perfect solution for handling your vector search needs in production.
Choose what's right for your use case — and try it out yourself!
Now that you've learned about some of the limitations imposed by popular ANN libraries and how Weaviate helps you overcome them, there is just one question left to answer: When should you choose which? If you know for certain that you will never need to update your data, will never add a new data point, and don't need real-time abilities, a library like the ones mentioned above will be a good solution. But if you want to update your data, import additional objects even while queries are occurring, and not sacrifice real-time or persistence, then you should take a look at Weaviate.
Check out the quickstart guide in the documentation or learn more about Weaviate here. If you like what you've read and would consider using Weaviate in the future, feel free to leave a star on GitHub.
Check out the Getting Started with Weaviate and begin building amazing apps with Weaviate.
You can reach out to us on Slack or Twitter.
Weaviate is open source, you can follow the project on GitHub. Don't forget to give us a ⭐️ while you are there.