The AI-First Database Ecosystem

Bob van Luijt

A new ecosystem of smaller companies is ushering in a “third wave” of AI-first database technology. New search engines and databases brilliantly answer queries posed in natural language, but their machine-learning models are not limited to text searches. The same approach can also be used to search anything from images to DNA.

Much of the software involved is open source, so it functions transparently and users can customize it to meet their specific needs. Clients can retain control of their data, keeping it safely behind their own firewalls.

How We Got Here

First-wave database technology is often referred to by the acronym SQL (Structured Query Language), the ubiquitous query language used to manage relational databases, which are conceptually similar to spreadsheets or tables. Starting in the 1980s, this technology was dominated by companies like Oracle and Microsoft.

The second wave of databases is called “NoSQL”. These are the domain of companies like MongoDB. They store data in different ways (for example, as key-value stores, document stores, wide-column stores and graph databases), but what they all have in common is that they are not relational tables. First- and second-wave databases each have their strengths. For example, some are very good at finding every instance of a certain value in a database, and others are very good at storing time series.

The third wave of database technologies focuses on data that is first processed by a machine learning model. Instead of relying on traditional methods alone, AI models help process, store and search through the data.

To better understand the concept, think of a supermarket with 50,000 items. Items on display are not organized alphabetically or by price, the way you’d expect a structured, digital system to do it; they’re placed in context. You find things in the supermarket by understanding how they relate to each other. So if the store gets a new product—say, guavas—you know to look near the apples and bananas, not near garbage bags or other things that happen to also cost $1.98/lb.

A key early milestone in the third wave came in 2015, when Google changed its search algorithm from one based on page rankings to one based on a machine learning model that it dubbed RankBrain. Before then, Google’s search engine was essentially a high-powered keyword search that ranked websites by the number of other sites that linked back to them. In effect, Google entrusted rankings to the collective users of the Internet.

That “wisdom of the crowd” approach worked, but to improve the quality of the results it returned, Google needed RankBrain to “understand” the text it searched through. So, it used machine learning to vectorize the text on sites and in links, that is, to convert it into lists of numbers (vectors), which is what happens inside machine learning models such as transformers.

Returning to the grocery store for a moment: a grocery store is a three-dimensional space, but every significant word in unstructured text needs to be related to hundreds of other words that it is frequently associated with. So, machine learning systems place text in a high-dimensional vector space, an imaginary space with hundreds or even thousands of dimensions. For any given item in a database, the vector of its coordinates in that space forms what’s known as a “representation” (or embedding) of the item.

Because it conveys both content and context, such a representation presents a more complete and nuanced picture of the data. The challenge comes from searching through so many dimensions. Initially, this was done with a brute-force approach: comparing the query against the vector of every entry. That approach didn’t scale, because the cost of each query grows with the size of the database.
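To make the brute-force approach concrete, here is a minimal sketch in Python. The three-dimensional vectors and the function names are invented for illustration; real representations come from a machine learning model and have hundreds of dimensions. Cosine similarity is assumed as the measure of closeness, which is a common choice but not the only one.

```python
import math

# Toy vectors, made up for this example (not produced by a real model).
ITEMS = {
    "apple":        [0.9, 0.1, 0.0],
    "banana":       [0.8, 0.2, 0.1],
    "garbage bags": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, items, k=2):
    """Score the query against every stored vector, then keep the top k.

    Every query touches every entry, so the cost grows linearly with the
    number of items (and with the number of dimensions).
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

guava = [0.85, 0.15, 0.05]  # hypothetical vector for the new product
neighbors = brute_force_search(guava, ITEMS)
print(neighbors)  # ['apple', 'banana']
```

With three items this is instant; with millions of items and hundreds of dimensions, scanning every entry for every query becomes the bottleneck.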

One breakthrough that helped third-wave search engines to scale was an approach called “approximate nearest neighbor” (ANN) search. If we make one final trip to the supermarket to understand what a guava is, we could look at the things around it—other fruit. Somewhat further away we might find guava juice or tinned guavas, but there’s really no reason to look four aisles over for guava-flavored cat food.
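The "only look in the nearby aisles" idea can be sketched as a partitioned index: each vector is assigned to its closest "aisle" centroid, and a query only scans the aisle it falls into. This is one simple family of ANN techniques, shown here under invented assumptions (hand-picked centroids, toy two-dimensional vectors, hypothetical function names); production engines often use more sophisticated structures, such as graph-based indexes.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical aisle centroids and product vectors, made up for illustration.
AISLES = {"fruit": [1.0, 0.0], "household": [0.0, 1.0]}
PRODUCTS = {
    "apple":        [0.9, 0.1],
    "banana":       [0.8, 0.2],
    "garbage bags": [0.1, 0.9],
    "sponges":      [0.2, 0.8],
}

def nearest_aisle(vec):
    """Pick the aisle whose centroid is most similar to the vector."""
    return max(AISLES, key=lambda name: cosine(vec, AISLES[name]))

def build_index(products):
    """Group products by their nearest aisle (done once, up front)."""
    index = defaultdict(dict)
    for name, vec in products.items():
        index[nearest_aisle(vec)][name] = vec
    return index

def ann_search(index, query, k=2):
    """Scan only the query's own aisle instead of the whole store.

    The result is approximate: a true neighbor that was assigned to a
    different aisle would be missed.
    """
    candidates = index[nearest_aisle(query)]
    ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)
    return ranked[:k]

index = build_index(PRODUCTS)
guava = [0.85, 0.15]  # hypothetical query vector
result = ann_search(index, guava)
print(result)
```

The trade-off is the one the name suggests: far fewer comparisons per query, in exchange for a small chance of missing a neighbor that sits just across a partition boundary.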

Using ANN allowed high-dimensional searches to return results with near-perfect accuracy in milliseconds instead of hours. To be practical, vector databases also needed something prosaically called CRUD support. That stands for “create, read, update and delete,” and solving that technical challenge meant that the complex process of indexing the database could be done once and then updated incrementally, rather than being repeated from scratch whenever the database was updated.
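As a rough illustration of what CRUD support means for an index, the sketch below keeps a partitioned index up to date one item at a time, so nothing is rebuilt from scratch on a change. The class name, its methods and the fixed centroids are hypothetical and do not reflect any particular product's API; the dot product is assumed as the similarity score.

```python
class TinyVectorStore:
    """Hypothetical in-memory store with incremental index maintenance."""

    def __init__(self, centroids):
        self.centroids = centroids                        # partition name -> centroid
        self.partitions = {name: {} for name in centroids}
        self.location = {}                                # item id -> partition name

    def _assign(self, vec):
        # Choose the partition whose centroid has the largest dot product.
        return max(
            self.centroids,
            key=lambda name: sum(x * y for x, y in zip(vec, self.centroids[name])),
        )

    def create(self, item_id, vec):
        part = self._assign(vec)
        self.partitions[part][item_id] = vec
        self.location[item_id] = part

    def read(self, item_id):
        return self.partitions[self.location[item_id]][item_id]

    def update(self, item_id, vec):
        # Only the affected item moves; every other entry stays where it is.
        self.delete(item_id)
        self.create(item_id, vec)

    def delete(self, item_id):
        part = self.location.pop(item_id)
        del self.partitions[part][item_id]

store = TinyVectorStore({"fruit": [1.0, 0.0], "household": [0.0, 1.0]})
store.create("guava", [0.85, 0.15])
first_home = store.location["guava"]      # partition chosen on creation
store.update("guava", [0.1, 0.9])         # re-indexes just this one item
second_home = store.location["guava"]     # partition after the update
```

Each operation touches a single entry and its partition, which is the property that lets an index be built once and then maintained as data changes.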

This leads to the simplest definition of a third-wave database: a vector database stores data indexed by machine learning models. Some types of vector database (e.g., vector search engines) allow users to search through these vectorized datasets, while others (e.g., feature stores) allow users to store vectors at large scale for later use.

We’re Awash in Unstructured Data

We’re living in a time of massive data accumulation, and much of it, if not most, is unstructured: text, photos, video and audio files, as well as other things such as genetic information. Vector search is particularly good at extracting value from such data.

Tech giants like Google, AWS and Microsoft Azure offer vector search capabilities to customers willing to upload their data. But there’s now an ecosystem of newer companies with AI-first (often open-source) solutions and vector-search capabilities that customers can run on a SaaS basis or on their own systems.

The AI-First Database Ecosystem

The companies that make up this ecosystem provide specialized services that overlap to various degrees. Together, they fall into four sub-groups:

  1. Embedding providers (e.g., Hugging Face or OpenAI)
  2. Neural frameworks (e.g., deepset or Jina)
  3. Feature stores (e.g., FeatureBase, FeatureForm or Tecton)
  4. Vector search engines (e.g., Weaviate or Vertex)

As the amount of data that companies collect in their data warehouses keeps growing, so does the need for better, more efficient search. The more data we collect, the more complex searching through it becomes. Thanks to the advances in machine learning over the past decade and the commoditization of AI-first database technologies, you can start using them in your business tomorrow.