Data structure in Weaviate

Overview


You've seen some of the powerful queries that Weaviate enables. But how does Weaviate actually store data so that it can support these queries?

In this section, we'll take a look at some of the key components that allow Weaviate to perform these queries at speed. In particular, we'll take a look at indexes, which are the backbone of Weaviate's data structure, and the schema, which acts as a blueprint for your data.

Indexes

An index is a data structure that allows for efficient retrieval of data. In Weaviate, there are two main indexes: the inverted index and the vector index.

The inverted index is the kind of index that you may be familiar with. You can think of it as a reference table that, for example, lets you quickly look up a term and find the objects that contain it.

The vector index allows for efficient retrieval of vectors based on similarity. This is the index that allows Weaviate to perform vector searches fast. Let's dig in a little more.

Inverted index

An inverted index deconstructs text into a set of constituent terms and stores them in a data structure, such as a hash table. Take, for example, a data object containing the text "Conference (Chicago, IL)".

The user might want to search for this object based on any of the contained terms such as "conference", "Chicago", or "IL". The inverted index allows Weaviate to quickly retrieve the ID of the object containing the term.

This is done by mapping each "token" to the IDs of the objects that contain it, where a token is a term that has been extracted from the text. By default, Weaviate uses word tokenization, where only alphanumeric characters are kept, converted to lowercase, and then split into tokens based on whitespace.

So the input text "Conference (Chicago, IL)" is indexed by three tokens: conference, chicago, and il.

We will cover the other available tokenization methods later on.
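The tokenization and lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Weaviate's implementation: the function names and object IDs are hypothetical, the regular expression handles ASCII characters only, and non-alphanumeric characters are replaced by spaces before splitting.

```python
import re
from collections import defaultdict


def word_tokenize(text: str) -> list[str]:
    """Approximate 'word' tokenization (illustrative, ASCII only)."""
    # Replace every run of non-alphanumeric characters with a space,
    # lowercase the result, then split on whitespace.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", " ", text)
    return cleaned.lower().split()


def build_inverted_index(objects: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of object IDs whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for object_id, text in objects.items():
        for token in word_tokenize(text):
            index[token].add(object_id)
    return index


# Hypothetical object IDs and texts, for illustration only.
objects = {
    "obj-1": "Conference (Chicago, IL)",
    "obj-2": "Chicago marathon",
}
index = build_inverted_index(objects)

print(word_tokenize("Conference (Chicago, IL)"))
print(sorted(index["chicago"]))
```

Looking up a term is then a single dictionary access, which is why the index can return matching object IDs without scanning every object's text.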

Vector index

Each object in Weaviate can be associated with a vector. These vectors are what enable the similarity searches that you have seen before. As we mentioned, however, brute-force similarity searches are computationally expensive, and their cost grows linearly with the size of the dataset.
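To make the cost of a brute-force search concrete, here is a minimal sketch in plain Python. It compares the query against every stored vector, so the work grows in proportion to the number of vectors. The function names, object IDs, and vectors are made up for illustration, and cosine distance is assumed as the metric.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(
    query: list[float], vectors: dict[str, list[float]], k: int = 2
) -> list[str]:
    """Score every stored vector against the query and return the k closest IDs."""
    # One distance computation per stored vector: cost is linear in len(vectors).
    scored = [(cosine_distance(query, vec), object_id)
              for object_id, vec in vectors.items()]
    scored.sort()
    return [object_id for _, object_id in scored[:k]]


# Hypothetical two-dimensional vectors, for illustration only.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
nearest = brute_force_search([1.0, 0.0], vectors, k=2)
print(nearest)
```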

To tackle this problem, Weaviate uses vector indexes that utilize an Approximate Nearest Neighbor (ANN) algorithm. The ANN algorithm enables each vector index to organize a set of vectors so that those similar to a query can be retrieved quickly, without comparing the query against every stored vector. Weaviate currently uses an HNSW-based ANN index.

Each set of vectors is said to reside in a "vector space", indicating that it is a multi-dimensional "space" in which vectors are placed.

Classes

What is a class?

A class in Weaviate is a collection of objects of the same type. Each object in Weaviate must belong to a class, and one class only.

Imagine that you are storing a set of quiz items from the game show Jeopardy! in Weaviate. A good way to structure it would be to have an object represent a question including all associated attributes, such as the answer, what round it was from, how many points it was worth, when it aired on TV, and so on.

So, a good way to represent this data would be through a class called JeopardyQuestion, which would contain a set of objects, each object representing one such question.

Class names are singular by convention

This is because they refer to individual objects, e.g. a JeopardyQuestion object.

What is in a class?

As we mentioned, each Jeopardy! question would contain multiple related, but distinct, attributes such as the question, answer, round, points, and so on. These are reflected in each class object in Weaviate as a set of properties, such as a question property, an answer property, and so on.
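As a rough picture of what one such object might look like, the sketch below models a JeopardyQuestion with the properties named above. The dataclass, the example values, and the dictionary layout are assumptions made for illustration. They are not taken from a real dataset or from Weaviate's client API.

```python
from dataclasses import asdict, dataclass


@dataclass
class JeopardyQuestion:
    """Illustrative stand-in for one object of the JeopardyQuestion class."""
    question: str
    answer: str
    round: str
    points: int


# Made-up example values, for illustration only.
item = JeopardyQuestion(
    question="This planet is known as the Red Planet",
    answer="Mars",
    round="Jeopardy!",
    points=200,
)

# One object belongs to exactly one class and carries a set of properties.
data_object = {"class": "JeopardyQuestion", "properties": asdict(item)}
print(data_object)
```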

How many vectors per object?

Each object is represented by one vector, and each class has one vector index. This means that all objects in the class will be associated with the same vector index.

In other words, all objects in the class will be stored in what is called the same vector space. This is important to keep in mind when designing your data schema in Weaviate. A vector search can only be performed within a single vector space, because vectors of different lengths, or even vectors of the same length but with different meanings, cannot be compared.

Going back to the color analogy that you saw earlier: you wouldn't be able to compare an RGB value to a CMYK value, right? The same applies to vector embeddings that represent text.

So in Weaviate, a vector search can only search one class at a time. As a result, it is important to design your schema such that objects that you want to search together are in the same class.
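The length mismatch can be shown with a small sketch. The function name and the sample values are hypothetical. Note that a check like this only catches vectors of different lengths. It cannot detect vectors of the same length that come from different models and therefore carry different meanings.

```python
def dot_product(a: list[float], b: list[float]) -> float:
    """Dot product that refuses to compare vectors of different lengths."""
    if len(a) != len(b):
        raise ValueError(
            f"cannot compare vectors of length {len(a)} and {len(b)}"
        )
    return sum(x * y for x, y in zip(a, b))


rgb = [255.0, 0.0, 0.0]        # three components
cmyk = [0.0, 1.0, 1.0, 0.0]    # four components

try:
    dot_product(rgb, cmyk)
    comparable = True
except ValueError as err:
    comparable = False
    print(err)
```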

Schema

A schema in Weaviate is the blueprint that defines its data structure. It does so for each class, which is a collection of objects of the same type.

Here is an example schema structure:

Example schema
{
  "classes": [
    {
      "class": "Article",
      "invertedIndexConfig": {
        "bm25": {
          "b": 0.75,
          "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
          "additions": null,
          "preset": "en",
          "removals": null
        }
      },
      "moduleConfig": {
        "text2vec-openai": {
          "model": "ada",
          "modelVersion": "002",
          "type": "text",
          "vectorizeClassName": true
        }
      },
      "properties": [
        {
          "dataType": [
            "text"
          ],
          "moduleConfig": {
            "text2vec-openai": {
              "skip": false,
              "vectorizePropertyName": false
            }
          },
          "name": "title",
          "tokenization": "word"
        },
        {
          "dataType": [
            "text"
          ],
          "moduleConfig": {
            "text2vec-openai": {
              "skip": false,
              "vectorizePropertyName": false
            }
          },
          "name": "body",
          "tokenization": "word"
        }
      ],
      "replicationConfig": {
        "factor": 1
      },
      "shardingConfig": {
        "virtualPerPhysical": 128,
        "desiredCount": 1,
        "actualCount": 1,
        "desiredVirtualCount": 128,
        "actualVirtualCount": 128,
        "key": "_id",
        "strategy": "hash",
        "function": "murmur3"
      },
      "vectorIndexConfig": {
        "skip": false,
        "cleanupIntervalSeconds": 300,
        "maxConnections": 32,
        "efConstruction": 128,
        "ef": -1,
        "dynamicEfMin": 100,
        "dynamicEfMax": 500,
        "dynamicEfFactor": 8,
        "vectorCacheMaxObjects": 1000000000000,
        "flatSearchCutoff": 40000,
        "distance": "cosine",
        "pq": {
          "enabled": false,
          "segments": 0,
          "centroids": 256,
          "encoder": {
            "type": "kmeans",
            "distribution": "log-normal"
          }
        }
      },
      "vectorIndexType": "hnsw",
      "vectorizer": "text2vec-openai"
    }
  ]
}

This is a lot of information, and can be quite intimidating. Let's break it down.

First of all, you can see that the top-level key in the object is classes, which contains a list of classes. In this case, there is only one class, Article.

The schema specifies for each class:

  • The metadata such as its name (class),
  • Its data properties,
  • The vectorizer,
  • Module configurations (moduleConfig),
  • The index configurations for the inverted index (invertedIndexConfig) and the vector index (vectorIndexConfig),
  • and more.
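
One way to get an overview of a schema like this is to load the JSON and pull out the items listed above for each class. The sketch below does so with Python's standard json module on a trimmed copy of the Article example. The summarize_schema helper and its output keys are hypothetical names chosen for illustration.

```python
import json

# Trimmed copy of the example schema, keeping only a few keys.
schema_json = """
{
  "classes": [
    {
      "class": "Article",
      "invertedIndexConfig": {"bm25": {"b": 0.75, "k1": 1.2}},
      "properties": [
        {"dataType": ["text"], "name": "title", "tokenization": "word"},
        {"dataType": ["text"], "name": "body", "tokenization": "word"}
      ],
      "vectorIndexConfig": {"distance": "cosine"},
      "vectorIndexType": "hnsw",
      "vectorizer": "text2vec-openai"
    }
  ]
}
"""


def summarize_schema(schema: dict) -> list[dict]:
    """Collect the name, properties, vectorizer, and index settings per class."""
    summaries = []
    for cls in schema["classes"]:
        summaries.append({
            "name": cls["class"],
            "properties": [p["name"] for p in cls.get("properties", [])],
            "vectorizer": cls.get("vectorizer"),
            "vector_index_type": cls.get("vectorIndexType"),
            "distance": cls.get("vectorIndexConfig", {}).get("distance"),
            "bm25": cls.get("invertedIndexConfig", {}).get("bm25"),
        })
    return summaries


summary = summarize_schema(json.loads(schema_json))
for entry in summary:
    print(entry)
```

The same helper works on the full example, since it only reads the keys it needs and ignores the rest.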
Auto-schema

Any missing information required for schema definition will be automatically inferred by Weaviate based on default values and the imported data.

Review

Review exercise

  Question
What is the function of an inverted index in Weaviate?
  Question
What does the vector index in Weaviate enable?
  Question
What is a class in Weaviate?
  Question
What is the function of the schema in Weaviate?

Key takeaways

  • Weaviate stores data using two main indexes: the inverted index and the vector index.
  • A class in Weaviate represents a collection of objects of the same type, and each object in Weaviate must belong to a single class.
  • A schema is the blueprint that defines Weaviate's data structure.
  • Vector searches can only be performed within a single vector space.
    • So, any objects you want to search together should be in the same class.
  • Any missing information required for schema definition will be automatically inferred by Weaviate based on default values and the imported data.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.