Ref2Vec Centroid Vectorizer
Introduction
The ref2Vec-centroid
module is used to calculate object vectors based on the centroid of referenced vectors. The idea is that this centroid vector would be calculated from the vectors of an object's references, enabling associations between clusters of objects. This is useful in applications such as making suggestions based on the aggregation of a user's actions or preferences.
How to enable
Weaviate Cloud
This module is enabled by default on the WCD.
Weaviate open source
Which modules to use in a Weaviate instance can be specified in the Docker Compose
file. Ref2Vec-centroid can be added like this:
---
services:html
weaviate:
command:
- --host
- 0.0.0.0
- --port
- '8080'
- --scheme
- http
image: cr.weaviate.io/semitechnologies/weaviate:1.28.0
ports:
- 8080:8080
- 50051:50051
restart: on-failure:0
environment:
QUERY_DEFAULTS_LIMIT: 25
AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
ENABLE_MODULES: 'ref2vec-centroid'
CLUSTER_HOSTNAME: 'node1'
...
How to configure
In your Weaviate schema, you must define how you want this module to vectorize your data. If you are new to Weaviate schemas, you might want to check out the tutorial on the Weaviate schema first.
For example, here is an Article
class which is configured to use ref2vec-centroid. Doing so requires only a class-level moduleConfig
, containing two fields:
referenceProperties
: a list of the class' reference properties which should be used during the calculation of the centroid.method
: the method by which the centroid is calculated. Currently onlymean
is supported.
The Article
class specifies its hasParagraphs
property as the only reference property to be used in the calculation of an Article
object's vector.
It is important to note that unlike the other vectorizer modules (e.g. text2vec/multi2vec/img2vec), ref2vec-centroid does not generate embeddings based on the contents of an object. Rather, the point of this module is to calculate an object's vector based on vectors of its references.
In this case, the Paragraph
class is configured to generate vectors using the text2vec-contextionary module. Thus, the vector representation of the Article
class is an average of text2vec-contextionary vectors sourced from referenced Paragraph
instances.
Although this example uses text2vec-contextionary to generate vectors for the Paragraph
class, ref2vec-centroid's behavior remains identical for user-provided vectors. In such a case, ref2vec-centroid's output will still be calculated as an average of the reference vectors; the only difference being the provenance of the reference vectors.
{
"classes": [
{
"class": "Article",
"description": "A class representing a published article",
"moduleConfig": {
"ref2vec-centroid": {
"referenceProperties": ["hasParagraphs"],
"method": "mean"
}
},
"properties": [
{
"dataType": [
"text"
],
"description": "Title of the article",
"name": "title"
}
,
{
"dataType": [
"Paragraph"
],
"description": "Paragraphs belonging to this article",
"name": "hasParagraphs"
}
],
"vectorizer": "ref2vec-centroid"
},
{
"class": "Paragraph",
"description": "Paragraphs belonging to an Article",
"properties": [
{
"dataType": [
"text"
],
"description": "Content that will be vectorized",
"moduleConfig": {
"text2vec-contextionary": {
"skip": false,
"vectorizePropertyName": false
}
},
"name": "content"
}
],
"vectorizer": "text2vec-contextionary"
}
]
}
How to use
Now that the Article
class is properly configured to use the ref2vec-centroid module, we can begin to create some objects. If there are not yet any Paragraph
objects to reference, or if we simply don't want to reference a Paragraph
object yet, any newly created Article
object will have its vector set to nil
.
Once we are ready to reference one or more existing Paragraph
objects (with non-nil vectors), our Article
object will automatically be assigned a centroid vector, calculated using the vectors from all the Paragraph
objects which are referenced by our Article
object.
Updating the centroid
An object whose class is configured to use ref2vec-centroid will have its vector calculated (or recalculated) as a result of these events:
- Creating the object with references already assigned as properties
- Object
POST
: create a single new object with references - Batch object
POST
: create multiple objects at once, each with references
- Object
- Updating an existing object's list of references. Note that this can happen several ways:
- Object
PUT
: update all of the object's properties with a new set of references. This totally replaces the object's existing reference list with the newly provided one - Object
PATCH
: update an existing object by adding any newly provided reference(s) to the object's existing reference list - Reference
POST
: create a new reference to an existing object - Reference
PUT
: update all of the object's references
- Object
- Deleting references from the object. Note that this can happen several ways:
- Object
PUT
: update all of the object's properties, removing all references - Reference
DELETE
: delete an existing reference from the object's list of references
- Object
Note: Adding references in batches is not currently supported. This is because the batch reference feature is specifically built to avoid the cost of updating the vector index. If this is an important use case for you, open a feature request on GitHub.
Making queries
This module can be used with the existing nearVector and nearObject
filters. It does not add any additional GraphQL extensions like nearText
.
Additional information
It is important to note that updating a referenced object will not automatically trigger an update to the referencing object's vector.
In other words, using our Article
/Paragraph
example:
Let's say an Article
object, "On the Philosophy of Modern Ant Colonies"
, references three Paragraph
objects: "intro"
, "body"
, and "conclusion"
. Over time, "body"
may be updated as more research has been conducted on the dynamic between worker ants and soldier ants. In this case, the existing vector for the article will not be updated with a new vector based on the refactored "body"
.
If we want "On the Philosophy of Modern Ant Colonies"
's centroid vector to be recalculated, we would need to otherwise trigger an update. For example, we could either remove the reference to "body"
and add it back, or simply PUT
the Article
object with an identical object.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.