Contextionary Vectorizer
Overview
The text2vec-contextionary module enables Weaviate to obtain vectors locally using a lightweight model.
Key notes:
- This module is not available on Weaviate Cloud (WCD).
- Enabling this module will enable the nearText search operator.
- This module is based on fastText and uses a weighted mean of word embeddings (WMOWE) to produce the vector.
- Available for multiple languages.
As a lightweight model, text2vec-contextionary is well suited for testing purposes.
For production use cases, we recommend using other modules that use a more modern, transformer-based architecture.
Weaviate instance configuration
This module is not available on Weaviate Cloud.
Docker Compose file
To use text2vec-contextionary, you must enable it in your Docker Compose file (e.g. docker-compose.yml).
While you can do so manually, we recommend using the Weaviate configuration tool to generate the Docker Compose file.
Parameters
Weaviate:
- ENABLE_MODULES (Required): The modules to enable. Include text2vec-contextionary to enable the module.
- DEFAULT_VECTORIZER_MODULE (Optional): The default vectorizer module. You can set this to text2vec-contextionary to make it the default for all classes.
Contextionary:
- EXTENSIONS_STORAGE_MODE: Location of storage for extensions to the Contextionary.
- EXTENSIONS_STORAGE_ORIGIN: The host of the custom extension storage.
- NEIGHBOR_OCCURRENCE_IGNORE_PERCENTILE: This can be used to hide very rare words. If you set it to '5', the 5th percentile of words by occurrence is removed in the nearestNeighbor search (used, for example, in the GraphQL _additional { nearestNeighbors } feature).
- ENABLE_COMPOUND_SPLITTING: See the Compound splitting section.
Example
This configuration enables text2vec-contextionary, sets it as the default vectorizer, and sets the parameters for the Contextionary Docker container.
---
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.27.4
    ports:
    - 8080:8080
    - 50051:50051
    restart: on-failure:0
    environment:
      CONTEXTIONARY_URL: contextionary:9999
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'text2vec-contextionary'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-contextionary'
      CLUSTER_HOSTNAME: 'node1'
  contextionary:
    environment:
      OCCURRENCE_WEIGHT_LINEAR_FACTOR: 0.75
      EXTENSIONS_STORAGE_MODE: weaviate
      EXTENSIONS_STORAGE_ORIGIN: http://weaviate:8080
      NEIGHBOR_OCCURRENCE_IGNORE_PERCENTILE: 5
      ENABLE_COMPOUND_SPLITTING: 'false'
    image: cr.weaviate.io/semitechnologies/contextionary:en0.16.0-v1.2.1
    ports:
    - 9999:9999
...
Class configuration
You can configure how the module will behave in each class through the Weaviate schema.
Vectorization settings
You can set vectorizer behavior using the moduleConfig section under each class and property:
Class-level
- vectorizer: what module to use to vectorize the data.
- vectorizeClassName: whether to vectorize the class name. Default: true.
Property-level
- skip: whether to skip vectorizing the property altogether. Default: false.
- vectorizePropertyName: whether to vectorize the property name. Default: false.
Example
{
  "classes": [
    {
      "class": "Document",
      "description": "A class called document",
      "vectorizer": "text2vec-contextionary",
      "moduleConfig": {
        "text2vec-contextionary": {
          "vectorizeClassName": false
        }
      },
      "properties": [
        {
          "name": "content",
          "dataType": [
            "text"
          ],
          "description": "Content that will be vectorized",
          "moduleConfig": {
            "text2vec-contextionary": {
              "skip": false,
              "vectorizePropertyName": false
            }
          }
        }
      ]
    }
  ]
}
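To make the effect of these settings concrete, here is a minimal Python sketch that collects the text pieces a class definition like the one above would select for vectorization. The function name text_to_vectorize and the way the pieces are joined are assumptions made for illustration; this is not Weaviate's actual implementation, which applies further processing.

```python
def text_to_vectorize(class_def, obj, module="text2vec-contextionary"):
    """Collect the text pieces selected by the moduleConfig settings (illustrative only)."""
    class_cfg = class_def.get("moduleConfig", {}).get(module, {})
    pieces = []
    # vectorizeClassName defaults to true
    if class_cfg.get("vectorizeClassName", True):
        pieces.append(class_def["class"])
    for prop in class_def.get("properties", []):
        prop_cfg = prop.get("moduleConfig", {}).get(module, {})
        name = prop["name"]
        # skip defaults to false; properties missing from the object are ignored
        if prop_cfg.get("skip", False) or name not in obj:
            continue
        # vectorizePropertyName defaults to false
        if prop_cfg.get("vectorizePropertyName", False):
            pieces.append(name)
        pieces.append(str(obj[name]))
    return " ".join(pieces)


document_class = {
    "class": "Document",
    "moduleConfig": {"text2vec-contextionary": {"vectorizeClassName": False}},
    "properties": [
        {
            "name": "content",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-contextionary": {"skip": False, "vectorizePropertyName": False}
            },
        }
    ],
}

print(text_to_vectorize(document_class, {"content": "Fashion week in Paris"}))
```

With vectorizeClassName and vectorizePropertyName both set to false, only the property value contributes to the text in this sketch.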
Class/property names
If you are using this module and are vectorizing the class or property name, the name(s) must consist of words that are part of the text2vec-contextionary.
To use multiple words as a class or property definition, concatenate them as:
- camel case (e.g. bornIn) for class or property names, or
- snake case (e.g. born_in) for property names.
For example, the following are acceptable:
Publication
  name
  hasArticles
Article
  title
  summary
  wordCount
  url
  hasAuthors
  inPublication        # CamelCase (all versions)
  publication_date     # snake_case (from v1.7.2 on)
Author
  name
  wroteArticles
  writesFor
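As a rough illustration of how such names break down into individual words, the following Python sketch splits camelCase, CamelCase, and snake_case names. The helper split_name is hypothetical and only approximates the idea; it is not the module's own tokenizer.

```python
import re


def split_name(name):
    """Split a camelCase, CamelCase, or snake_case name into lowercase words."""
    words = []
    for part in name.split("_"):
        # An optional uppercase letter followed by lowercase letters/digits forms a word;
        # a run of uppercase letters not followed by a lowercase letter is kept together.
        words.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part))
    return [w.lower() for w in words]


for name in ["Publication", "hasArticles", "wordCount", "publication_date"]:
    print(name, "->", split_name(name))
```

Each resulting word would then need to be present in the Contextionary for the name to be vectorized.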
Usage example
This is an example of a nearText query with text2vec-contextionary.
- Python Client v4
- Python Client v3
- JS/TS Client v2
- Go
- Java
- Curl
- GraphQL
import weaviate
import weaviate.classes as wvc
from weaviate.classes.query import Move
import os
client = weaviate.connect_to_local()
try:
    publications = client.collections.get("Publication")
    response = publications.query.near_text(
        query="fashion",
        distance=0.6,
        move_to=Move(force=0.85, concepts="haute couture"),
        move_away=Move(force=0.45, concepts="finance"),
        return_metadata=wvc.query.MetadataQuery(distance=True),
        limit=2
    )
    for o in response.objects:
        print(o.properties)
        print(o.metadata)
finally:
    client.close()
import weaviate
client = weaviate.Client("http://localhost:8080")
nearText = {
    "concepts": ["fashion"],
    "distance": 0.6,  # prior to v1.14 use "certainty" instead of "distance"
    "moveAwayFrom": {
        "concepts": ["finance"],
        "force": 0.45
    },
    "moveTo": {
        "concepts": ["haute couture"],
        "force": 0.85
    }
}
result = (
    client.query
    .get("Publication", "name")
    .with_additional(["distance"])  # "certainty" can be requested instead, but only if distance==cosine
    .with_near_text(nearText)
    .do()
)
print(result)
import weaviate from 'weaviate-ts-client';
const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});
const response = await client.graphql
  .get()
  .withClassName('Publication')
  .withFields('name _additional{certainty distance}') // note that certainty is only supported if distance==cosine
  .withNearText({
    concepts: ['fashion'],
    distance: 0.6, // prior to v1.14 use certainty instead of distance
    moveAwayFrom: {
      concepts: ['finance'],
      force: 0.45,
    },
    moveTo: {
      concepts: ['haute couture'],
      force: 0.85,
    },
  })
  .do();
console.log(response);
package main
import (
  "context"
  "fmt"
  "github.com/weaviate/weaviate-go-client/v4/weaviate"
  "github.com/weaviate/weaviate-go-client/v4/weaviate/graphql"
)
func main() {
  cfg := weaviate.Config{
    Host:   "localhost:8080",
    Scheme: "http",
  }
  client, err := weaviate.NewClient(cfg)
  if err != nil {
    panic(err)
  }
  className := "Publication"
  name := graphql.Field{Name: "name"}
  _additional := graphql.Field{
    Name: "_additional", Fields: []graphql.Field{
      {Name: "certainty"}, // only supported if distance==cosine
      {Name: "distance"},  // always supported
    },
  }
  concepts := []string{"fashion"}
  distance := float32(0.6)
  moveAwayFrom := &graphql.MoveParameters{
    Concepts: []string{"finance"},
    Force:    0.45,
  }
  moveTo := &graphql.MoveParameters{
    Concepts: []string{"haute couture"},
    Force:    0.85,
  }
  nearText := client.GraphQL().NearTextArgBuilder().
    WithConcepts(concepts).
    WithDistance(distance). // use WithCertainty(certainty) prior to v1.14
    WithMoveTo(moveTo).
    WithMoveAwayFrom(moveAwayFrom)
  ctx := context.Background()
  result, err := client.GraphQL().Get().
    WithClassName(className).
    WithFields(name, _additional).
    WithNearText(nearText).
    Do(ctx)
  if err != nil {
    panic(err)
  }
  fmt.Printf("%v", result)
}
package io.weaviate;
import io.weaviate.client.Config;
import io.weaviate.client.WeaviateClient;
import io.weaviate.client.base.Result;
import io.weaviate.client.v1.graphql.model.GraphQLResponse;
import io.weaviate.client.v1.graphql.query.argument.NearTextArgument;
import io.weaviate.client.v1.graphql.query.argument.NearTextMoveParameters;
import io.weaviate.client.v1.graphql.query.fields.Field;
public class App {
  public static void main(String[] args) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);
    NearTextMoveParameters moveTo = NearTextMoveParameters.builder()
      .concepts(new String[]{ "haute couture" }).force(0.85f).build();
    NearTextMoveParameters moveAway = NearTextMoveParameters.builder()
      .concepts(new String[]{ "finance" }).force(0.45f)
      .build();
    NearTextArgument nearText = client.graphQL().arguments().nearTextArgBuilder()
      .concepts(new String[]{ "fashion" })
      .distance(0.6f) // use .certainty(0.7f) prior to v1.14
      .moveTo(moveTo)
      .moveAwayFrom(moveAway)
      .build();
    Field name = Field.builder().name("name").build();
    Field _additional = Field.builder()
      .name("_additional")
      .fields(new Field[]{
        Field.builder().name("certainty").build(), // only supported if distance==cosine
        Field.builder().name("distance").build(),  // always supported
      }).build();
    Result<GraphQLResponse> result = client.graphQL().get()
      .withClassName("Publication")
      .withFields(name, _additional)
      .withNearText(nearText)
      .run();
    if (result.hasErrors()) {
      System.out.println(result.getError());
      return;
    }
    System.out.println(result.getResult());
  }
}
# Note: Under nearText, use `certainty` instead of distance prior to v1.14
# Under _additional, `certainty` is only supported if distance==cosine, but `distance` is always supported
echo '{
  "query": "{
    Get {
      Publication(
        nearText: {
          concepts: [\"fashion\"],
          distance: 0.6,
          moveAwayFrom: {
            concepts: [\"finance\"],
            force: 0.45
          },
          moveTo: {
            concepts: [\"haute couture\"],
            force: 0.85
          }
        }
      ) {
        name
        _additional {
          certainty
          distance
        }
      }
    }
  }"
}' | curl \
    -X POST \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer learn-weaviate' \
    -H "X-OpenAI-Api-Key: $OPENAI_API_KEY" \
    -d @- \
    https://edu-demo.weaviate.network/v1/graphql
{
  Get{
    Publication(
      nearText: {
        concepts: ["fashion"],
        distance: 0.6 # prior to v1.14 use "certainty" instead of "distance"
        moveAwayFrom: {
          concepts: ["finance"],
          force: 0.45
        },
        moveTo: {
          concepts: ["haute couture"],
          force: 0.85
        }
      }
    ){
      name
      _additional {
        certainty # only supported if distance==cosine.
        distance  # always supported
      }
    }
  }
}
Additional information
Find concepts
To find concepts or words or to check if a concept is part of the Contextionary, use the v1/modules/text2vec-contextionary/concepts/<concept> endpoint.
GET /v1/modules/text2vec-contextionary/concepts/<concept>
Parameters
The only parameter, concept, is a string that should be camelCased in case of compound words or a list of words.
Response
The result contains the following fields:
- "individualWords": a list of the results of individual words or concepts in the query, which contains:
  - "word": a string of the requested concept or single word from the concept.
  - "present": a boolean value which is true if the word exists in the Contextionary (shown as "inC11y" in the example response below).
  - "info": an object with the following fields:
    - "nearestNeighbors": a list with the nearest neighbors, containing "word" and "distance" (between the two words in the high dimensional space). Note that "word" can also be a data object.
    - "vector": the raw 300-long vector value.
- "concatenatedWord": an object of the concatenated concept.
  - "concatenatedWord": the concatenated word if the concept given is a camelCased word.
  - "singleWords": a list of the single words in the concatenated concept.
  - "concatenatedVector": a list of vector values of the concatenated concept.
  - "concatenatedNearestNeighbors": a list with the nearest neighbors, containing "word" and "distance" (between the two words in the high dimensional space). Note that "word" can also be a data object.
Example
curl http://localhost:8080/v1/modules/text2vec-contextionary/concepts/magazine
or (note the camelCased compound concept)
- Python
- JS/TS Client v2
- Go
- Java
- Curl
import weaviate
client = weaviate.Client("http://localhost:8080")
concept_info = client.contextionary.get_concept_vector("fashionMagazine")
print(concept_info)
import weaviate from 'weaviate-ts-client';
const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});
const response = await client.c11y
  .conceptsGetter()
  .withConcept('fashionMagazine')
  .do();
console.log(JSON.stringify(response, null, 2));
package main
import (
  "context"
  "fmt"
  "github.com/weaviate/weaviate-go-client/v4/weaviate"
)
func main() {
  cfg := weaviate.Config{
    Host:   "localhost:8080",
    Scheme: "http",
  }
  client, err := weaviate.NewClient(cfg)
  if err != nil {
    panic(err)
  }
  concept, err := client.C11y().
    ConceptsGetter().
    WithConcept("fashionMagazine").
    Do(context.Background())
  if err != nil {
    panic(err)
  }
  fmt.Printf("%v", concept)
}
package io.weaviate;
import io.weaviate.client.Config;
import io.weaviate.client.WeaviateClient;
import io.weaviate.client.base.Result;
import io.weaviate.client.v1.contextionary.model.C11yWordsResponse;
public class App {
  public static void main(String[] args) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);
    Result<C11yWordsResponse> result = client.c11y().conceptsGetter().withConcept("fashionMagazine").run();
    if (result.hasErrors()) {
      System.out.println(result.getError());
      return;
    }
    System.out.println(result.getResult());
  }
}
curl http://localhost:8080/v1/modules/text2vec-contextionary/concepts/fashionMagazine
with a result similar to:
{
  "individualWords": [
    {
      "inC11y": true,
      "info": {
        "nearestNeighbors": [
          {
            "word": "magazine"
          },
          {
            "distance": 6.186641,
            "word": "editorial"
          },
          {
            "distance": 6.372504,
            "word": "featured"
          },
          {
            "distance": 6.5695524,
            "word": "editor"
          },
          {
            "distance": 7.0328364,
            "word": "titled"
          },
          ...
        ],
        "vector": [
          0.136228,
          0.706469,
          -0.073645,
          -0.099225,
          0.830348,
          ...
        ]
      },
      "word": "magazine"
    }
  ]
}
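A response of this shape can be inspected programmatically, for instance to list the words that are not part of the Contextionary. The sketch below works on an already parsed response; the helper missing_words and the second sample entry are made up for illustration.

```python
def missing_words(concept_response):
    """Return the individual words the response reports as absent from the Contextionary."""
    missing = []
    for entry in concept_response.get("individualWords", []):
        # The example response uses "inC11y"; the field list calls this flag "present".
        known = entry.get("inC11y", entry.get("present", False))
        if not known:
            missing.append(entry.get("word"))
    return missing


sample = {
    "individualWords": [
        {"word": "magazine", "inC11y": True, "info": {"nearestNeighbors": [], "vector": []}},
        {"word": "qwertyzzz", "inC11y": False},  # made-up entry for illustration
    ]
}

print(missing_words(sample))  # ['qwertyzzz']
```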
Model details
text2vec-contextionary (Contextionary) is Weaviate's own language vectorizer that is trained using fastText on Wiki and CommonCrawl data.
The text2vec-contextionary model outputs a 300-dimensional vector. This vector is computed by using a Weighted Mean of Word Embeddings (WMOWE) technique.
The vector is calculated based on the centroid of the words weighted by the occurrences of the individual words in the original training text-corpus (e.g., the word "has" is seen as less important than the word "apples").
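The weighted centroid idea can be sketched in a few lines of Python. The toy 3-dimensional vectors and the weights below are invented for illustration; the real model uses 300-dimensional vectors, and how it derives weights from occurrence counts is not shown here.

```python
def weighted_mean(vectors, weights):
    """Return the weighted centroid of equally sized vectors."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector and at least one vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Toy embeddings (assumed values): a frequent word gets a low weight, a rarer word a higher one.
has_vec, apples_vec = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(weighted_mean([has_vec, apples_vec], [0.1, 0.9]))
```

The result lies much closer to the vector of the higher-weighted word, which is the effect described above for "has" versus "apples".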
Available languages
Contextionary models are available for the following languages:
- Trained on CommonCrawl and Wiki, using GloVe
  - English
  - Dutch
  - German
  - Czech
  - Italian
- Trained on Wiki
  - English
  - Dutch
Extending the Contextionary
Custom words or abbreviations (i.e., "concepts") can be added to text2vec-contextionary through the v1/modules/text2vec-contextionary/extensions/ endpoint.
Using this endpoint will enrich the Contextionary with your own words, abbreviations or concepts in context by transfer learning. Using the v1/modules/text2vec-contextionary/extensions/ endpoint adds or updates the concepts in real-time.
Note that you need to introduce the new concepts into Weaviate before adding the data, as extending the Contextionary will not cause Weaviate to automatically update the vectors.
Parameters
A body (in JSON or YAML) describing the word or abbreviation you want to add to the Contextionary, with the following fields:
- "concept": a string with the word, compound word or abbreviation.
- "definition": a clear description of the concept, which will be used to create the context of the concept and place it in the high dimensional Contextionary space.
- "weight": a float with the relative weight of the concept (default concepts in the Contextionary have a weight of 1.0).
Response
The same fields as the input parameters will be in the response body if the extension was successful.
Example
Let's add the concept "weaviate" to the Contextionary.
- Python
- JS/TS Client v2
- Go
- Java
- Curl
import weaviate
client = weaviate.Client("http://localhost:8080")
client.contextionary.extend("weaviate", "Open source cloud native real time vector database", 1.0)
import weaviate from 'weaviate-ts-client';
const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});
const response = await client.c11y
  .extensionCreator()
  .withConcept('weaviate')
  .withDefinition('Open source cloud native real time vector database')
  .withWeight(1)
  .do();
console.log(JSON.stringify(response, null, 2));
package main
import (
  "context"
  "github.com/weaviate/weaviate-go-client/v4/weaviate"
)
func main() {
  cfg := weaviate.Config{
    Host:   "localhost:8080",
    Scheme: "http",
  }
  client, err := weaviate.NewClient(cfg)
  if err != nil {
    panic(err)
  }
  // err is already declared above, so assign with = rather than :=
  err = client.C11y().ExtensionCreator().
    WithConcept("weaviate").
    WithDefinition("Open source cloud native real time vector database").
    WithWeight(1.0).
    Do(context.Background())
  if err != nil {
    panic(err)
  }
}
package io.weaviate;
import io.weaviate.client.Config;
import io.weaviate.client.WeaviateClient;
import io.weaviate.client.base.Result;
public class App {
  public static void main(String[] args) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);
    Result<Boolean> result = client.c11y().extensionCreator()
      .withConcept("weaviate")
      .withDefinition("Open source cloud native real time vector database")
      .withWeight(1.0f)
      .run();
    if (result.hasErrors()) {
      System.out.println(result.getError());
      return;
    }
    System.out.println(result.getResult());
  }
}
curl \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "concept": "weaviate",
    "definition": "Open source cloud native real time vector database",
    "weight": 1
  }' \
  http://localhost:8080/v1/modules/text2vec-contextionary/extensions
You can always check if the new concept exists in the Contextionary:
curl http://localhost:8080/v1/modules/text2vec-contextionary/concepts/weaviate
Note that it is not (yet) possible to extend the Contextionary with concatenated words or concepts consisting of more than one word.
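Because multi-word concepts are not accepted, it can help to check the input before sending it. The Python sketch below builds the JSON body for the extensions endpoint and rejects concepts containing whitespace. The function build_extension_body and its checks are assumptions for illustration; they only approximate the restriction and do not replace server-side validation.

```python
import json


def build_extension_body(concept, definition, weight=1.0):
    """Build the JSON body for the extensions endpoint, with basic client-side checks."""
    concept = concept.strip()
    # Assumed check: a concept must be a single token without whitespace.
    if not concept or len(concept.split()) != 1:
        raise ValueError("concept must be a single word or abbreviation")
    if not definition.strip():
        raise ValueError("definition must not be empty")
    return json.dumps(
        {"concept": concept, "definition": definition, "weight": float(weight)}
    )


body = build_extension_body(
    "weaviate", "Open source cloud native real time vector database", 1
)
print(body)
```

The resulting string can be sent as the request body of the POST request shown in the curl example.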
You can also overwrite current concepts with this endpoint. Let's say you are using the abbreviation API for Academic Performance Index instead of Application Programming Interface, and you want to reposition this concept in the Contextionary:
curl \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "concept": "api",
    "definition": "Academic Performance Index a measurement of academic performance and progress of individual schools in California",
    "weight": 1
  }' \
  http://localhost:8080/v1/modules/text2vec-contextionary/extensions
The meaning of the concept API has now changed in your Weaviate setting.
Stopwords
Note that stopwords are automatically removed from camelCased and CamelCased names.
Vectorization behavior
Stopwords can be useful, so we don't want to encourage you to leave them out completely. Instead Weaviate will remove them during vectorization.
In most cases you won't even notice that this happens in the background, however, there are a few edge cases that might cause a validation error:
- If your camelCased class or property name consists only of stopwords, validation will fail. Example: TheInA is not a valid class name, however, TheCarInAField is (and would internally be represented as CarField).
- If your keyword list contains stop words, they will be removed. However, if every single keyword is a stop word, validation will fail.
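The name handling described above can be approximated with the following Python sketch. The stopword set is a small hypothetical sample, and strip_stopwords is an illustrative helper rather than Weaviate's own validation code.

```python
import re

# Hypothetical stopword sample; the real list is published with the Contextionary files.
STOPWORDS = {"the", "in", "a", "is", "of"}


def strip_stopwords(name):
    """Remove stopwords from a camelCased name; fail if nothing remains."""
    # Each word starts at an uppercase letter or is a leading lowercase run.
    words = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", name)
    kept = [w for w in words if w.lower() not in STOPWORDS]
    if not kept:
        raise ValueError(f"{name!r} consists only of stopwords")
    return "".join(kept)


print(strip_stopwords("TheCarInAField"))  # CarField
```

Calling strip_stopwords("TheInA") raises a ValueError in this sketch, mirroring the validation failure described for names made up entirely of stopwords.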
How does Weaviate decide whether a word is a stop word or not?
The list of stopwords is derived from the Contextionary version used and is published alongside the Contextionary files.
Compound splitting
Sometimes Weaviate's Contextionary does not understand words which are compounded out of words it would otherwise understand. This impact is far greater in languages that allow for arbitrary compounding (such as Dutch or German) than in languages where compounding is not very common (such as English).
Effect
Imagine you import an object of class Post with content This is a thunderstormcloud. The arbitrarily compounded word thunderstormcloud is not present in the Contextionary. So your object's position will be made up of the only words it recognizes: "post", "this" ("is" and "a" are removed as stopwords).
If you check how this content was vectorized using the _interpretation feature, you will see something like the following:
"_interpretation": {
  "source": [
    {
      "concept": "post",
      "occurrence": 62064610,
      "weight": 0.3623903691768646
    },
    {
      "concept": "this",
      "occurrence": 932425699,
      "weight": 0.10000000149011612
    }
  ]
}
To overcome this limitation, the optional Compound Splitting feature can be enabled in the Contextionary. It will understand the arbitrarily compounded word and interpret your object as follows:
"_interpretation": {
  "source": [
    {
      "concept": "post",
      "occurrence": 62064610,
      "weight": 0.3623903691768646
    },
    {
      "concept": "this",
      "occurrence": 932425699,
      "weight": 0.10000000149011612
    },
    {
      "concept": "thunderstormcloud (thunderstorm, cloud)",
      "occurrence": 5756775,
      "weight": 0.5926488041877747
    }
  ]
}
Note that the newly found word (made up of the parts thunderstorm and cloud) has the highest weight in the vectorization. So this meaning, which would have been lost without Compound Splitting, can now be recognized.
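To illustrate the general idea of splitting an unknown compound into known parts, here is a small Python sketch that tries the longest known prefix first and recurses on the remainder. The function split_compound, the min_len parameter, and the tiny vocabulary are assumptions for illustration; the Contextionary's actual splitting algorithm may work differently.

```python
def split_compound(word, vocabulary, min_len=3):
    """Return a list of vocabulary words that concatenate to `word`, or None."""
    word = word.lower()
    if word in vocabulary:
        return [word]
    # Try the longest known prefix first, then recurse on the remainder.
    for end in range(len(word) - min_len, min_len - 1, -1):
        prefix = word[:end]
        if prefix in vocabulary:
            rest = split_compound(word[end:], vocabulary, min_len)
            if rest is not None:
                return [prefix] + rest
    return None


vocabulary = {"thunderstorm", "thunder", "storm", "cloud"}  # toy vocabulary
print(split_compound("thunderstormcloud", vocabulary))  # ['thunderstorm', 'cloud']
```

Since a lookup like this runs for every unrecognized word, it also hints at why enabling the feature can slow down imports.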
How to enable
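You can enable Compound Splitting through the ENABLE_COMPOUND_SPLITTING environment variable of the text2vec-contextionary container in the Docker Compose file. The Docker Compose example in the configuration section above shows where this variable is set.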
Trade-Off Import speed vs Word recognition
Compound Splitting runs on any word that is otherwise not recognized. Depending on your dataset, this can lead to a significantly longer import time (up to 100% longer). Therefore, you should carefully evaluate whether the higher precision in recognition or the faster import times are more important to your use case. As the benefit is larger in some languages (e.g. Dutch, German) than in others (e.g. English), this feature is turned off by default.
Noise filtering
So-called "noise words" are concatenated words of random words with no easily recognizable meaning. These words are present in the Contextionary training space, but are extremely rare and therefore distributed seemingly randomly. As a consequence, an "ordinary" result of querying features relying on nearest neighbors (additional properties nearestNeighbors or semanticPath) might contain such noise words as immediate neighbors.
To combat this noise, a neighbor filtering feature was introduced in the Contextionary, which ignores words of the configured bottom percentile, ranked by occurrence in the respective training set. By default this value is set to the bottom 5th percentile. This setting can be overridden. To set another value, e.g. to ignore the bottom 10th percentile, provide the environment variable NEIGHBOR_OCCURRENCE_IGNORE_PERCENTILE=10 to the text2vec-contextionary container, in the Docker Compose file.
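The filtering idea can be sketched in Python as follows. The occurrence counts, word names, and both helper functions are invented for illustration, and the exact percentile computation used by the Contextionary may differ.

```python
def occurrence_threshold(occurrences, percentile):
    """Smallest occurrence count that is not in the bottom `percentile` percent."""
    ranked = sorted(occurrences)
    if not ranked:
        return 0
    cutoff_index = int(len(ranked) * percentile / 100)
    return ranked[min(cutoff_index, len(ranked) - 1)]


def filter_neighbors(neighbors, occurrence_by_word, percentile=5):
    """Drop neighbors whose word falls into the bottom percentile by occurrence."""
    threshold = occurrence_threshold(list(occurrence_by_word.values()), percentile)
    return [n for n in neighbors if occurrence_by_word.get(n["word"], 0) >= threshold]


# Toy vocabulary of 20 words with occurrence counts 1..20 (assumed data).
occurrence_by_word = {f"w{i}": i for i in range(1, 21)}
neighbors = [{"word": "w1"}, {"word": "w2"}, {"word": "w20"}]

print(filter_neighbors(neighbors, occurrence_by_word, percentile=5))
print(filter_neighbors(neighbors, occurrence_by_word, percentile=10))
```

With 20 words, the bottom 5th percentile covers only the rarest word, while the bottom 10th percentile covers the two rarest, so raising the value removes more candidates from the neighbor list.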
Model license(s)
The text2vec-contextionary module is based on the fastText library, which is released under the MIT license. See the license file for more information.
It is your responsibility to evaluate whether the terms of its license(s), if any, are appropriate for your intended use.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.