    Named Entity Recognition


    In short

    • The Named Entity Recognition (NER) module is a Weaviate module for token classification.
    • The module depends on a NER Transformers model that must be running alongside Weaviate. Pre-built models are available, but you can also attach another Hugging Face Transformers model or a custom NER model.
    • The module adds a tokens {} filter to the GraphQL _additional {} field.
    • The module returns data objects as usual, with recognized tokens in the GraphQL _additional { tokens {} } field.

    Introduction

    The Named Entity Recognition (NER) module is a Weaviate module that extracts entities from your existing Weaviate (text) objects on the fly. Entity extraction happens at query time. For maximum performance, transformer-based models should run on GPUs. CPUs can be used, but throughput will be lower.

    There are currently three pre-built NER models available (taken from Hugging Face): dbmdz-bert-large-cased-finetuned-conll03-english, dslim-bert-base-NER, and davlan-bert-base-multilingual-cased-ner-hrl.

    How to enable (module configuration)

    Docker-compose

    The NER module can be added as a service to the Docker-compose file. You must also have a text vectorizer such as text2vec-contextionary or text2vec-transformers running. The following example Docker-compose file uses the ner-transformers module (with the dbmdz-bert-large-cased-finetuned-conll03-english model) in combination with text2vec-contextionary:

    ---
    version: '3.4'
    services:
      weaviate:
        command:
        - --host
        - 0.0.0.0
        - --port
        - '8080'
        - --scheme
        - http
        image: semitechnologies/weaviate:1.15.2
        ports:
        - 8080:8080
        restart: on-failure:0
        environment:
          CONTEXTIONARY_URL: contextionary:9999
          NER_INFERENCE_API: "http://ner-transformers:8080"
          QUERY_DEFAULTS_LIMIT: 25
          AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
          PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
          DEFAULT_VECTORIZER_MODULE: 'text2vec-contextionary'
          ENABLE_MODULES: 'text2vec-contextionary,ner-transformers'
          CLUSTER_HOSTNAME: 'node1'
      contextionary:
        environment:
          OCCURRENCE_WEIGHT_LINEAR_FACTOR: 0.75
          EXTENSIONS_STORAGE_MODE: weaviate
          EXTENSIONS_STORAGE_ORIGIN: http://weaviate:8080
          NEIGHBOR_OCCURRENCE_IGNORE_PERCENTILE: 5
          ENABLE_COMPOUND_SPLITTING: 'false'
        image: semitechnologies/contextionary:en0.16.0-v1.0.2
        ports:
        - 9999:9999
      ner-transformers:
        image: semitechnologies/ner-transformers:dbmdz-bert-large-cased-finetuned-conll03-english
    ...
    

    Variable explanations:

    • NER_INFERENCE_API: the URL where the NER module is running
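
    To confirm that Weaviate picked up the module after startup, one option is to inspect the response of the /v1/meta endpoint. The sketch below assumes that this response carries a "modules" object keyed by module name. The helper name is_module_enabled is hypothetical, and the sample payload is made up for illustration rather than captured from a real instance.

```python
def is_module_enabled(meta: dict, module_name: str = "ner-transformers") -> bool:
    """Return True if module_name appears in the 'modules' section of a meta response."""
    # Assumption: the meta payload has a top-level "modules" mapping keyed by module name.
    modules = meta.get("modules") or {}
    return module_name in modules


# Illustrative payload only; a real one could come from client.get_meta().
sample_meta = {
    "hostname": "http://[::]:8080",
    "modules": {"ner-transformers": {}, "text2vec-contextionary": {}},
    "version": "1.15.2",
}
print(is_module_enabled(sample_meta))
```

    If the check fails, the ENABLE_MODULES and NER_INFERENCE_API values in the Docker-compose file are the first things to review.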

    How to use (GraphQL)

    To make use of the module's capabilities, extend your query with the following new _additional property:

    GraphQL Token

    This module adds a search filter to the GraphQL _additional field in queries: tokens{}. This new filter takes the following arguments:

    Field | Data type | Required | Example value | Description
    properties | list of strings | yes | ["summary"] | The properties of the queried class that contain text (text or string datatype). You must provide at least one property.
    certainty | float | no | 0.75 | Desired minimum certainty or confidence that a recognized token must have. The higher the value, the stricter the token classification. If no certainty is set, all tokens found by the model are returned.
    limit | int | no | 1 | The maximum number of tokens returned per data object in total.
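
    As a rough illustration of how these arguments fit together, the following sketch assembles the _additional fragment as a plain string. The helper build_tokens_field is hypothetical and not part of any Weaviate client; it only mirrors the three arguments listed above and leaves out the optional ones when they are not set.

```python
import json
from typing import List, Optional


def build_tokens_field(
    properties: List[str],
    certainty: Optional[float] = None,
    limit: Optional[int] = None,
) -> str:
    """Build the `_additional { tokens(...) {...} }` fragment as a string."""
    if not properties:
        raise ValueError("at least one property is required")
    # json.dumps yields a double-quoted list, which is also valid GraphQL list syntax.
    args = [f"properties: {json.dumps(properties)}"]
    if certainty is not None:
        args.append(f"certainty: {certainty}")
    if limit is not None:
        args.append(f"limit: {limit}")
    fields = "entity property word certainty startPosition endPosition"
    return f"_additional {{ tokens({', '.join(args)}) {{ {fields} }} }}"


print(build_tokens_field(["title"], certainty=0.7, limit=10))
```

    The resulting string can be passed as one of the requested properties, next to title, in a client query.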

    Example query

    {
      Get {
        Article(
          limit: 1
        ) {
          title
          _additional{
            tokens(
              properties: ["title"],
              limit: 10,
              certainty: 0.7
            ) {
              certainty
              endPosition
              entity
              property
              startPosition
              word
            }
          }
        }
      }
    }
    
    import weaviate
    
    client = weaviate.Client("http://localhost:8080")
    
    result = (
      client.query
      .get('Article', ['title', '_additional {tokens ( properties: ["title"], limit: 1, certainty: 0.7) {entity property word certainty startPosition endPosition }}'])
      .do()
    )
    
    print(result)
    
    const weaviate = require("weaviate-client");
    
    const client = weaviate.client({
      scheme: 'http',
      host: 'localhost:8080',
    });
    
    client.graphql
          .get()
          .withClassName('Article')
          .withFields('title _additional {tokens ( properties: [\'title\'], limit: 1, certainty: 0.7) {entity property word certainty startPosition endPosition} }')
          .do()
          .then(res => {
            console.log(res)
          })
          .catch(err => {
            console.error(err)
          });
    
    package main
    
    import (
      "context"
      "fmt"
    
      "github.com/semi-technologies/weaviate-go-client/v4/weaviate"
      "github.com/semi-technologies/weaviate-go-client/v4/weaviate/graphql"
    )
    
    func main() {
      cfg := weaviate.Config{
        Host:   "localhost:8080",
        Scheme: "http",
      }
      client := weaviate.New(cfg)
    
      className := "Article"
      fields := []graphql.Field{
        {Name: "title"},
        {Name: "_additional", Fields: []graphql.Field{
          {Name: "tokens(properties: [\"title\"], limit: 1, certainty: 0.7)", Fields: []graphql.Field{
            {Name: "entity"},
            {Name: "property"},
            {Name: "word"},
            {Name: "certainty"},
            {Name: "startPosition"},
            {Name: "endPosition"},
          }},
        }},
      }
    
      result, err := client.GraphQL().Get().
        WithClassName(className).
        WithFields(fields...).
        Do(context.Background())
    
      if err != nil {
        panic(err)
      }
      fmt.Printf("%v", result)
    }
    
    package technology.semi.weaviate;
    
    import technology.semi.weaviate.client.Config;
    import technology.semi.weaviate.client.WeaviateClient;
    import technology.semi.weaviate.client.base.Result;
    import technology.semi.weaviate.client.v1.graphql.model.GraphQLResponse;
    import technology.semi.weaviate.client.v1.graphql.query.fields.Field;
    
    public class App {
      public static void main(String[] args) {
        Config config = new Config("http", "localhost:8080");
        WeaviateClient client = new WeaviateClient(config);
    
        Field title = Field.builder().name("title").build();
        Field _additional = Field.builder()
          .name("_additional")
          .fields(new Field[]{
            Field.builder()
              .name("tokens (properties: [\"title\"], limit: 1, certainty: 0.7)")
              .fields(new Field[]{
                Field.builder().name("entity").build(),
                Field.builder().name("property").build(),
                Field.builder().name("word").build(),
                Field.builder().name("certainty").build(),
                Field.builder().name("startPosition").build(),
                Field.builder().name("endPosition").build()
              }).build()
          }).build();
    
        Result<GraphQLResponse> result = client.graphQL().get()
          .withClassName("Article")
          .withFields(title, _additional)
          .run();
    
        if (result.hasErrors()) {
          System.out.println(result.getError());
          return;
        }
        System.out.println(result.getResult());
      }
    }
    
    $ echo '{
      "query": "{
        Get {
          Article(
            limit: 1
          ) {
            title
            _additional{
              tokens(
                properties: [\"title\"],
                limit: 10,
                certainty: 0.7
              ) {
                certainty
                endPosition
                entity
                property
                startPosition
                word
              }
            }
          }
        }
      }"
    }' | curl \
        -X POST \
        -H 'Content-Type: application/json' \
        -d @- \
        http://localhost:8080/v1/graphql
    

    GraphQL response

    The result is contained in a new GraphQL _additional property called tokens, which returns a list of tokens. Each token contains the following fields:

    • entity (string): The entity group (the class assigned to the token)
    • word (string): The word that is recognized as an entity
    • property (string): The property in which the token was found
    • certainty (float): A value between 0.0 and 1.0 indicating how certain the model is that the token is correctly classified
    • startPosition (int): The zero-based position of the first character of the word in the property value
    • endPosition (int): The zero-based position directly after the last character of the word in the property value

    Example response

    {
      "data": {
        "Get": {
          "Article": [
            {
              "_additional": {
                "tokens": [
                  {
                    "property": "title",
                    "entity": "PER",
                    "certainty": 0.9894614815711975,
                    "word": "Sarah",
                    "startPosition": 11,
                    "endPosition": 16
                  },
                  {
                    "property": "title",
                    "entity": "LOC",
                    "certainty": 0.7529033422470093,
                    "word": "London",
                    "startPosition": 31,
                    "endPosition": 37
                  }
                ]
              },
              "title": "My name is Sarah and I live in London"
            }
          ]
        }
      },
      "errors": null
    }
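
    The positions in the example response can be checked against the title text. The sketch below is one possible way to flatten such a response and slice each word back out of its property value; extract_tokens is a hypothetical helper, and it assumes that every property named in a token (here title) was also requested in the query.

```python
from typing import Dict, List


def extract_tokens(response: dict, class_name: str) -> List[Dict]:
    """Collect the tokens of every returned object into one flat list.

    Each token gets an extra 'text' key holding the slice of the property
    value between startPosition and endPosition.
    """
    tokens = []
    for obj in response["data"]["Get"][class_name]:
        for token in obj["_additional"]["tokens"]:
            value = obj[token["property"]]
            # In the example response, endPosition points one past the last
            # character, so a plain Python slice recovers the word.
            text = value[token["startPosition"]:token["endPosition"]]
            tokens.append({**token, "text": text})
    return tokens


# Trimmed copy of the example response.
response = {
    "data": {"Get": {"Article": [{
        "_additional": {"tokens": [
            {"property": "title", "entity": "PER", "certainty": 0.9894614815711975,
             "word": "Sarah", "startPosition": 11, "endPosition": 16},
            {"property": "title", "entity": "LOC", "certainty": 0.7529033422470093,
             "word": "London", "startPosition": 31, "endPosition": 37},
        ]},
        "title": "My name is Sarah and I live in London",
    }]}},
    "errors": None,
}

for token in extract_tokens(response, "Article"):
    print(token["entity"], token["word"], token["text"])
```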
    

    Use another NER Transformers model from Hugging Face

    You can build a Docker image that supports any NER model from the Hugging Face model hub with a short Dockerfile. In the following example, we are going to build a custom image for the Davlan/bert-base-multilingual-cased-ner-hrl model.

    Step 1: Create a Dockerfile

    Create a new Dockerfile. We will name it my-model.Dockerfile. Add the following lines to it:

    FROM semitechnologies/ner-transformers:custom
    RUN chmod +x ./download.py
    RUN MODEL_NAME=Davlan/bert-base-multilingual-cased-ner-hrl ./download.py
    

    Step 2: Build and tag your image

    We will tag the image as davlan-bert-base-multilingual-cased-ner-hrl:

    docker build -f my-model.Dockerfile -t davlan-bert-base-multilingual-cased-ner-hrl .
    

    Step 3: That’s it!

    You can now push your image to your favorite registry or reference it locally in your Weaviate docker-compose.yaml using the Docker tag davlan-bert-base-multilingual-cased-ner-hrl.

    How it works (under the hood)

    The inference application behind this module works well with models that take a text input such as "My name is Sarah and I live in London" and return information in JSON format like this:

    [
      {
        "entity_group": "PER",
        "score": 0.9985478520393372,
        "word": "Sarah",
        "start": 11,
        "end": 16
      },
      {
        "entity_group": "LOC",
        "score": 0.999621570110321,
        "word": "London",
        "start": 31,
        "end": 37
      }
    ]
    

    The Weaviate NER module then takes this output and converts it into the GraphQL output.
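
    The conversion can be pictured as a field renaming plus the optional certainty and limit filters. The following is a minimal sketch under that assumption, not the module's actual code: the mapping is inferred by comparing the model output above with the GraphQL example response on this page, and to_graphql_tokens is a hypothetical name.

```python
from typing import Dict, List, Optional


def to_graphql_tokens(
    model_output: List[Dict],
    property_name: str,
    certainty: Optional[float] = None,
    limit: Optional[int] = None,
) -> List[Dict]:
    """Rename model fields to the names used in `_additional { tokens {} }`."""
    tokens = []
    for item in model_output:
        # Skip entities below the requested minimum certainty, if one is set.
        if certainty is not None and item["score"] < certainty:
            continue
        tokens.append({
            "property": property_name,
            "entity": item["entity_group"],
            "certainty": item["score"],
            "word": item["word"],
            "startPosition": item["start"],
            "endPosition": item["end"],
        })
    # Cap the number of tokens, if a limit is set.
    return tokens if limit is None else tokens[:limit]


model_output = [
    {"entity_group": "PER", "score": 0.9985478520393372, "word": "Sarah", "start": 11, "end": 16},
    {"entity_group": "LOC", "score": 0.999621570110321, "word": "London", "start": 31, "end": 37},
]
print(to_graphql_tokens(model_output, "title", certainty=0.7, limit=10))
```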

    More resources

    If you can’t find the answer to your question here, please look at the:

    1. Frequently Asked Questions.
    2. Knowledge base of old issues.
    3. Stackoverflow, for questions.
    4. Github, for issues.
    5. Slack channel, to ask your question directly.