String matching in Dgraph v0.7.4


(Tomasz Zdybał) #1

Recent release of Dgraph is packed with new features and improvements. Many of them are related to strings - full text search (with support for 15 languages!) and regular expressions matching was added, handling of string values in multiple languages was greatly improved. All those changes forms a consistent workspace for text searching in multilingual environment.

Values in Many Languages

We’re working hard to keep the query language clear convenient. Dgraph v0.7.4 adopted and extended the language tag syntax from RDF N-Quads standard. It is intuitive, well-known and was partially supported in previous versions (during data loading).

Let’s start from the beginning - the data. Dgraph uses RDF N-Quads for data loading and backup. String literals in N-Quads may be followed by @ sign and language tag, e.g. "badger"@en or "Dachs"@de. Multiple such literals my be used as a value for a single entity/attribute pair.

When querying for a predicate with multiple values, user is able to use the @lang notation known from RDF N-Quads. Moreover, many languages can be specified in a list of preference, e.g. @en:de denotes that most preferred language is English, but if such value is not present, value in German should be returned.

Language can be specified also in functions, which is important especially for full text search.

Example Data

As this post is string-oriented, minimum working example data set is used - data consists of movie titles in multiple languages, with no relations or additional information. Dataset is an extract of movie titles in English, Italian, and German from Freebase film data.

Example lines from data file:

<m.0114pmh6>    <name>  "I pirati della costa"@it       .
<m.0114pmh6>    <name>  "Küste der Piraten"@de  .
<m.0114pmh6>    <name>  "Pirates of the Coast"@en       .
<m.0116w1sc>    <name>  "Arrivederci Francesca"@it      .
<m.0116w1sc>    <name>  "Auf Wiedersehn, Franziska"@de  .
<m.0116w1sc>    <name>  "Goodbye, Franziska"@en .

Entire file can be downloaded.

Schema is very simple - it defines 3 types of indexes for name:

name: string @index(term, fulltext, exact) . 

term index is used for term matching with allofterms and anyofterms functions. It was the only string index available in previous releases of Dgraph.

fulltext index is self explanatory. One thing worth noting is, that values indexed with fulltext are processed according to their’s language (if they are tagged). If values are untagged, English is used as a default language.

exact index is necessary for regular expression matching.

Full Text Search (FTS)

Very Short Introduction to Natural Language Processing (NLP)

By definition (from Wikipedia):

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user).

This may sound trivial, but it’s not. Searching for exact form of the word is not always satisfying for the user. For example nouns can be singular or plural, verbs have grammatical tenses, etc., and user may be interested in all values related to the word, in any inflected or derived form.

The simple but powerful idea is to find a method, that can transform all the forms of a word to some common base. This process is called stemming. For many natural languages (including English) stemmers may be implemented using set of well known grammatical rules. There are also languages (like Polish) where dictionary based approach is required (i.e. inflected form -> stem mapping). Only for the first category of languages stemmers are widely available.

Another problem with search are the words that are very common, like the, is, at. In most cases searching for them gives enormous amount of results which are useless. Those words are called stop words. Again, stop words are language specific. The common method of handling those words is just to remove them from the search.

Dgraph FTS/NLP Processing

Following steps are applied to both data (while indexing) and the query pattern:

  1. Tokenization - text is divided into words.
  2. Normalization - all letters are transformed to lowercase, Unicode Normalization is applied.
  3. Stop words removal.
  4. Stemming.

Stop words lists we use contain inflected forms, and because of that stop words are removed before stemming.

Full Text Search Functions

There are two new functions for basic support of full text search:

  1. alloftext - searches for values that contains all the specified words (using NLP).
  2. anyoftext - searches for values that contains one or more of the specified tokens (using NLP).
Examples

Because of the simplicity of the example data I’m not using our cool new user interface and stick to the curl.

Let’s query for black or white, using the term matching function allofterms.

curl localhost:8080/query -XPOST -d $'
{
  me(func:allofterms(name@en, "black or white")) {
    name@en
    name@de
    name@it
  }
}' | python -m json.tool | less

The query gives no results. It may be worth trying less strict match with alloftext function:

curl localhost:8080/query -XPOST -d $'
{
  me(func:alloftext(name@en, "black or white")) {
    name@en
    name@de
    name@it
  }
}' | python -m json.tool | less

Query returns 9 results. This example shows that removing a stop word may help in some cases.

{
    "me": [
        {
            "name@de": "Black and White",
            "name@en": "Black and White",
            "name@it": "Black & White"
        },
        {
            "name@de": "Wei\u00dfer J\u00e4ger, schwarzes Herz",
            "name@en": "White Hunter, Black Heart",
            "name@it": "Cacciatore bianco, cuore nero"
        },
        {
            "name@de": "Schwarze Katze, wei\u00dfer Kater",
            "name@en": "Black Cat, White Cat",
            "name@it": "Gatto nero, gatto bianco"
        },
        {
            "name@de": "Stetson \u2013 Drei Halunken erster Klasse",
            "name@en": "The White, the Yellow, and the Black",
            "name@it": "Il bianco, il giallo, il nero"
        },
        {
            "name@de": "Pok\u00e9mon: Der Film: Schwarz - Victini und Reshiram / Wei\u00df - Victini und Zekrom",
            "name@en": "Pok\u00e9mon the Movie: Black\u2014Victini and Reshiram and White\u2014Victini and Zekrom",
            "name@it": "Pok\u00e9mon il film: Nero e Bianco"
        },
        {
            "name@de": "Wei\u00dfer Bim Schwarzohr",
            "name@en": "White Bim Black Ear",
            "name@it": "Bim bianco dall'orecchio nero"
        },
        {
            "name@de": "Sehnsucht nach Afrika",
            "name@en": "Black and White in Color",
            "name@it": "Bianco e nero a colori"
        },
        {
            "name@de": "Roy Orbison and Friends: A Black and White Night",
            "name@en": "Roy Orbison and Friends: A Black and White Night",
            "name@it": "Roy Orbison - Iends. a Black and White Night"
        }
    ]
}

In context of NLP, English is quite easy - there are no diacritics, inflection is rather simple, etc. So let’s try similar query in German:

curl localhost:8080/query -XPOST -d $'
{
  me(func:allofterms(name@de, "schwarz oder weiss")) {
	name@de
	name@en
	name@it
  }
}' | python -m json.tool | less

Again, the query doesn’t return any results.

The NLP-enabled version of this query:

curl localhost:8080/query -XPOST -d $'
{
  me(func:alloftext(name@de, "schwarz oder weiss")) {
	name@de
	name@en
	name@it
  }
}' | python -m json.tool | less

returns 4 results:

{
    "me": [
        {
            "name@de": "Wei\u00dfer J\u00e4ger, schwarzes Herz",
            "name@en": "White Hunter, Black Heart",
            "name@it": "Cacciatore bianco, cuore nero"
        },
        {
            "name@de": "Schwarze Katze, wei\u00dfer Kater",
            "name@en": "Black Cat, White Cat",
            "name@it": "Gatto nero, gatto bianco"
        },
        {
            "name@de": "Pok\u00e9mon: Der Film: Schwarz - Victini und Reshiram / Wei\u00df - Victini und Zekrom",
            "name@en": "Pok\u00e9mon the Movie: Black\u2014Victini and Reshiram and White\u2014Victini and Zekrom",
            "name@it": "Pok\u00e9mon il film: Nero e Bianco"
        },
        {
            "name@de": "Blancanieves - Ein M\u00e4rchen von Schwarz und Weiss",
            "name@en": "Snow White",
            "name@it": "Blancanieves"
        }
    ]
}

What is worth noting are the inflected forms of word schwarz - schwarzes and Schwartze.
Also the Wei\u00dfer is interesting - \u00df is the escaped Unicode value of grapheme ß.
weiss matched Weißer - the form is inflected, and grapheme equivalency is preserved.
Like in the English example, stop word (oder) is ignored.

Gotchas

In some cases, natural language processing can lead to surprising results.
Let’s search for the answer to the famous question: To be, or not to be?:

curl localhost:8080/query -XPOST -d $'
{
  me(func:alloftext(name@en, "To be, or not to be?")) {
	name@en
  }
}' | python -m json.tool | less

The query gives no results, while the term matching query:

curl localhost:8080/query -XPOST -d $'
{
  me(func:allofterms(name@en, "To be, or not to be?")) {
	name@en
  }
}' | python -m json.tool | less

gives two results:

{
    "me": [
        {
            "name@en": "To Be or Not to Be"
        },
        {
            "name@en": "To Be or Not to Be"
        }
    ]
}

What happened? To be, or not to be? consists of stop words only. After FTS/NLP processing, there are no movies that match the query.

Regular Expressions (regexp)

Regular expressions are extremely useful for creating sophisticated matchers.

For example, all titles starting with a word containing night but not knight may be matched using following query:

curl localhost:8080/query -XPOST -d $'
{
  me(func:regexp(name, "^[a-zA-z]*[^Kk ]?[Nn]ight")) {
	name@en
	name@de
	name@it
  }
}' | python -m json.tool | less

There are 84 results in the test dataset - too much to quote.

Summary

Dgraph supports many very powerful ways of string matching. I hope this post will help you choose the one best for you particular needs. Natural language processing employed for full text search may be the best choice for lookup based on users input. If more strict matching is required, term matching should give good results. And to get the most precise results of complicated text searches regular expressions can be used.


(Motakjuq) #2

you should post this entry into dgraph blog for not losing the information.


(Manish R Jain) #3

We’ll be posting this on Monday. @pawan @tzdybal