Bleve is used for tokenization (in Dgraph's terminology). Custom bleve analyzers provide language-dependent stemming. External stop word lists are used instead of those shipped with bleve, to ensure that stop words are available for every language-dependent stemmer (which is not the case in bleve).
Indexes are per attribute and carry no language information (so a single token may correspond to values in multiple languages). No information about token position or count is stored.
Currently an extra processing step is used to filter out values that matched in the wrong language. For example, for
allofterms(name@pl, "Badger") the index returns a match for
<badger> <name> "Badger"@en .
It is possible to implement per-language indexing, where the language is stored together with the token. This way, the extra filtering step can be removed (a performance gain is expected).
Possible new features
Multiple changes are required to enable matching against many attributes at a time. First of all, new syntax is required. Some proposals:
alloftext(attr1, attr2, ...., attrN, "query string")
alloftext([attr1, attr2, ...., attrN], "query string")
alloftext("attr1 attr2, ...., attrN", "query string")
The processing of multi-attribute matching has to be implemented. There are at least two possibilities:
- Introduce multi-attribute index.
- Use current single-attribute indices.
The second option seems relatively simple. For
anyoftext, index results for all attributes can simply be merged. This is equivalent to
anyoftext(attr1, "query string") OR anyoftext(attr2, "query string") OR ... OR anyoftext(attrN, "query string")
In the case of
alloftext, processing is more complicated: index results of all attributes have to be merged per token, and then the intersection of those merged uid lists needs to be calculated.
The following example data and filter function show why such processing is necessary:
<movie1> <name> "Django Unchained"
<movie1> <descr> "History of a bounty hunter"
allofterms([name, descr], "Django bounty")
Term weighting is a prerequisite for many NLP functionalities, and can also be used standalone for text analysis.
Currently, no statistics about tokenized values are gathered. Counting token occurrences is the simplest statistic, and it can be used for TF-IDF calculation. The TF-IDF implementation from bleve can be reused, because it does not depend on bleve indexing; we can provide our own statistics. A simple TF (term frequency) may also be used.
Idea to check: to avoid global “counting”, statistics may be built just-in-time using only the matching values (as we are going to process those values anyway).
There are different scoring methods.
- Term weights and TF-IDF may be used to calculate the score for a matching value.
- Edit distance (Levenshtein Distance) may also be used.
Fuzzy search is a very broad term; there are many approaches to the problem of approximate string matching.
Trigram indexes can be used for initial matching.
Fuzzy search may be implemented using
anyoftext with scoring. If Levenshtein distance is used for scoring, no statistics are required (this is probably the simplest solution). For values tokenized in a language-agnostic manner, we may calculate the edit distance over tokens rather than letters/characters. On the other hand, using letters is useful for user input that may contain typos.
Probably the simplest way to implement the new NLP features is to execute the following steps:
- Find matches using index.
- Retrieve values of matched strings and calculate some metrics.
- Sort and limit the results.
Different metrics could be used for scoring. Some of them require extra metadata extracted from all values; others can be calculated online, during query processing.