Support Comparison in Custom Tokenizers

Moved from GitHub dgraph/5669

Posted by seanlaff:

Custom tokenizers are awesome, but their utility is hampered by only being queryable through the anyof and allof functions. This restriction means users cannot query at a finer granularity than the index.

Proposal: Allow custom tokenizers to implement an interface akin to the CompareVals() func, supporting ge, gt, le, lt, eq.
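To make the proposal concrete, here is a rough sketch of what such an interface could look like. The Tokenizer methods follow the custom tokenizer plugin interface as I understand it; ComparableTokenizer, its Compare method, the Satisfies helper, and decadeTokenizer are all hypothetical names made up for illustration, not anything dgraph provides.

```go
package main

import (
	"fmt"
	"strconv"
)

// Tokenizer lists the methods a custom tokenizer plugin is expected to expose
// (as I understand the plugin interface; treat this as an assumption).
type Tokenizer interface {
	Name() string
	Type() string
	Identifier() byte
	Tokens(value interface{}) ([]string, error)
}

// ComparableTokenizer is HYPOTHETICAL: the extension this post proposes.
// Compare returns a negative, zero, or positive number, like strings.Compare.
type ComparableTokenizer interface {
	Tokenizer
	Compare(a, b interface{}) (int, error)
}

// Satisfies is a hypothetical helper showing how ge, gt, le, lt and eq
// could all be derived from a single Compare call.
func Satisfies(t ComparableTokenizer, op string, stored, query interface{}) (bool, error) {
	c, err := t.Compare(stored, query)
	if err != nil {
		return false, err
	}
	switch op {
	case "eq":
		return c == 0, nil
	case "lt":
		return c < 0, nil
	case "le":
		return c <= 0, nil
	case "gt":
		return c > 0, nil
	case "ge":
		return c >= 0, nil
	}
	return false, fmt.Errorf("unsupported comparison %q", op)
}

// decadeTokenizer is a toy example: it indexes integers by tens (a coarse
// index) but compares the exact values.
type decadeTokenizer struct{}

func (decadeTokenizer) Name() string     { return "decade" }
func (decadeTokenizer) Type() string     { return "int" }
func (decadeTokenizer) Identifier() byte { return 0xfd } // arbitrary value for illustration

func (decadeTokenizer) Tokens(v interface{}) ([]string, error) {
	n, ok := v.(int64)
	if !ok {
		return nil, fmt.Errorf("expected int64, got %T", v)
	}
	return []string{strconv.FormatInt(n/10, 10)}, nil
}

func (decadeTokenizer) Compare(a, b interface{}) (int, error) {
	x, okA := a.(int64)
	y, okB := b.(int64)
	if !okA || !okB {
		return 0, fmt.Errorf("expected int64 values, got %T and %T", a, b)
	}
	switch {
	case x < y:
		return -1, nil
	case x > y:
		return 1, nil
	}
	return 0, nil
}
```

With this toy tokenizer, 23 and 27 share the index token "2", yet Satisfies can still tell them apart, which is exactly the "more granular than the index" behavior being asked for.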

Example

Imagine we have a project like dgraph that is switching from semantic versioning to calendar versioning. We track bugs and each bug has a version predicate. For compatibility, we’re going to publish both a semVer and a calVer copy of each release for a while to ease the transition.

Say we track bugs in dgraph, and we want to make the querying experience as easy for our users as possible, so we create a new custom tokenizer softwareVersionTokenizer to handle unifying the two different models (i.e. v20.03.3 is equivalent to 3.3.3).

Problem

If I wanted bugs that were in v20.03.3, the best query I can write is

q(func: anyof(version, softwareVersion, "v20.03.3"))

However, what if the indexing strategy of the tokenizer is by, say, the year field? I would get back results for minor versions I didn’t want (v20.03.2, v20.03.1, etc). The user needs knowledge of the inner workings of the tokenizer.

If the custom tokenizer supported the CompareVals interface, we could write queries like

q(func: eq(version, "v20.03.3"))

which could return both v20.03.3 bugs and 3.3.3 bugs.

or queries like this

q(func: gt(version, "v20.03.3"))

which could return bugs from calVer > v20.03.3 and semVer > 3.3.3
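For illustration, here is a minimal sketch of the comparison logic such a tokenizer might supply for eq and gt. It assumes a hand-maintained alias table from semVer to calVer; only the 3.3.3 / v20.03.3 pair comes from the example above, the other entry is a placeholder, and the names semToCal, canonical and CompareVersions are made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// semToCal is a HYPOTHETICAL alias table mapping semVer releases to their
// calVer equivalents. Only "3.3.3" comes from the post; "3.4.0" is a placeholder.
var semToCal = map[string]string{
	"3.3.3": "v20.03.3",
	"3.4.0": "v20.07.0",
}

// canonical converts a version in either scheme into three comparable numbers
// (year, month, patch), using the alias table for semVer inputs.
func canonical(v string) ([3]int, error) {
	var out [3]int
	if cal, ok := semToCal[v]; ok {
		v = cal
	}
	if !strings.HasPrefix(v, "v") {
		return out, fmt.Errorf("unknown version %q", v)
	}
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("malformed version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("malformed version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// CompareVersions returns -1, 0 or 1 after mapping both sides to calVer,
// so "3.3.3" and "v20.03.3" compare as equal.
func CompareVersions(a, b string) (int, error) {
	ca, err := canonical(a)
	if err != nil {
		return 0, err
	}
	cb, err := canonical(b)
	if err != nil {
		return 0, err
	}
	for i := range ca {
		switch {
		case ca[i] < cb[i]:
			return -1, nil
		case ca[i] > cb[i]:
			return 1, nil
		}
	}
	return 0, nil
}
```

The point is that the user never has to know how the tokenizer buckets versions in the index; the comparison function carries the domain knowledge.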

Other examples

We have a use case as described here (see Custom Tokenizer with Ranges) for tracking a list of start:end time ranges in a predicate via a custom tokenizer. We index them at 12h resolution to keep the index a reasonable size. Ideally we’d like to provide logic that allows dgraph to do a first-pass search against our index, and then use our comparison funcs to narrow down to the exact matches (akin to how the native date index works).

Limitations

From what I’ve seen, the matching token keys from badger are read in lexicographically sorted order when doing a < or >, which may not make sense depending on what tokens are generated. Additionally, there are some short-circuits if eq is used against something that returns multiple tokens, which doesn’t make sense for all tokenizers.

Maybe what I’m describing here is actually custom types? Or maybe custom query functions? Perhaps both? Hard to say what the right direction is.
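A small sketch of the ordering concern: if tokens are plain strings, byte-wise sorting can disagree with the order the tokenizer author has in mind. The token formats below are hypothetical; zero-padding is just one possible workaround and only holds while each field stays within its padded width.

```go
package main

import (
	"fmt"
	"sort"
)

// NaiveToken is a hypothetical token format with no padding.
func NaiveToken(major, minor int) string {
	return fmt.Sprintf("%d.%d", major, minor)
}

// PaddedToken zero-pads each field to four digits so that lexicographic
// order matches numeric order for values up to 9999.
func PaddedToken(major, minor int) string {
	return fmt.Sprintf("%04d.%04d", major, minor)
}

// SortedTokens builds tokens with tok and sorts them byte-wise, the way
// lexicographically ordered keys would be iterated.
func SortedTokens(tok func(int, int) string, versions [][2]int) []string {
	out := make([]string, 0, len(versions))
	for _, v := range versions {
		out = append(out, tok(v[0], v[1]))
	}
	sort.Strings(out)
	return out
}
```

For versions 20.3 and 20.10, the naive tokens sort as "20.10" before "20.3", while the padded tokens keep 20.3 first. A tokenizer-supplied comparison would sidestep this only if the range scan itself did not rely on key order.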

Anyway, what I’m trying to get at is there’s a lot of power in extending dgraph, and I would like to explore that further :slight_smile:
