Custom Tokenizer with Ranges

I need the ability to filter on ranges of time as predicate values. Our use case has many distinct time ranges on nodes, and we have tried several implementations to efficiently index and query these ranges. Basically, our query pattern is: given timespan X->Y, does a predicate contain any overlapping timerange? (Two ranges [X, Y] and [X', Y'] overlap iff X <= Y' and X' <= Y.)

I have tried using the geospatial index built into Dgraph to achieve this by drawing bounding boxes around a flat line representing a timeline (the mapping is sketched below). This seems to work, except:

  1. the query looks like trash:
    q(func: intersects(tpred,[[[1591901627,0],[1591901627,1],[1591901700,1],[1591901700,0],[1591901627,0]]])){uid}
  2. at large numbers, the geospatial index is horribly inaccurate, since only up to 18 tokens are generated to cover the range. Obviously this was done to closely match spherical math for use on a globe. My use case is cartesian, and therefore incompatible with this index. I can divide my times by 1e9, but that makes the query look even worse.
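To make the mapping explicit: the timespan becomes a 1-unit-tall rectangle on the flat timeline, and the query asks whether any stored rectangle intersects it. A hypothetical helper (buildIntersectsQuery is my own name, not part of Dgraph) might look like:

package main

import "fmt"

// buildIntersectsQuery turns the time range [from, to] into a 1-unit-tall
// rectangle on the flat timeline and asks whether any stored box on the
// geo-indexed predicate intersects it.
func buildIntersectsQuery(pred string, from, to int64) string {
	return fmt.Sprintf(
		`q(func: intersects(%s, [[[%d,0],[%d,1],[%d,1],[%d,0],[%d,0]]])) { uid }`,
		pred, from, from, to, to, from)
}

func main() {
	// Reproduces the query shown above.
	fmt.Println(buildIntersectsQuery("tpred", 1591901627, 1591901700))
}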

So next we tried a custom tokenizer. I first tried to basically replicate your builtin geo index on a cartesian plane, but that didn't solve the query looking like trash. So I switched the custom tokenizer's type to string and made up my own meaning for that string, e.g.:

    _:id1 <tpred> "1591901627:1591901700" .
    q(func: anyof(tpred,"mytokenizer","1591901690:1591901800")) { uid }
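(For completeness: the predicate is indexed with the custom tokenizer by name in the schema, along the lines of

    tpred: string @index(mytokenizer) .

as in the custom tokenizer docs.)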

The query looks great, but to achieve it I had to emit a ton of tokens from my tokenizer. For a month-long timespan I emitted 720 tokens (one per hour), and it was only accurate to the hour. Larger timespans emitted fewer tokens but reduced accuracy even further. There is also no way for me to represent a half-open timespan (time X->inf).
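For concreteness, here is a trimmed-down sketch of that hour-granularity Tokens() implementation, in the exported-Tokenizer() plugin shape Dgraph uses; the type name and hour math are mine, and IsSortable()/IsLossy() are omitted:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

type RangeTokenizer struct{}

func (RangeTokenizer) Name() string     { return "mytokenizer" }
func (RangeTokenizer) Type() string     { return "string" }
func (RangeTokenizer) Identifier() byte { return 0xfe } // arbitrary unique byte

// Tokens parses "start:end" (unix seconds) and emits one token per hour the
// range touches - 720 tokens for a 30-day span, and only hour accuracy.
func (RangeTokenizer) Tokens(v interface{}) ([]string, error) {
	s, ok := v.(string)
	if !ok {
		return nil, fmt.Errorf("expected string, got %T", v)
	}
	parts := strings.SplitN(s, ":", 2)
	if len(parts) != 2 {
		return nil, fmt.Errorf("expected start:end, got %q", s)
	}
	start, err := strconv.ParseInt(parts[0], 10, 64)
	if err != nil {
		return nil, err
	}
	end, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return nil, err
	}
	var toks []string
	for h := start / 3600; h <= end/3600; h++ {
		toks = append(toks, strconv.FormatInt(h, 10))
	}
	return toks, nil
}

// Tokenizer is the symbol Dgraph looks up when loading the plugin
// (built with go build -buildmode=plugin).
func Tokenizer() interface{} { return RangeTokenizer{} }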

Which brings me to my suggestion: the custom tokenizer interface is very simple and very easy to write a plugin for. However, if I were given a slightly expanded interface:

type Tokenizer interface {
	Name() string
	Type() string
	Identifier() byte
	IsSortable() bool
	IsLossy() bool

	Tokens(interface{}) ([]string, error)

	// Equal is the proposed addition: it reports whether a query token
	// matches a stored token, with the semantics left to the tokenizer.
	Equal(query, stored string) bool
}

Then my custom tokenizer plugin could decide for itself whether two tokens are equal, with equality meaning whatever that tokenizer needs it to mean. This would greatly improve the power of the tokenizer. I could, for instance, emit only one token (the timerange value itself) and use Equal() to judge whether the query timespan X->Y overlaps the stored token X'->Y' at all.
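To sketch what I mean (RangeTokenizer, parseRange, and the "start:inf" convention for half-open ranges are all made up for illustration, and this assumes the expanded interface above):

package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

type RangeTokenizer struct{}

// Tokens emits a single token: the raw "start:end" value itself.
func (RangeTokenizer) Tokens(v interface{}) ([]string, error) {
	s, ok := v.(string)
	if !ok {
		return nil, fmt.Errorf("expected string, got %T", v)
	}
	return []string{s}, nil
}

// Equal reports whether the query range overlaps the stored token.
// Two ranges [x, y] and [x', y'] overlap iff x <= y' && x' <= y.
func (RangeTokenizer) Equal(query, stored string) bool {
	qs, qe, err := parseRange(query)
	if err != nil {
		return false
	}
	ss, se, err := parseRange(stored)
	if err != nil {
		return false
	}
	return qs <= se && ss <= qe
}

// parseRange parses "start:end"; "start:inf" marks a half-open range.
func parseRange(s string) (start, end int64, err error) {
	parts := strings.SplitN(s, ":", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("expected start:end, got %q", s)
	}
	if start, err = strconv.ParseInt(parts[0], 10, 64); err != nil {
		return 0, 0, err
	}
	if parts[1] == "inf" {
		return start, math.MaxInt64, nil
	}
	if end, err = strconv.ParseInt(parts[1], 10, 64); err != nil {
		return 0, 0, err
	}
	return start, end, nil
}

This would make the index exact (no hour rounding), handle half-open ranges, and keep the index at one entry per value.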

With this change, a tokenizer like mine that currently has to emit a ton of tokens could emit just one, and it would probably help solve this issue.

Thoughts? Thanks!