Below is a summary of our work in representing range
types in dgraph- specifically time ranges for temporal graphs.
Use case
Imagine we have a network topology where a VM runs on certain hosts for discrete ranges of time
t1-t5
t8-11
vm1 - [runs_on] -> host1
t5-t8
vm1 - [runs_on] -> host2
It is important for us to be able to draw the topology as it existed at any particular moment in time. Additionally, we need to answer questions alike “what hosts did vm1 run on between t7 and t15?”
Attempt 1: Facets
We first looked towards facets as an ergonomic fit, imagining something like:
@facet timeranges = [
{start: t1, end: t5},
{start: t8, end: t11}
]
vm1 ----------- [runs_on] ------------> host1
@facet timeranges = [
{start: t5, end: t8}
]
vm1 ----------- [runs_on] ------------> host2
However, facets do not support complex objects, nor lists. And even if they did, we worried that an implementation using facets would be non-performant because 1) they’re second class and 2) we time-filter on every edge, every query.
Attempt 2: Temporal nodes
Our next approach involved sticking a temporal
node as an extra hop in every relationship. This new intermediary node could hold the timerange information
{start: t1, end: t5}
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host1
{start: t8, end: t11}
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host1
{start: t5, end: t8}
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host2
Because each temporal node has just one start
and one end
, we can use lt()
and gt()
queries to do the filtering. That said, we worried that the blow-up of nodes would be a performance bottleneck, so we moved on from this idea. However, with a deeper understanding of the way dgraph stores predicates, maybe this would have been fine. One thing to note is the query response would be quite large, since every “path” through the topology would be present in its non-reduced form.
Attempt 3: Temporal node with geo predicate
Next we tried collapsing the temporal nodes so there would only ever be one temporal node for each triple, containing all the valid timeranges. Our timeranges are discrete ranges on a numberline, so we we’re attracted to the intersects
function that geo provides.
We created a polygon on the temporal
node, representing the valid timeranges in 2d space. This didn’t work out since the indexing behind the scenes for geo is tuned for geography instead of geometry. Overall this was a bit of a longshot.
Attempt 4: Temporal node with custom tokenizer
Our current approach involves one intermediary temporal node for each unique triple, and using a custom tokenizer to hold the time range information. Our tokenizer parses a string of the format start:end
, and indexes them at a 12h step.
[ "1:5", "8:11" ]
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host1
[ "5:8" ]
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host2
Although a bit hacky, this works fairly well. Unfortunately by using a custom tokenizer, we loose out on the ability for dgraph to “trim” things that returned from the index but technically fall outside the bounds of the query timerange.
Outcome
We’re unsure what the best path forward is. A native generic range
type that supports intersections would make our last solution more digestible, but in general:
- Is there a better way to model this use case that we haven’t tried?
- If not, is there potential for new functionality that benefits dgraph as a whole?
@mrjn @gja Here’s a summary of the timerange use-case we mentioned on the call. Obviously we’d prefer any approach that sidesteps the custom tokenizer. We have flexibility in changing our model if you believe there is a better approach
Addendum
Just started thinking about a riff on #2 that would look a bit like this, which I think is now possible with parameterized cascade. I’m optimistic about this approach. We’ll give it a shot and respond with findings.
{start: t1, end: t5} {start: t8, end: t11}
\ /
[has_range] [has_range]
\ /
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host1
{start: t5, end: t8}
|
[has_range]
|
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host2