Feedback: Range types

Below is a summary of our work in representing range types in dgraph- specifically time ranges for temporal graphs.

Use case

Imagine we have a network topology where a VM runs on certain hosts for discrete ranges of time

        t1-t5
        t8-11
vm1 - [runs_on] -> host1

        t5-t8
vm1 - [runs_on] -> host2

It is important for us to be able to draw the topology as it existed at any particular moment in time. Additionally, we need to answer questions alike “what hosts did vm1 run on between t7 and t15?”

Attempt 1: Facets

We first looked towards facets as an ergonomic fit, imagining something like:

     @facet timeranges = [
           {start: t1, end: t5}, 
           {start: t8, end: t11} 
         ]
vm1 ----------- [runs_on] ------------> host1

        @facet timeranges = [
           {start: t5, end: t8}
         ]
vm1 ----------- [runs_on] ------------> host2

However, facets do not support complex objects, nor lists. And even if they did, we worried that an implementation using facets would be non-performant because 1) they’re second class and 2) we time-filter on every edge, every query.

Attempt 2: Temporal nodes

Our next approach involved sticking a temporal node as an extra hop in every relationship. This new intermediary node could hold the timerange information

                  {start: t1, end: t5}
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host1

                  {start: t8, end: t11}
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host1

                  {start: t5, end: t8}
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host2

Because each temporal node has just one start and one end, we can use lt() and gt() queries to do the filtering. That said, we worried that the blow-up of nodes would be a performance bottleneck, so we moved on from this idea. However, with a deeper understanding of the way dgraph stores predicates, maybe this would have been fine. One thing to note is the query response would be quite large, since every “path” through the topology would be present in its non-reduced form.

Attempt 3: Temporal node with geo predicate

Next we tried collapsing the temporal nodes so there would only ever be one temporal node for each triple, containing all the valid timeranges. Our timeranges are discrete ranges on a numberline, so we we’re attracted to the intersects function that geo provides.

We created a polygon on the temporal node, representing the valid timeranges in 2d space. This didn’t work out since the indexing behind the scenes for geo is tuned for geography instead of geometry. Overall this was a bit of a longshot.

Attempt 4: Temporal node with custom tokenizer

Our current approach involves one intermediary temporal node for each unique triple, and using a custom tokenizer to hold the time range information. Our tokenizer parses a string of the format start:end, and indexes them at a 12h step.

                    [ "1:5", "8:11" ]
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host1

                        [ "5:8" ]
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host2

Although a bit hacky, this works fairly well. Unfortunately by using a custom tokenizer, we loose out on the ability for dgraph to “trim” things that returned from the index but technically fall outside the bounds of the query timerange.

Outcome

We’re unsure what the best path forward is. A native generic range type that supports intersections would make our last solution more digestible, but in general:

  1. Is there a better way to model this use case that we haven’t tried?
  2. If not, is there potential for new functionality that benefits dgraph as a whole?

@mrjn @gja Here’s a summary of the timerange use-case we mentioned on the call. Obviously we’d prefer any approach that sidesteps the custom tokenizer. We have flexibility in changing our model if you believe there is a better approach :slight_smile:

Addendum

Just started thinking about a riff on #2 that would look a bit like this, which I think is now possible with parameterized cascade. I’m optimistic about this approach. We’ll give it a shot and respond with findings.

        {start: t1, end: t5}     {start: t8, end: t11}
                      \           /
                 [has_range]  [has_range]  
                        \       /
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host1
                  
     
                  {start: t5, end: t8}
                            |
                       [has_range]
                            |
vm1 --- [runs_on] ------ temporal ------ [runs_on] ---> host2
1 Like

It would appear the temporal node from @seanlaff’s post having relationships to nodes with time ranges on them seems extremely promising with the directive @cascade(has_range) and a filter on each temporal node’s has_range predicate.

If we had a range type and proper inequality filters for it, we could have a list of that type on the temporal node which we could use as the filter. Are there any drawbacks to having alot more nodes (one per range per relationship) vs keeping a longer list on the temporal node itself?

1 Like