Add traversal timeouts and edge limits in queries

Moved from GitHub dgraph/5784

Posted by danielmai:

Experience Report

Note: Feature requests are judged based on user experience and modeled on Go Experience Reports. These reports should focus on the problems: they should not focus on, and need not propose, solutions.

What you wanted to do

I want to run queries where individual traversals can be given a timeout. This timeout can potentially trim the results of the query instead of returning all the data, but that’s OK. The use case has strict latency requirements, and the query shouldn’t run longer than, say, 50ms.

What you actually did

Dgraph currently supports query timeouts and cancellations. But these only return a message saying the query was cancelled and do not return any of the data that was processed within the timeout.

Why that wasn’t great, with examples

One of the queries I want to run is a recursive query. It tries to go deep enough into the graph to return results. I want to be able to run a query that stops after finding N connected nodes or after hitting a timeout.

One thing I can do today is run the recursive query multiple times with an increasing depth argument. But there’s no way to set a time limit while also getting a result if the time limit is reached.

{
  q(func: eq(id, "...")) @recurse(loop: false, depth: 2) {
     id
     connects
     ~connects
  }
}
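
As a rough illustration, here is what that workaround could look like from a Go client. This is only a sketch: it assumes the dgo client library and a local Alpha at localhost:9080, and the depth cap of 10 is arbitrary. The loop deepens the recursion until the overall deadline hits and keeps the last result that completed in time:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Overall latency budget for the whole sequence of queries.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	var last []byte // last result that completed within the budget
	for depth := 1; depth <= 10; depth++ { // arbitrary max depth
		q := fmt.Sprintf(`{
  q(func: eq(id, "...")) @recurse(loop: false, depth: %d) {
    id
    connects
    ~connects
  }
}`, depth)
		resp, err := dg.NewReadOnlyTxn().Query(ctx, q)
		if err != nil {
			break // deadline exceeded (or other error): keep the last good result
		}
		last = resp.Json
	}
	fmt.Printf("%s\n", last)
}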

Any external references to support your case

Gremlin has .timeLimit() and .limit() steps that can be used to do this trimming; this combination has been key to implementing traversal timeouts with result limits in other systems.

Example:

import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.*;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// client, identifiers, queryTimeout, and maxResults are defined elsewhere.
GraphTraversal<Vertex, Object> t = traversal()
                .withRemote(DriverRemoteConnection.using(client))
                .V(identifiers)
                .repeat(timeLimit(queryTimeout) // cap time spent per repetition
                        .out()
                        .hasLabel("connection")
                        .in()
                        .dedup()
                        .by(id())
                        .simplePath()
                        .timeLimit(25))
                .until(hasLabel("id"))
                .limit(maxResults)              // cap the number of results
                .id();

Adding a couple of ideas here.

Use the context library

We continue using the context library for timeouts but intercept the deadline-exceeded errors and instead return whatever was retrieved up to that point.

This approach seems easier to implement, as it would not require changes to the parser or the query language. However, the disadvantages of this approach are that a lot of the error-handling code would have to change, and forgetting to handle those errors in some places would let the error bubble up.
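
A minimal sketch of this interception, with hypothetical stand-ins (node, subGraph, processLevel) rather than Dgraph's actual internals:

package query

import (
	"context"
	"errors"
	"time"
)

// Hypothetical stand-ins for Dgraph's internal types.
type node struct{ uid string }
type subGraph struct{ levels [][]node }

// processLevel simulates work on one level and honors ctx cancellation.
func processLevel(ctx context.Context, level []node) ([]node, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err() // context.DeadlineExceeded once the timeout fires
	case <-time.After(10 * time.Millisecond): // simulated work
		return level, nil
	}
}

// traverse intercepts the deadline-exceeded error and returns the partial
// results gathered so far instead of letting the error bubble up.
func traverse(ctx context.Context, sg *subGraph) ([]node, error) {
	var results []node
	for _, level := range sg.levels {
		nodes, err := processLevel(ctx, level)
		if errors.Is(err, context.DeadlineExceeded) {
			return results, nil // trim: keep what was retrieved in time
		}
		if err != nil {
			return nil, err
		}
		results = append(results, nodes...)
	}
	return results, nil
}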

Implement this feature in the query language

Introduce a new directive @timelimit that can be set at any node in the query. The value passed to it would be the maximum time that processing that subgraph may take.
For example, the query in the previous post would look like this:

{
  q(func: eq(id, "...")) @recurse(loop: false, depth: 2) @timelimit(1s) {
     id
     connects
     ~connects
  }
}

Internally, the subgraphs would include the time limit and a start time, and compare the elapsed time against the limit before doing any work. If the maximum time has elapsed, an empty result is sent instead of a “deadline exceeded” error that propagates through the call stack.
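
A rough sketch of that check, again with hypothetical types rather than Dgraph's actual internals:

package query

import "time"

// Hypothetical shape: each subgraph carries its @timelimit value and the
// time at which processing started.
type subGraph struct {
	timeLimit time.Duration // parsed from @timelimit; zero means no limit
	start     time.Time     // recorded when processing of this subgraph began
}

type result struct{ nodes []string }

func (sg *subGraph) expired() bool {
	return sg.timeLimit > 0 && time.Since(sg.start) > sg.timeLimit
}

// process compares the times before doing any work and, once the limit
// has elapsed, returns an empty result rather than an error.
func process(sg *subGraph) *result {
	if sg.expired() {
		return &result{}
	}
	// ... expand the subgraph as usual ...
	return &result{nodes: []string{"..."}}
}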

Ideally, the directive can be included at any level of the graph. This requires that we define the behavior when a subgraph includes the directive and a parent of this subgraph includes a different value for it.

This option requires more changes but is also more flexible, as it allows us to have different limits for specific subgraphs, and a similar mechanism could be used to support other types of limits.

Both solutions are soft limits. In the case of context-based cancellation, the deadline is currently checked before doing the task; for the other approach, the check would also need to happen beforehand. So it’s possible for queries to run longer than the limit if the last level of the graph takes longer to process than the time remaining.

Could we see what that might look like? It seems like this simpler approach would apply to /GQL as well, since they only support native GQL. Your other approach would be GQL±-specific, right?

If we go down this path, maybe a simpler initial feature would be one global time limit instead of a time limit at each level/subgraph.

Right. See if one global time limit at the root meets @dmai’s requirements and brings us on par with Gremlin.

Is there a way to “estimate” the query cost of a subgraph based on the structure of the subgraph itself?

Other notes:

  1. What about multiple query blocks? var blocks?
  2. How can we make this applicable to both GQL and GQL±?
  3. We need some way of encoding whether the time limit was reached, maybe in the response’s extensions struct, which carries other stats (see the sketch after this list).
  4. With a time limit, query output will be non-deterministic, so we need to be cognizant of this.
  5. What granularity of time limit do we allow? Seconds, milliseconds, microseconds?
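
For note 3, one hypothetical encoding: a new field alongside the existing stats in the response’s extensions object (the time_limit_reached name and the server_latency contents shown here are illustrative, not an existing API):

{
  "data": { "q": [ ... ] },
  "extensions": {
    "server_latency": { "total_ns": 50000000 },
    "time_limit_reached": true
  }
}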

@pawan should also look at this.