Retrieve a set of predicates for all nodes

Here is some feedback on the proposed query. It works pretty well with after pagination, which scales perfectly (constant time per page):

{
  prop as var(func: has(prop), first: 1000, after: 0x0)
  edge as var(func: has(edge), first: 1000, after: 0x0)
  
  result (func: uid(prop,edge), first: 1000, after: 0x0) {
    uid
    prop
    edge { uid }
  }
}

Here are some numbers:

I retrieve 10 predicates for 1000 uids in 0.05s median time.
I retrieve 100 predicates for 1000 uids in 0.5s median time.
I retrieve 1000 predicates for 1000 uids in 3s median time.
I retrieve 10000 predicates for 1000 uids in 50s median time.

I retrieve 100 predicates for 100 uids in 0.05s median time.
I retrieve 100 predicates for 1000 uids in 0.5s median time.
I retrieve 100 predicates for 10000 uids in 3s median time.

Time is server_latency.total_ns here. The retrieval time scales with the size of the page (number of nodes and predicates) and is independent of the size of the un-paginated result set. So retrieving the first 1k nodes takes as long as retrieving the last 1k nodes, which is always under 50ms for 6 predicates and a result set of 40m nodes. This is impressive!

Pagination with offset, however, does not scale at all:

{
  prop as var(func: has(prop))
  edge as var(func: has(edge))
  
  result (func: uid(prop,edge), first: 1000, offset: 1000000) {
    uid
    prop
    edge { uid }
  }
}

This query scales with the size of the un-paginated result set, i.e. retrieving any page (even the first 1k nodes) takes as long as retrieving the entire result set. In my case this is 250s for any page of the 40m result set. Even when the result (func: uid(…)) part is limited with first: 1000, offset: 0, the query takes as long as for a much later page, e.g. first: 1000, offset: 1000000.

When I limit the var(func: has(…)) blocks, I get better performance for small offsets:

{
  prop as var(func: has(prop), first: 1001000)
  edge as var(func: has(edge), first: 1001000)
  
  result (func: uid(prop,edge), first: 1000, offset: 1000000) {
    uid
    prop
    edge { uid }
  }
}

This now scales with the offset, i.e. the position in the result set. This makes me think that, without limiting var(func: has(…)), the entire result set for each predicate is evaluated and only then is the paginated result (func: uid(…)) retrieved. I think the first: 1001000 optimization (offset plus page size) could be done by dgraph automatically. Anyway, this still scales linearly with the offset, so it is not optimal, whereas after is.
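To make the arithmetic behind first: 1001000 explicit, here is a minimal Python sketch that builds the bounded offset query for an arbitrary page. The function name build_offset_query and its parameters are my own, purely for illustration; it only generates the query text and does not talk to dgraph.

```python
def build_offset_query(offset: int, page_size: int = 1000) -> str:
    """Build the offset-paginated query with bounded var blocks.

    Hypothetical helper: each var block is capped at offset + page_size,
    which is the smallest limit that still covers the requested page.
    """
    var_limit = offset + page_size  # e.g. 1000000 + 1000 = 1001000
    lines = [
        "{",
        f"  prop as var(func: has(prop), first: {var_limit})",
        f"  edge as var(func: has(edge), first: {var_limit})",
        "",
        f"  result (func: uid(prop,edge), first: {page_size}, offset: {offset}) {{",
        "    uid",
        "    prop",
        "    edge { uid }",
        "  }",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_offset_query(1000000))
```

The var limit grows with the offset, which is why this variant can at best scale linearly with the position in the result set.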

Streaming the 40m result set with after in 1k batches will take about as long as retrieving the entire result set in one go, where the latter is prohibitive due to its size.
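For reference, the streaming loop I have in mind looks roughly like the Python sketch below. It assumes a run_query callable (hypothetical, e.g. a thin wrapper around whatever dgraph client you use) that takes the query text and returns the nodes of the result block as a list of dicts ordered by uid. The names build_page_query and stream_pages are made up for this example.

```python
from typing import Callable, Dict, Iterator, List


def build_page_query(after_uid: str, page_size: int = 1000) -> str:
    """Build the after-paginated query for the page following after_uid."""
    lines = [
        "{",
        f"  prop as var(func: has(prop), first: {page_size}, after: {after_uid})",
        f"  edge as var(func: has(edge), first: {page_size}, after: {after_uid})",
        "",
        f"  result (func: uid(prop,edge), first: {page_size}, after: {after_uid}) {{",
        "    uid",
        "    prop",
        "    edge { uid }",
        "  }",
        "}",
    ]
    return "\n".join(lines)


def stream_pages(
    run_query: Callable[[str], List[Dict]],  # assumption: returns the "result" nodes
    page_size: int = 1000,
) -> Iterator[List[Dict]]:
    """Yield pages of nodes, using the last uid of each page as the next cursor."""
    cursor = "0x0"  # start before the first uid, as in the query above
    while True:
        nodes = run_query(build_page_query(cursor, page_size))
        if not nodes:
            return
        yield nodes
        if len(nodes) < page_size:
            return  # a short page means the result set is exhausted
        cursor = nodes[-1]["uid"]
```

Each iteration only depends on the previous page's last uid, so the cost per page stays the same no matter how deep into the result set the cursor is.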
