Retrieve a set of predicates for all nodes

Hi Devs,

I have a performance question on retrieving all nodes having any of a given set of predicates and the respective values.

When the set has only a single predicate, I can do the following (when it is a property having a literal value):

{
  result (func: has(prop)) {
    uid
    prop
  }
}

and this when it is an edge leading to another node:

{
  result (func: has(edge)) {
    uid
    edge { uid }
  }
}

When I have multiple predicates, I cannot use func: has(...) because it supports only a single predicate. Then my query looks like:

{
  result (func: has(dgraph.type)) @filter(has(prop) OR has(edge)) {
    uid
    prop
    edge { uid }
  }
}

I have a few questions:

  • What performance impact should I expect when I increase the size of the schema or graph tenfold (assuming a fixed result size)? I presume the query performance scales linearly with the result size.
  • Is there a better alternative to @filter? I have experienced very poor performance on a small graph. Or should that scale well?
  • When I send those queries to alphas that do not host some of the predicates, the query is forwarded to alphas of the group that does host them. Would you say there is a benefit to having only predicates of one group in such a query and sending it to alphas of that group only, given that this avoids extra Dgraph cluster communication? Or would you argue that overhead is negligible?

If you want performance, you should use multiple blocks, because Dgraph will run them concurrently.

e.g.:

{
  prop as var(func: has(prop))
  edge as var(func: has(edge))
  
  result (func: uid(prop,edge)) {
    uid
    prop
    edge { uid }
  }
}

BTW, you have used “graphql” in the tags. This topic isn’t about GraphQL.


That looks promising, I will give it a try.

Will this work nicely with first: ..., offset: ... and for a hundred or a thousand predicates in one query? Or will the cluster hate me for that?

Pagination works well with indexing. If you use it without an index, it will work, but with a non-deterministic result, because it is based on UID leasing, and UIDs are assigned in a non-deterministic (though increasing) order.
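For example, deterministic pagination could look like this (just a sketch; it assumes a name predicate with a sortable index, which is not part of the schema above):

{
  result (func: has(prop), orderasc: name, first: 1000, offset: 1000) {
    uid
    name
    prop
  }
}

With the sort anchored to an indexed value rather than to UID assignment, the same offset always yields the same page.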

Indexation as in Indexing in Dgraph? So those indices are used for has(predicate) even though that function does not look up any value?

With indeterministic, do you refer to ascending uid order, which is non-deterministic w.r.t. which node gets which uid? This is fine, as long as the result set is deterministic, meaning I always get the same order (though it is an arbitrary order).

I was talking about the pagination point only. I don’t think the has function uses index tables. Maybe it should; that could shorten the lookup path. I had thought of that before.

No. When you send mutations or batches of data to Dgraph, it assigns UIDs to each entity. That process is non-deterministic: there is no guaranteed order in which entities receive their UIDs, especially since mutations can run concurrently.

e.g.:

Take a root entity with multiple nested branches. There is no deterministic way to create this tree with logically aligned UIDs (unless you assign the UIDs manually yourself); branches can interleave their UID ranges. That is, a node belonging to a list you want to paginate may be the first one defined in your application, yet involuntarily receive a UID that is not logically the first one in that list, and vice versa. That’s why using indexes is better.

Yeah, for that the result will always be the same.

Thanks for the quick response. As I am not concerned about the actual uid assignment but only the correct and efficient retrieval of data as is, this should work pretty nicely for me. Thanks!


To give you some feedback on the proposed query: it works pretty well with after pagination, which scales perfectly (constant time per page):

{
  prop as var(func: has(prop), first: 1000, after: 0x0)
  edge as var(func: has(edge), first: 1000, after: 0x0)
  
  result (func: uid(prop,edge), first: 1000, after: 0x0) {
    uid
    prop
    edge { uid }
  }
}

Here are some numbers:

I retrieve 10 predicates for 1000 uids in 0.05s median time.
I retrieve 100 predicates for 1000 uids in 0.5s median time.
I retrieve 1000 predicates for 1000 uids in 3s median time.
I retrieve 10000 predicates for 1000 uids in 50s median time.

I retrieve 100 predicates for 100 uids in 0.05s median time.
I retrieve 100 predicates for 1000 uids in 0.5s median time.
I retrieve 100 predicates for 10000 uids in 3s median time.

Time is server_latency.total_ns here. The retrieval time scales with the size of the page (number of nodes and predicates) and is independent of the length of the un-paginated result set. So retrieving the first 1k nodes takes as long as retrieving the last 1k nodes, which is always under 50 ms for 6 predicates and a 40m-node result set. This is impressive!
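For completeness, this is how I advance the cursor: the after value of the next page is the last uid returned by the previous page (a sketch; 0x3e8 stands for whatever that last uid happens to be):

{
  prop as var(func: has(prop), first: 1000, after: 0x3e8)
  edge as var(func: has(edge), first: 1000, after: 0x3e8)

  result (func: uid(prop,edge), first: 1000, after: 0x3e8) {
    uid
    prop
    edge { uid }
  }
}

Because after is a plain lower bound on the UID, each page only scans forward from the cursor instead of re-evaluating everything before it.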

Pagination with offset, however, does not scale at all:

{
  prop as var(func: has(prop))
  edge as var(func: has(edge))
  
  result (func: uid(prop,edge), first: 1000, offset: 1000000) {
    uid
    prop
    edge { uid }
  }
}

This query scales with the number of results without pagination, i.e. retrieving any page (even the first 1k nodes) takes as long as retrieving the entire result set. In my case this is 250 s for any page of the 40m result set. Even though the result (func: uid(…)) part is limited with first: 1000, offset: 0, the query takes as long as with the last page, e.g. first: 1000, offset: 1000000.

When I limit the var(func: has(…)) blocks, I get better performance for small offsets:

{
  prop as var(func: has(prop), first: 1001000)
  edge as var(func: has(edge), first: 1001000)
  
  result (func: uid(prop,edge), first: 1000, offset: 1000000) {
    uid
    prop
    edge { uid }
  }
}

This now scales with the offset, i.e. the position in the result set. This makes me think that without limiting var(func: has(…)), the entire result set for each predicate is evaluated, only to then retrieve the paginated result (func: uid(…)). I think the first: 1001000 optimization could be done by Dgraph automatically. Anyway, since this still scales linearly it is not optimal, whereas after is.

Streaming the 40m result set with after in 1k batches takes as long in total as retrieving the entire result set at once, but the latter is prohibitive due to its size.


Hey, Enrico. Glad it worked for you.

Can you file an issue for the offset performance case? (I’m not 100% sure, but maybe there is already an issue for this. Please search for it and append your experience there.)

Sure, will do that. Thanks for the help!

For reference, I raised feature request #5807 and bug report #5808.