Sample your result

Experience Report for Feature Request

What you wanted to do

The Spark Dgraph Connector reads the entire Dgraph database and loads it into Spark for further processing. For this to scale, the graph is split into sub-graph partitions. The smallest partition is a single predicate, but a single predicate could still be the entire graph (e.g. a dataset with un-typed links in the WWW).

To further partition the data, the connector paginates the result into chunks of uids. Pagination with offset does not scale and only after has shown constant access time: Pagination with `offset` should scale like `after`.

Where after allows only for iterating over the result set, offset supports random access to equal sized chunks of uids and to read an arbitrary page. If you would know the first uid of such a page, you could also random-access the page via after and benefit from its constant access time.

I would like to suggest a feature that allows to retrieve a sample of the result uids only. Consider the following query:

pred1 as var(func: has(<dgraph.graphql.schema>))
pred2 as var(func: has(<dgraph.graphql.xid>))
pred3 as var(func: has(<dgraph.type>))

result (func: uid_sample(1000, pred1, pred2, pred3)) {
  uid
}

This is generally useful to sample your result set by retrieving every N-th result and not the first k results (in uid order).

In my use-case, this returns every 1,000th uid of the result only (in uid order). For each uid (e.g. 0x123) in this set of uids, I can then read a chunk of the full result with constant access time:

pred1 as var(func: has(<dgraph.graphql.schema>), first: 1000, after: 0x123)
pred2 as var(func: has(<dgraph.graphql.xid>), first: 1000, after: 0x123)
pred3 as var(func: has(<dgraph.type>), first: 1000, after: 0x123)

result (func: uid(pred1,pred2,pred3), first: 1000, after: 0x123) {
  uid
  <dgraph.graphql.schema>
  <dgraph.graphql.xid>
  <dgraph.type>
}

For my use-case, an approximation of sample would also be fine, maybe something similar to Implement approximate counts using HyperLogLog++ algorithm to get this scale sub-linear with the (non-sampled) result size.

2 Likes