EnricoMi commented:
I am currently looking into this. I have created a very simple connector that can load data from DGraph into Spark. I have a few fundamental questions / requirements to get to a robust solution.
Spark is a massively parallel compute cluster, so hundreds or thousands of concurrent reads would hit the DGraph cluster. For this to work, the Spark connector 1) needs to know the exact partitioning and the available alphas in DGraph, and 2) needs to be able to query / read non-overlapping, deterministic parts (partitions) of the data. Is there a way to access the partitioning and cluster setup through the gRPC endpoint? What I would need are the groups, the predicate range per group, and the alphas per group. Would I at all be able to query a given alpha directly? Handling a change of this partitioning will be challenging.
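To illustrate, this is roughly the topology discovery I have in mind. It is a minimal sketch assuming the cluster state can be fetched from a Zero's HTTP `/state` endpoint; the JSON field names (`groups`, `members`, `tablets`, `addr`) are my assumptions from reading the docs, not a confirmed API:

```scala
import scala.io.Source
import scala.jdk.CollectionConverters._
import com.google.gson.JsonParser

case class Group(id: String, alphas: Seq[String], predicates: Seq[String])

// Hypothetical sketch: read group -> (alphas, predicates) from Zero's /state.
// Endpoint and field names are assumptions, not a verified contract.
def clusterState(zeroHttpAddr: String): Seq[Group] = {
  val json = Source.fromURL(s"http://$zeroHttpAddr/state").mkString
  val groups = JsonParser.parseString(json).getAsJsonObject.getAsJsonObject("groups")
  groups.entrySet().asScala.toSeq.map { e =>
    val group = e.getValue.getAsJsonObject
    val alphas = group.getAsJsonObject("members").entrySet().asScala
      .map(_.getValue.getAsJsonObject.get("addr").getAsString).toSeq
    val predicates = group.getAsJsonObject("tablets").keySet().asScala.toSeq
    Group(e.getKey, alphas, predicates)
  }
}
```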
The next thing is to be able to read a fraction of the graph. Let's consider two use-cases: I) read the entire graph and II) given a GraphQL query, read the entire result set. For each of these use-cases there needs to be a way to retrieve a fixed fraction (in Spark called a partition; not 1:1 a DGraph partition, but for simplicity and performance a Spark partition would be a well-defined subset of a DGraph partition). How could I read a single partition of DGraph data? The simplest approach would be `has(PROPERTY)` for the set of properties in a partition.
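To make that concrete, here is a sketch of the per-partition query I have in mind: one block per predicate assigned to the Spark partition. The query shape is my assumption of how such a read could look, not a tested query:

```scala
// Sketch: build one query block per predicate of this Spark partition.
// The DQL shape here is an assumption for illustration.
def partitionQuery(predicates: Seq[String]): String = {
  val blocks = predicates.zipWithIndex.map { case (pred, i) =>
    s"""  pred$i (func: has(<$pred>)) {
       |    uid
       |    <$pred>
       |  }""".stripMargin
  }
  s"{\n${blocks.mkString("\n")}\n}"
}

// e.g. partitionQuery(Seq("name", "starring")) yields a query with two blocks
```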
Spark partitions are immutable. Can we, for the lifetime of a Spark partition that is mapped to a DGraph partition, guarantee we see an immutable graph / a snapshot?
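What I would hope for is something like the following sketch, which runs all reads of one Spark partition inside a single read-only transaction. My assumption is that Dgraph's MVCC pins a transaction to its start timestamp, so every query issued through it would see the same snapshot; the predicate names and pagination values are placeholders:

```scala
import io.dgraph.{DgraphClient, DgraphGrpc}
import io.grpc.ManagedChannelBuilder

// Sketch: one read-only transaction per Spark partition. Assumption: all
// queries in the transaction see the snapshot at its start timestamp.
def readPartitionSnapshot(host: String, port: Int): Unit = {
  val channel = ManagedChannelBuilder.forAddress(host, port).usePlaintext().build()
  val client = new DgraphClient(DgraphGrpc.newStub(channel))
  val txn = client.newReadOnlyTransaction()
  try {
    // predicate "name" and the page size are placeholders
    val page1 = txn.query("{ q(func: has(name), first: 1000) { uid name } }")
    val page2 = txn.query("{ q(func: has(name), first: 1000, after: 0x3e8) { uid name } }")
    // if the assumption holds, page1 and page2 come from the same snapshot
  } finally {
    txn.discard()
    channel.shutdown()
  }
}
```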
In use-case II), a Spark partition no longer maps to a DGraph partition, because the data potentially comes from multiple DGraph partitions. How could the result set be split into Spark partitions? With pagination? That would require knowing the exact number of results in the first place. Alternatively, the result `uid` space could be split into partition ranges. Are range queries like `0x1 <= uid < 0x9` supported and efficient on an arbitrary GraphQL query? Can I get the space of possible `uid`s / the min and max `uid` existing in the DGraph?
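In case uid inequality ranges are not supported, one alternative sketch of deterministic chunking uses `first` / `after` pagination, under my assumption (my reading of the docs, needs confirmation) that these paginate in uid order:

```scala
// Sketch: partition i reads the next `chunk` uids following `afterUid`.
// Assumption: `first`/`after` paginate deterministically in uid order.
def chunkQuery(predicate: String, afterUid: Long, chunk: Int): String =
  s"""{
     |  chunk (func: has(<$predicate>), first: $chunk, after: ${"0x%x".format(afterUid)}) {
     |    uid
     |    expand(_all_)
     |  }
     |}""".stripMargin
```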
Finally, given that a single Spark partition is ideally in the order of 1 million rows / predicates / triples, a streaming JSON result from gRPC would be desirable. Using the example of the official Java client, `client.newReadOnlyTransaction().query(query)` loads the entire response JSON into memory and parses it into a `JsonObject` or something alike before I can provide it to the iterative Spark API. Does gRPC support a JSON stream? Do you know if the Java client supports that stream of results if gRPC does?
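Even if the query RPC stays unary, at least the response bytes could be parsed incrementally instead of being materialized as a full object tree. A minimal sketch with Gson's streaming `JsonReader` over the protobuf `ByteString`; the response shape (a single `"q"` array of flat string/number properties) is an assumption for illustration:

```scala
import java.io.InputStreamReader
import java.nio.charset.StandardCharsets.UTF_8
import com.google.gson.stream.JsonReader
import io.dgraph.DgraphProto.Response

// Sketch: stream-parse the response JSON to feed Spark's iterator API
// without building a JsonObject. Assumes a shape like {"q":[{...},{...}]}
// with flat string/number properties; nested objects are not handled here.
def rows(response: Response): Iterator[Map[String, String]] = {
  val reader = new JsonReader(new InputStreamReader(response.getJson.newInput(), UTF_8))
  reader.beginObject()          // {
  reader.nextName()             //   "q":
  reader.beginArray()           //   [
  new Iterator[Map[String, String]] {
    def hasNext: Boolean = reader.hasNext  // false at the closing ]
    def next(): Map[String, String] = {
      reader.beginObject()
      val fields = scala.collection.mutable.Map[String, String]()
      while (reader.hasNext) fields += reader.nextName() -> reader.nextString()
      reader.endObject()
      fields.toMap
    }
  }
}
```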