Hey there. Please bear with me - I’m an economist who is a graph enthusiast, not an engineer - but I was wondering if anyone more knowledgeable or skilled than myself might have some thoughts, feedback or critique about the idea of Dgraph serving as a gateway for data federation.
Problem Statement
There are many cases where related data live in multiple data stores that have been kept separate for any number of reasons (e.g. a slow migration to Dgraph as the primary data store, legacy systems, tightly controlled access to sensitive or protected data). It would be pretty nice if you could simply query across them like:
{
  dgraphResource {
    randomStringPredicate
    ...
    someExternalResource {
      externalPredicates...
      predicateRefToDgraphResource {
        otherDgraphPredicates
      }
    }
  }
}
Possible Approach
My understanding is that Dgraph currently shards by predicate and, when querying, uses gRPC to tell each shard which operation is expected of it as the subgraph is traversed. What if these “external”/non-Dgraph predicate references could be routed to a specific cluster/node that receives this gRPC request and returns a conforming gRPC response, leaving it to whoever implements the logic behind the scenes to decide how that response is generated?
Edit: After some more source spelunking, here are some additional entry points for this implementation:
- ProcessGraph (query/query.go) calls createTaskQuery to generate a protobuf for the query subgraph.
- The query protobuf is then passed through the worker’s ProcessTaskOverNetwork (worker/task.go) method.
- That method calls the worker’s groups to see which gid (group id) is serving the tablet for the attribute key on the query protobuf by polling the Zero server’s ShouldServe method.
- If the attribute isn’t being served by any tablets (which it isn’t because the whole idea is external attributes), it’s going to yield a gid of 0 which will throw an errNonExistentTablet error and an empty Result.
- If a gid could be received, it would be passed through to processWithBackupRequest that would use the gid to reference two server addresses to process the request.
So, a solution might involve registering a Tablet with the Zero server that houses all external attributes as defined in the schema, creating a Group (with gid) to serve that Tablet which routes to other backend sources, and otherwise letting Dgraph handle the query/result RPCs as it normally would.
Feedback?
It’s probably a harebrained thought, but I’d love to hear your feedback (even if it’s negative).