RFC: Is a mutation request really distributed?

Request for Comments

Summary

In Dgraph, queries are executed in a distributed fashion at the predicate level. If you split a query into multiple blocks, execution gets a further boost because the blocks also run concurrently, which reduces latency. For mutations, however, this does not seem to be the case.

Apparently, mutations are not fully distributed in the first stage of the request; in the later stages the work is distributed and each part is sent to its respective group. See "How mutations are processed and committed in dgraph" for further context. It seems to me that the doMutate step, which deserializes the RDF (or JSON) into the objects to be mutated, consumes a lot of resources on the instance that received the mutation request.

The main indication is the increase in memory consumption on that instance. If the user sends all mutations to a single instance instead of distributing them manually, that instance quickly starts consuming resources (RAM in particular) while the others stay relatively idle. This is an observed fact, not just an indication.

Another indication comes from the behavior of LiveLoad. When it is given the URLs of every instance in the cluster, it tries to spread its requests evenly across them. The spread is not perfect, as I have observed in Grafana and GKE dashboard data, but it shows that it is possible to balance mutations and so avoid OOM and increased latency.

Motivation

My intention is to find out whether this really happens at this stage, and to propose a way of balancing N-Quads: the atomic data (triple, N-Quad) would be sent in a balanced way to all available instances in the cluster and only then deserialized, taking into account the blank node context and the UIDs assigned to it.
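To make the blank node constraint concrete, here is a minimal sketch of how N-Quads could be grouped before being spread over instances. It is not how Dgraph does it; it only assumes that blank node labels are scoped to a single mutation request, so lines that share a label (directly or through another line) have to travel together. The function names `group_by_blank_nodes` and `assign_groups` are made up for this example, and the regex is a simplification that may over-match inside string literals, which only makes groups larger, never incorrect.

```python
import re

# Simplified blank node label pattern (assumption: labels use only these characters).
BLANK_NODE = re.compile(r"_:[A-Za-z0-9_\-]+")


def group_by_blank_nodes(nquads):
    """Group N-Quad lines so that lines sharing a blank node stay together."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every line to each blank node label it mentions.
    for i, line in enumerate(nquads):
        find(("line", i))
        for label in BLANK_NODE.findall(line):
            union(("line", i), ("bn", label))

    # Collect lines by the root of their connected component.
    groups = {}
    for i, line in enumerate(nquads):
        groups.setdefault(find(("line", i)), []).append(line)
    return list(groups.values())


def assign_groups(groups, instance_count):
    """Greedy balance: largest group first, always into the emptiest bucket."""
    buckets = [[] for _ in range(instance_count)]
    for group in sorted(groups, key=len, reverse=True):
        min(buckets, key=len).extend(group)
    return buckets
```

Each bucket can then be sent to a different instance as its own request. Lines whose subjects and objects are already UIDs end up in single-line groups and can be placed anywhere.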

Workaround

Basically, balance mutations at the application level (backend) using an algorithm of your choice.
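As one possible "algorithm of your choice", a plain round-robin over the instances is probably the simplest starting point. The sketch below only builds the plan (which batch goes to which instance); actually sending each batch is left to whatever client the backend already uses. The host names are placeholders, and the sketch assumes that subjects are already UIDs or that no blank node is shared between batches.

```python
from itertools import cycle, islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def plan_round_robin(nquads, endpoints, batch_size):
    """Pair each batch of N-Quads with the next endpoint in rotation."""
    targets = cycle(endpoints)
    return [(next(targets), batch) for batch in chunked(nquads, batch_size)]


# Hypothetical instance addresses, for illustration only.
ENDPOINTS = ["alpha1:9080", "alpha2:9080", "alpha3:9080"]

if __name__ == "__main__":
    lines = ['<0x%x> <name> "node %d" .' % (i, i) for i in range(1, 8)]
    for endpoint, batch in plan_round_robin(lines, ENDPOINTS, batch_size=2):
        print(endpoint, len(batch))
```

Round-robin ignores how heavy each batch is, so a weighted or least-busy strategy may suit uneven workloads better.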

If you want to guarantee lower latencies, you can also set up a preferred-node scheme: reserve some nodes for queries and others only for mutations, and combine this with manual balancing.

Ideally you would balance by predicate, but sending small batches is still viable. If you have an EE license, you can use learner nodes for read-only traffic.
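A rough sketch of what balancing by predicate could look like at the application level follows. It routes every N-Quad with the same predicate to the same instance using a stable hash. This is an assumption-heavy illustration: the hash knows nothing about which group actually serves a predicate, so a real mapping would have to come from the cluster's own predicate-to-group assignment, which this sketch does not query. It also assumes subjects and objects are already UIDs, because splitting by predicate would separate lines that share a blank node. The function names and the naive whitespace parsing are made up for the example.

```python
import hashlib
from collections import defaultdict


def predicate_of(nquad):
    """Return the predicate of a simple 'subject <predicate> object .' line."""
    parts = nquad.split(None, 2)
    if len(parts) < 3:
        raise ValueError("not a complete N-Quad: %r" % nquad)
    return parts[1].strip("<>")


def route_by_predicate(nquads, endpoints):
    """Map each endpoint to the lines whose predicate hashes to it."""
    plan = defaultdict(list)
    for line in nquads:
        # hashlib gives the same result across runs, unlike the built-in hash().
        digest = hashlib.sha256(predicate_of(line).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(endpoints)
        plan[endpoints[index]].append(line)
    return dict(plan)
```

With few predicates, or with one predicate that dominates the data, this distribution will be uneven, which is where falling back to small batches remains useful.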

cc: @gajanan, @sudhish, @akon
