Many small mutations vs. one large one: best usage patterns


Let’s say my app consumes data from an external API which returns up to around 5k entities per request.

The general structure of an entity in the API response looks like this:

```json
{
  "uuid": "1234",
  "parent_uuid": "4321",
  "name": "SomeName"
}
```

Saving those in Dgraph calls for an upsert.

It is trivial to write an upsert mutation for a single entity, e.g.:

```
upsert {
  query {
    t as var(func: eq(xid, $uuid)) { uid }
    p as var(func: eq(xid, $parent_uuid)) { uid }
  }

  mutation {
    set {
      uid(t) <name> "SomeName" .
    }
  }

  mutation @if(gt(len(p), 0)) {
    set {
      uid(p) <link> uid(t) .
    }
  }
}
```

Saving 5k of those entities would require running this block 5k times.
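A minimal Python sketch of this per-entity approach: filling the upsert template above for each entity via string substitution (real code should use parameterized variables or escape values; the predicate names follow the example entity):

```python
# Template mirroring the single-entity upsert above; doubled braces
# escape literal { } for str.format().
UPSERT_TEMPLATE = """upsert {{
  query {{
    t as var(func: eq(xid, "{uuid}")) {{ uid }}
    p as var(func: eq(xid, "{parent_uuid}")) {{ uid }}
  }}

  mutation {{
    set {{
      uid(t) <name> "{name}" .
    }}
  }}

  mutation @if(gt(len(p), 0)) {{
    set {{
      uid(p) <link> uid(t) .
    }}
  }}
}}"""


def render_upsert(entity):
    """Fill the template for one entity -- saving 5k entities this way
    means sending 5k of these requests."""
    return UPSERT_TEMPLATE.format(**entity)
```
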

The other way (and that’s how I do it right now) is to first run a query fetching all existing target and parent nodes, then generate one huge N-Quad mutation, replacing blank nodes with the matched uids where necessary and adding RDF triples for the links.
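A sketch of this generate-one-big-mutation approach in Python (the field and predicate names follow the example entity above; the `existing_uids` map is assumed to come from the prior lookup query):

```python
def build_batch_nquads(entities, existing_uids):
    """Build one N-Quad set mutation covering a whole batch.

    entities      -- list of dicts with "uuid", "parent_uuid", "name"
    existing_uids -- {xid: uid} map fetched by a prior query; entities
                     not present get a blank node so Dgraph creates them
    """
    def node(xid):
        # Use the known uid if the node exists, else a blank node.
        uid = existing_uids.get(xid)
        return f"<{uid}>" if uid else f"_:{xid}"

    lines = []
    for e in entities:
        subj = node(e["uuid"])
        lines.append(f'{subj} <xid> "{e["uuid"]}" .')
        lines.append(f'{subj} <name> "{e["name"]}" .')
        parent = e.get("parent_uuid")
        if parent:
            lines.append(f'{node(parent)} <link> {subj} .')
    return "\n".join(lines)
```

The whole result is then sent as the `set` portion of a single mutation, so 5k entities become one request instead of 5k.
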

Which approach is better in terms of the Dgraph way of doing things? Which is more performant?

Does Dgraph optimize 5k runs of the upsert above if they happen under a single transaction?

I have found that batching in the many-thousands is quite performant; I use batches of 1000 in my ingestion pipeline. It is a little awkward to generate a bunch of unique variables in the query portion and use them below in the N-Quads, but once that is solved, it's fine to have a knob for arbitrary batch sizes that you can tune for your use case.

By batching, do you mean code-generating the query portion of the request with thousands of variables and sending that as a single transaction to Dgraph?

Yea, I currently have 1000 variables being resolved in the query{…} section and use them all in the set{…} section below. I hash the unique id of each entity to make its variable name, and prefix the hash, since a variable cannot start with a number.
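A hedged sketch of that hashed-variable scheme (the function names are illustrative, and the predicate names follow the example entity; a real pipeline would also need blank-node handling for entities the query doesn't match):

```python
import hashlib


def var_name(xid):
    """Derive a valid DQL variable from an id: hash it and prefix a
    letter, since a variable cannot start with a number."""
    return "v" + hashlib.md5(xid.encode()).hexdigest()


def build_batch_upsert(entities):
    """Generate one upsert whose query{} resolves a variable per entity
    and whose set{} uses them all below."""
    query_lines, set_lines = [], []
    for e in entities:
        v = var_name(e["uuid"])
        query_lines.append(f'{v} as var(func: eq(xid, "{e["uuid"]}"))')
        set_lines.append(f'uid({v}) <name> "{e["name"]}" .')
    query = "  query {\n    " + "\n    ".join(query_lines) + "\n  }"
    mutation = ("  mutation {\n    set {\n      "
                + "\n      ".join(set_lines) + "\n    }\n  }")
    return "upsert {\n" + query + "\n" + mutation + "\n}"
```
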


Thanks, will try that!