Bulk Upsert in Live Loader

We have received requests for upsert support in the live loader from multiple users. This would allow people to run the live loader again with the same data without it creating new nodes. This could be done with the -x argument, but that would require users to store the XidMap.
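For context, the existing workaround relies on persisting the blank-node-to-UID map to disk so a later run can reuse it. A sketch of that flow (file paths are illustrative):

```shell
# First run: store the blank-node -> UID mapping on disk in ./xidmap
dgraph live -f data.rdf -x ./xidmap

# Re-running with the same data and the same xidmap directory reuses the
# stored UIDs instead of creating new nodes.
dgraph live -f data.rdf -x ./xidmap
```

The upsert proposals below aim to remove the need to keep that directory around between runs.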

Alternate solutions:

  1. Each mutation is an upsert mutation. PR
  • Pros: Faster
  • Cons: Alpha crashes around 8 million RDFs.
(pprof) top
Showing nodes accounting for 1516.21MB, 97.42% of 1556.36MB total
Dropped 138 nodes (cum <= 7.78MB)
Showing top 10 nodes out of 102
      flat  flat%   sum%        cum   cum%
  599.84MB 38.54% 38.54%   599.84MB 38.54%  github.com/DataDog/zstd.Decompress
  384.79MB 24.72% 63.26%   384.79MB 24.72%  github.com/dgraph-io/ristretto.newCmRow
  214.39MB 13.77% 77.04%   214.39MB 13.77%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
  166.41MB 10.69% 87.73%   166.41MB 10.69%  github.com/dgraph-io/badger/v2/skl.newArena
   67.36MB  4.33% 92.06%   110.32MB  7.09%  github.com/dgraph-io/badger/v2/table.OpenTable
(pprof) top
Showing nodes accounting for 204.01GB, 47.87% of 426.18GB total
Dropped 706 nodes (cum <= 2.13GB)
Showing top 10 nodes out of 229
      flat  flat%   sum%        cum   cum%
   48.51GB 11.38% 11.38%    53.71GB 12.60%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
   23.70GB  5.56% 16.94%    23.70GB  5.56%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
   23.51GB  5.52% 22.46%    23.51GB  5.52%  encoding/json.(*decodeState).literalStore
   22.88GB  5.37% 27.83%    22.88GB  5.37%  go.opencensus.io/trace.(*Span).interfaceArrayToAnnotationArray
   21.73GB  5.10% 32.93%    21.73GB  5.10%  go.opencensus.io/trace.copyAttributes
   20.31GB  4.77% 37.70%    20.31GB  4.77%  github.com/dgraph-io/dgraph/lex.(*Lexer).Emit
   12.35GB  2.90% 40.59%    17.61GB  4.13%  github.com/dgraph-io/badger/v2/table.(*Table).blockOffsets
  2. The blank node acts as an xid. Whenever we generate an NQuad, we get the blank node's UID from the XidMap. We can intercept at this point: instead of leasing another UID from Zero, we ask Dgraph by doing an upsert mutation first. PR
  • Pros: Fewer upserts required, leading to lower memory usage
  • Cons: Slower, as we would have to query for each request separately. (26 minutes for a 21-million-RDF dataset)
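The interception in option 2 can be sketched as follows. This is a hypothetical helper, not code from the PR: for each blank node it builds an upsert block that looks the node up under an `xid` predicate (the predicate name is an assumption) and creates it only if it does not already exist, rather than leasing a fresh UID from Zero.

```go
package main

import "fmt"

// buildUpsert sketches how the live loader could resolve a blank node to a
// stable UID stored in Dgraph itself. It returns the query and mutation
// halves of an upsert block: the query binds any existing node carrying
// this blank node's id under the "xid" predicate, and the mutation writes
// the triple against that binding (creating the node if the query matched
// nothing).
func buildUpsert(blankNode, pred, obj string) (query, mutation string) {
	query = fmt.Sprintf(`{ u as var(func: eq(xid, %q)) }`, blankNode)
	mutation = fmt.Sprintf("uid(u) <xid> %q .\nuid(u) <%s> %q .",
		blankNode, pred, obj)
	return query, mutation
}

func main() {
	q, m := buildUpsert("alice", "name", "Alice")
	fmt.Println(q)
	fmt.Println(m)
}
```

Running the loader a second time with the same data would then match the stored xid instead of minting a new UID, which is what makes the re-run idempotent.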

What is the reason for the crash? OOM? On the live loader instance or the Dgraph nodes?

Have you tested with ludicrous mode?

I have questions about the procedure. How does it work? Reading the test file, it feels simple, but I am not sure how the user would do it on their end.

Do I have to give an upsert query in the RDF body? (unlikely, based on the code)
Does Dgraph analyze/infer the RDF and generate the upsert query? How?
Or does the user just need to give any RDF and set the upsert flag?

On the Dgraph instance. Our heap data is quite small, so either we have a lot of data mmapped into RAM, or we allocate and deallocate a lot.

Upserts don’t work with ludicrous mode.

That’s it. That’s how the user would use it. All we are supporting is that the blank node would be saved in a predicate. So instead of saving the blank-node-to-UID mapping on disk, we store it in Dgraph. The user just needs to give any RDF and set the upsert flag.
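A usage sketch, under the assumption that the flag ends up taking the predicate name (the flag spelling `--upsertPredicate` and the predicate `xid` are illustrative, not confirmed by the PR):

```shell
# The RDF needs no special upsert syntax; blank nodes act as external ids.
cat > data.rdf <<'EOF'
_:alice <name> "Alice" .
_:bob   <name> "Bob" .
EOF

# First run creates the nodes and records each blank node under the
# upsert predicate inside Dgraph. A second identical run matches those
# stored ids instead of creating new nodes -- no on-disk XidMap needed.
dgraph live -f data.rdf --upsertPredicate xid
```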


This would also help in debugging import scripts where the script maps a database to NQuads with blank nodes. After the import it is almost impossible to map blank nodes back to UIDs to figure out where in the generated NQuads the bug lies, which would lead to finding the bug in the db->NQuad mapping script.


That was also what I thought, because it is transaction-based, but it works.

So, now --store_xids will be useful, got it.