Bulk Upsert in Live Loader

We have received requests for upsert support in the live loader from multiple users. This would allow people to run the live loader again with the same data without it creating new nodes. This could be done with the -x argument, but that would require users to store the XidMap.
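For context, the existing workaround relies on persisting the blank-node-to-UID map to disk so a later run can reuse it. A sketch of that flow (file paths are illustrative):

```shell
# First run: store the blank-node -> UID mapping on disk in ./xidmap
dgraph live -f data.rdf -x ./xidmap

# Re-running with the same data and the same xidmap directory reuses the
# stored UIDs instead of creating new nodes.
dgraph live -f data.rdf -x ./xidmap
```

The upsert proposals below aim to remove the need to keep that directory around between runs.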

Alternate solutions:

  1. Each mutation is an upsert mutation. PR
  • Pros: Faster
  • Cons: Alpha crashes around 8 million RDFs.
(pprof) top
Showing nodes accounting for 1516.21MB, 97.42% of 1556.36MB total
Dropped 138 nodes (cum <= 7.78MB)
Showing top 10 nodes out of 102
      flat  flat%   sum%        cum   cum%
  599.84MB 38.54% 38.54%   599.84MB 38.54%  github.com/DataDog/zstd.Decompress
  384.79MB 24.72% 63.26%   384.79MB 24.72%  github.com/dgraph-io/ristretto.newCmRow
  214.39MB 13.77% 77.04%   214.39MB 13.77%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
  166.41MB 10.69% 87.73%   166.41MB 10.69%  github.com/dgraph-io/badger/v2/skl.newArena
   67.36MB  4.33% 92.06%   110.32MB  7.09%  github.com/dgraph-io/badger/v2/table.OpenTable
(pprof) top
Showing nodes accounting for 204.01GB, 47.87% of 426.18GB total
Dropped 706 nodes (cum <= 2.13GB)
Showing top 10 nodes out of 229
      flat  flat%   sum%        cum   cum%
   48.51GB 11.38% 11.38%    53.71GB 12.60%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
   23.70GB  5.56% 16.94%    23.70GB  5.56%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
   23.51GB  5.52% 22.46%    23.51GB  5.52%  encoding/json.(*decodeState).literalStore
   22.88GB  5.37% 27.83%    22.88GB  5.37%  go.opencensus.io/trace.(*Span).interfaceArrayToAnnotationArray
   21.73GB  5.10% 32.93%    21.73GB  5.10%  go.opencensus.io/trace.copyAttributes
   20.31GB  4.77% 37.70%    20.31GB  4.77%  github.com/dgraph-io/dgraph/lex.(*Lexer).Emit
   12.35GB  2.90% 40.59%    17.61GB  4.13%  github.com/dgraph-io/badger/v2/table.(*Table).blockOffsets
  2. The blank node acts as an xid. Whenever we generate an NQuad, we get the blank node's UID from the XidMap. We can intercept at this point: instead of leasing another UID from Zero, we ask Dgraph by doing an upsert mutation first. PR
  • Pros: Fewer upserts required, leading to lower memory usage
  • Cons: Slower, as we would have to query for each request separately. (26 minutes for a 21-million-RDF dataset)
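The interception in option 2 can be sketched as follows. This is a hypothetical helper, not code from the PR: for each blank node it builds an upsert block that looks the node up under an `xid` predicate (the predicate name is an assumption) and creates it only if it does not already exist, rather than leasing a fresh UID from Zero.

```go
package main

import "fmt"

// buildUpsert sketches how the live loader could resolve a blank node to a
// stable UID stored in Dgraph itself. It returns the query and mutation
// halves of an upsert block: the query binds any existing node carrying
// this blank node's id under the "xid" predicate, and the mutation writes
// the triple against that binding (creating the node if the query matched
// nothing).
func buildUpsert(blankNode, pred, obj string) (query, mutation string) {
	query = fmt.Sprintf(`{ u as var(func: eq(xid, %q)) }`, blankNode)
	mutation = fmt.Sprintf("uid(u) <xid> %q .\nuid(u) <%s> %q .",
		blankNode, pred, obj)
	return query, mutation
}

func main() {
	q, m := buildUpsert("alice", "name", "Alice")
	fmt.Println(q)
	fmt.Println(m)
}
```

Running the loader a second time with the same data would then match the stored xid instead of minting a new UID, which is what makes the re-run idempotent.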

What is the reason for the crash? OOM? On the live loader instance or the Dgraph nodes?

Have you tested with ludicrous mode?

I have questions about the procedure. How does it work? Reading the test file, it feels simple, but I am not sure how the user would do it on their end.

Do I have to give an upsert query in the RDF body? (unlikely, based on the code)
Does Dgraph analyze/infer the RDF and generate the upsert query? How?
Or does the user just need to give any RDF and set the upsert flag?

On the Dgraph instance. Our heap data is quite small, so either we have a lot of data mmapped into RAM, or we allocate and deallocate a lot.

Upserts don’t work with ludicrous mode.

That’s it. That’s how the user would use it. All we are supporting is that the blank node would be saved in a predicate. So instead of saving the blank-node-to-UID mapping on disk, we store it in Dgraph. The user just needs to give any RDF and set the upsert flag.
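A usage sketch, under the assumption that the flag ends up taking the predicate name (the flag spelling `--upsertPredicate` and the predicate `xid` are illustrative, not confirmed by the PR):

```shell
# The RDF needs no special upsert syntax; blank nodes act as external ids.
cat > data.rdf <<'EOF'
_:alice <name> "Alice" .
_:bob   <name> "Bob" .
EOF

# First run creates the nodes and records each blank node under the
# upsert predicate inside Dgraph. A second identical run matches those
# stored ids instead of creating new nodes -- no on-disk XidMap needed.
dgraph live -f data.rdf --upsertPredicate xid
```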


This would also help in debugging import scripts where the script maps a database to NQuads with blank nodes. After the import it is almost impossible to map blank nodes back to UIDs to figure out where in the generated NQuads the bug lies, which would lead to finding the bug in the db->NQuad mapping script.


That was also what I thought, because it is transaction-based, but it works.

So, now --store_xids will be useful, got it.