We have received a request for upserts in the live loader from multiple users. This would allow people to run the live loader again with the same data, without it creating new nodes. This could be done with the -x argument, but that would require users to store the XidMap.
Alternative solutions:
- Each mutation is an upsert mutation (see the first sketch after this list). PR
  - Pros: Faster.
  - Cons: Alpha crashes at around 8 million RDFs. Heap profiles from the Alpha are below:
    ```
    (pprof) top
    Showing nodes accounting for 1516.21MB, 97.42% of 1556.36MB total
    Dropped 138 nodes (cum <= 7.78MB)
    Showing top 10 nodes out of 102
    flat flat% sum% cum cum%
    599.84MB 38.54% 38.54% 599.84MB 38.54% github.com/DataDog/zstd.Decompress
    384.79MB 24.72% 63.26% 384.79MB 24.72% github.com/dgraph-io/ristretto.newCmRow
    214.39MB 13.77% 77.04% 214.39MB 13.77% github.com/dgraph-io/ristretto/z.(*Bloom).Size
    166.41MB 10.69% 87.73% 166.41MB 10.69% github.com/dgraph-io/badger/v2/skl.newArena
    67.36MB 4.33% 92.06% 110.32MB 7.09% github.com/dgraph-io/badger/v2/table.OpenTable

    (pprof) top
    Showing nodes accounting for 204.01GB, 47.87% of 426.18GB total
    Dropped 706 nodes (cum <= 2.13GB)
    Showing top 10 nodes out of 229
    flat flat% sum% cum cum%
    48.51GB 11.38% 11.38% 53.71GB 12.60% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
    23.70GB 5.56% 16.94% 23.70GB 5.56% github.com/dgraph-io/ristretto/z.(*Bloom).Size
    23.51GB 5.52% 22.46% 23.51GB 5.52% encoding/json.(*decodeState).literalStore
    22.88GB 5.37% 27.83% 22.88GB 5.37% go.opencensus.io/trace.(*Span).interfaceArrayToAnnotationArray
    21.73GB 5.10% 32.93% 21.73GB 5.10% go.opencensus.io/trace.copyAttributes
    20.31GB 4.77% 37.70% 20.31GB 4.77% github.com/dgraph-io/dgraph/lex.(*Lexer).Emit
    12.35GB 2.90% 40.59% 17.61GB 4.13% github.com/dgraph-io/badger/v2/table.(*Table).blockOffsets
    ```
- The blank node acts as an xid. Whenever we generate an NQuad, we look up the blank node's UID in the XidMap, so we can intercept at this point: instead of leasing another UID from Zero, we ask Dgraph for one by doing an upsert mutation first (see the second sketch after this list). PR
  - Pros: Fewer upserts required, leading to lower memory usage.
  - Cons: Slower, as we would have to query for each request separately (26 minutes for the 21 million RDF dataset).
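
For reference, this is roughly what the per-mutation upsert in the first approach could look like via the Go client (dgo v2 assumed). This is a minimal sketch, not the live loader code: the `xid` predicate, its index, and the hard-coded blank-node label are assumptions for illustration.

```go
package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Upsert block: find any node already carrying this xid and write the
	// N-Quads against uid(u), so re-running the loader does not create a
	// duplicate node. Assumes `xid: string @index(exact) .` in the schema.
	req := &api.Request{
		Query: `query { u as var(func: eq(xid, "_:alice")) }`,
		Mutations: []*api.Mutation{{
			SetNquads: []byte(`uid(u) <xid> "_:alice" .
uid(u) <name> "Alice" .`),
		}},
		CommitNow: true,
	}
	if _, err := dg.NewTxn().Do(context.Background(), req); err != nil {
		log.Fatal(err)
	}
}
```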
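
The second approach's interception point could look something like the sketch below: on an XidMap miss, ask Dgraph for the UID with an upsert instead of leasing a fresh one from Zero. `uidForXid`, the plain map cache, and the `xid` predicate are hypothetical stand-ins for the loader's xidmap package, and how the newly assigned UID is reported in `resp.Uids` should be verified against the server version.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
)

// uidForXid resolves a blank-node label (used as an xid) to a UID, creating
// the node in Dgraph if it does not exist yet, and caches the result.
func uidForXid(ctx context.Context, dg *dgo.Dgraph, cache map[string]string, xid string) (string, error) {
	if uid, ok := cache[xid]; ok {
		return uid, nil
	}
	req := &api.Request{
		// The query block returns the existing UID (if any); the mutation
		// writes the xid onto uid(u), which allocates a new UID when u is empty.
		Query: fmt.Sprintf(`query { node(func: eq(xid, %q)) { u as uid } }`, xid),
		Mutations: []*api.Mutation{{
			SetNquads: []byte(fmt.Sprintf("uid(u) <xid> %q .", xid)),
		}},
		CommitNow: true,
	}
	resp, err := dg.NewTxn().Do(ctx, req)
	if err != nil {
		return "", err
	}
	// Existing node: its UID comes back in the query JSON.
	var out struct {
		Node []struct {
			UID string `json:"uid"`
		} `json:"node"`
	}
	if err := json.Unmarshal(resp.Json, &out); err != nil {
		return "", err
	}
	var uid string
	if len(out.Node) > 0 {
		uid = out.Node[0].UID
	} else {
		// New node: take whatever UID Dgraph assigned for uid(u) from the Uids map.
		for _, v := range resp.Uids {
			uid = v
		}
	}
	cache[xid] = uid
	return uid, nil
}
```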