Huge index building memory growth

When creating indices that produce a large amount of index data, memory usage grows roughly linearly with the index size and becomes nearly unmanageable even on 64GiB systems. I have not dug into the code yet to see why it follows this pattern, but I wanted to put this out there in case the Dgraph devs have any insight.

The amount of data on disk is not actually a crazy amount - here are some log lines from the beginning of the index rebuild work:

Rebuilding index for attr XXXX.name and tokenizers [trigram exact]
Rebuilding index for predicate 0-XXXX.name (1/2): Streaming about 12 GiB of uncompressed data (3.7 GiB on disk)

So you can see it is 12GiB uncompressed. However, the resulting index is roughly 170GiB (according to the logs), and it is taking us over 32GiB of RAM at the moment to process this single operation. I was wondering if we could do this in a way that streams the result to disk instead of holding all of this in memory - I had really assumed it did that already.
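For context, this is roughly the constant-memory pattern I assumed was happening: hand every index key/value to a Badger WriteBatch and let it flush to disk as its buffer fills. This is just a minimal sketch of that assumption - the channel, names, and flow here are mine, not Dgraph's actual rebuild code:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

// writeIndexEntries is a sketch of the pattern I expected: each index
// key/value goes into a WriteBatch, which buffers a bounded amount of data
// and flushes it to disk as it fills, so memory should stay roughly flat
// regardless of how large the final index ends up being.
func writeIndexEntries(db *badger.DB, entries <-chan [2][]byte) error {
	wb := db.NewWriteBatch()
	defer wb.Cancel() // safe to call even after a successful Flush

	for e := range entries {
		// Set queues the write; the batch commits and rotates its underlying
		// txn once the internal buffer gets too big.
		if err := wb.Set(e[0], e[1]); err != nil {
			return err
		}
	}
	return wb.Flush()
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/index-sketch"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	entries := make(chan [2][]byte, 16)
	go func() {
		defer close(entries)
		entries <- [2][]byte{[]byte("some-index-key"), []byte("some-posting")}
	}()
	if err := writeIndexEntries(db, entries); err != nil {
		log.Fatal(err)
	}
}
```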

Here is the memory monitoring for that group vs. the other groups:


Note the big spike at the end - that is after the index is completely built - it is finalizing the index somehow and then boom, it gets OOM-killed (that is the drop). These are 32GiB machines (and you can see the other nodes normally use 3-4GiB), and I need to switch to 64GiB machines just to build the trigram index for this one predicate.

Can we get a pattern here with constant memory usage? This linear growth will not fly once it starts taking 64GiB just to do this.

What version of Dgraph are you using?

Dgraph Version
Dgraph version   : v21.03.2
Dgraph codename  : rocket-2
Dgraph SHA-256   : 00a53ef6d874e376d5a53740341be9b822ef1721a4980e6e2fcb60986b3abfbf
Commit SHA-1     : b17395d33
Commit timestamp : 2021-08-26 01:11:38 -0700
Branch           : HEAD
Go version       : go1.16.2
jemalloc enabled : true

I moved my cluster onto 64GiB nodes so I could profile this reindexing. After 2h46m, the stream into the temporary BadgerDB finished with the log messages below:

Rebuilding index for predicate 0-XXXX.name (1/2): [02h46m10s] Scan (1): ~11.5 GiB/12 GiB at 936 KiB/sec. Sent: 170.0 GiB at 16 MiB/sec. jemalloc: 2.9 GiB
Rebuilding index for predicate 0-XXXX.name (1/2): Sent data of size 170 GiB

That second log statement is the one printed just after the write batch is flushed, when the stream from the temporary BadgerDB is about to start flowing back into the main Badger instance.
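To make the observations below easier to follow, here is my rough mental model of that flow. The helper names are made up and this is not the actual rebuilder code:

```go
package rebuildsketch

import badger "github.com/dgraph-io/badger/v3"

// Hypothetical stand-ins for the two streaming passes; not real Dgraph code.
func streamPredicateToTemp(wb *badger.WriteBatch) error { return nil }
func streamTempToMain(tmpDB, mainDB *badger.DB) error   { return nil }

// rebuildSketch is my mental model of rebuilder.Run around that log line.
// Stage 1 writes index entries into a temporary Badger instance through a
// WriteBatch; the "Sent data of size ..." message is logged right after the
// flush, and then stage 2 streams the temp DB back into the main instance
// while wb stays in scope the whole time.
func rebuildSketch(tmpDB, mainDB *badger.DB) error {
	wb := tmpDB.NewWriteBatch()
	defer wb.Cancel()

	if err := streamPredicateToTemp(wb); err != nil { // stage 1
		return err
	}
	if err := wb.Flush(); err != nil {
		return err
	}
	// "Rebuilding index ... Sent data of size 170 GiB" is printed here.

	return streamTempToMain(tmpDB, mainDB) // stage 2
}
```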

Note that the in-use memory here climbs linearly during stage 1, then shoots up during stage 2 (this is scoped to one of my alphas doing the reindex):

Observations:

  • Nearly all of the memory in stage 1 (streaming out to the temp DB) is held by badger.txn.modify() (here).
    • That function allocates into two maps held on the txn: one for conflict detection and one for pending writes.
    • The temporary Badger instance probably does not need conflict detection enabled at all, which would save one map write per key.
    • The pending-writes map does need to be written to, but it should only hold memory for the lifetime of the txn object, and badger.WriteBatch should be rotating those txns out automatically.
  • When stage 2 is reached (streaming from the temp DB back into the real instance), the memory from badger.txn.modify() is never released for the entire duration of stage 2. That part makes sense, since the WriteBatch object is still in scope inside rebuilder.Run(), but it does not explain why the memory is being held in the first place (see the bullet above about the txn maps and the WriteBatch rotating them). Simply setting the WriteBatch to nil would probably let Go reclaim a good amount of memory during stage 2 (a rough sketch of both suggestions follows this list).
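A rough sketch of what both suggestions could look like, reusing the hypothetical streamPredicateToTemp and streamTempToMain stand-ins from the sketch above - this is an illustration of the idea, not a patch against the actual rebuilder code:

```go
package rebuildsketch

import badger "github.com/dgraph-io/badger/v3"

// rebuildSketchLowMem applies the two suggestions above to the earlier
// sketch. Still hypothetical.
func rebuildSketchLowMem(tmpDir string, mainDB *badger.DB) error {
	// Suggestion 1: the temporary instance only ever has this one writer,
	// so conflict detection (and its per-key conflict map) can likely be
	// disabled when opening it.
	tmpDB, err := badger.Open(badger.DefaultOptions(tmpDir).
		WithDetectConflicts(false))
	if err != nil {
		return err
	}
	defer tmpDB.Close()

	wb := tmpDB.NewWriteBatch()
	if err := streamPredicateToTemp(wb); err != nil { // stage 1
		wb.Cancel()
		return err
	}
	if err := wb.Flush(); err != nil {
		return err
	}

	// Suggestion 2: drop the only reference to the write batch (and the txn
	// maps hanging off it) before stage 2 starts, so the GC can reclaim that
	// memory instead of holding it for the whole second stream.
	wb = nil

	return streamTempToMain(tmpDB, mainDB) // stage 2
}
```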

memory profile collected in stage 1 (332.6 KB)
memory profile collected in stage 2 (343.5 KB)

If you want the full output from the dgraph debuginfo tool, I have that too, but I cannot upload it to Discourse directly, so let me know if you want it.
