Consistent Increase in memory usage for zero leader

What version of Dgraph are you using?


Have you tried reproducing the issue with the latest release?


What is the hardware spec (RAM, OS)?

alpha pods: 12 × (14 GB RAM, 6 CPU)
zero pods: 3 × (5 GB RAM, 4 CPU)
shard replica count: 3

Steps to reproduce the issue.

There are two scenarios when this was observed:

  1. Trigger a surge of mutations (insert/upsert/delete)

  2. Over a period of time, memory usage keeps increasing and eventually almost fills the available memory (no OOM kill)

Expected behaviour and actual result.

Memory usage of the zero leader should recover automatically.

Memory usage increases steadily and eventually slows down queries and mutations. The increase accelerates when we raise the mutation rate, and eventually the alpha pods also start crashing.

What you wanted to do

Let the zero process recover automatically (or another zero process should have taken over leadership).

What you actually did

Rolling restart of zero statefulset to recover cluster stability.

After the restart the zero process uses far less memory (~5 GB → ~500 MB). There seems to be a memory leak causing this issue.

This seems to be a duplicate of

hey @killerknv would you mind sharing your memory profile?

@MichelDiz It does seem to be a memory leak. That issue mentions a memory leak in the alpha process, however we are observing memory issues with zero.

I will be capturing the memory profile next time this issue occurs.

Current profile for the zero leader:

File: dgraph
Build ID: 9e4f0ee8831675148f4e65856b61a661f45ae84e
Type: inuse_space
Time: Sep 25, 2020 at 11:42pm (+0630)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 100
Showing nodes accounting for 648.37MB, 95.78% of 676.96MB total
Dropped 150 nodes (cum <= 3.38MB)
      flat  flat%   sum%        cum   cum%
  496.51MB 73.34% 73.34%   496.51MB 73.34%  (*Oracle).commit
   83.20MB 12.29% 85.63%    83.20MB 12.29%  (inline)
   19.68MB  2.91% 88.54%    19.68MB  2.91%  (*Txn).modify
      13MB  1.92% 90.46%       13MB  1.92%  (*TxnContext).Unmarshal
       7MB  1.03% 91.50%        7MB  1.03%  (inline)
    6.84MB  1.01% 92.51%     6.84MB  1.01%  (*MembershipState).Marshal
    4.98MB  0.74% 93.24%     4.98MB  0.74%  reflect.mapassign
    4.78MB  0.71% 93.95%     6.06MB   0.9%  (*WaterMark).process.func1
    4.50MB  0.66% 94.61%     4.50MB  0.66%  (*DB).newTransaction
    3.82MB  0.56% 95.18%     3.82MB  0.56%  bytes.makeSlice
    3.08MB  0.45% 95.63%     4.58MB  0.68%  (*TableIndex).Unmarshal
    0.98MB  0.14% 95.78%     9.98MB  1.47%

@MichelDiz Just went through the whole discussion; the memory profile looks pretty similar for zero. We will try using smaller mutations to avoid running into this issue. Let me know if I should check anything else.
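Splitting large mutation payloads client-side is straightforward: chunk the records and send each chunk as its own smaller mutation/transaction. A generic sketch — the helper name and batch size are illustrative, not part of any Dgraph client API:

```go
package main

import "fmt"

// splitBatches splits items into chunks of at most n, so each chunk can be
// submitted as its own (smaller) mutation instead of one large one.
func splitBatches(items []string, n int) [][]string {
	if n <= 0 {
		return nil
	}
	var out [][]string
	for len(items) > n {
		out = append(out, items[:n])
		items = items[n:]
	}
	if len(items) > 0 {
		out = append(out, items)
	}
	return out
}

func main() {
	records := make([]string, 10)
	for i := range records {
		records[i] = fmt.Sprintf("record-%d", i)
	}
	for i, b := range splitBatches(records, 4) {
		fmt.Printf("batch %d: %d records\n", i, len(b)) // batches of 4, 4, 2
	}
}
```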

@ibrahim this commit map should be moved to use a balanced BST with mmap. Let’s assign that asap.
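To illustrate the failure mode (this is a hypothetical sketch, not Dgraph's actual oracle code): if commit timestamps are tracked in an in-memory map and entries are not purged below the applied watermark, the map grows with every transaction and memory only recovers on restart — which matches the behaviour reported above.

```go
package main

import "fmt"

// oracle is a hypothetical stand-in for a transaction oracle that records
// commit timestamps keyed by start timestamp.
type oracle struct {
	commits map[uint64]uint64 // startTs -> commitTs
}

func (o *oracle) commit(startTs, commitTs uint64) {
	o.commits[startTs] = commitTs // grows without bound unless purged
}

// purgeBelow drops entries whose start ts is below the applied watermark.
// Without periodic calls like this (or moving the structure off-heap, e.g.
// to an mmap-backed balanced BST), memory use only recovers on restart.
func (o *oracle) purgeBelow(watermark uint64) {
	for ts := range o.commits {
		if ts < watermark {
			delete(o.commits, ts)
		}
	}
}

func main() {
	o := &oracle{commits: make(map[uint64]uint64)}
	for ts := uint64(1); ts <= 100000; ts++ {
		o.commit(ts, ts+1)
	}
	fmt.Println(len(o.commits)) // 100000 live entries
	o.purgeBelow(100001)
	fmt.Println(len(o.commits)) // 0 — though Go maps never shrink their buckets
}
```

Note the last comment: even after `delete`, a Go map keeps its allocated buckets, which is one reason an off-heap (mmap-backed) structure helps here.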

@ashishgoswami is working on this.

Hi @killerknv, just an update - we’re working on it and I’ll share a binary/branch with you once we have the fix for this ready.