Consistent Increase in memory usage for zero leader

killerknv · September 25, 2020, 1:15pm

What version of Dgraph are you using?

v20.07.0

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

alpha pods: 12 * (14 G, 6 cpu)
zero pods: 3 * (5G, 4 cpu)
shard replica count: 3

Steps to reproduce the issue.

There are two scenarios when this was observed:

Trigger a surge of mutations (insert/upsert/delete)

[Image: Screenshot 2020-09-25 at 6.36.47 PM]
Over time, memory usage keeps increasing until memory is almost full (no OOM)

[Image: Screenshot 2020-09-25 at 6.40.34 PM]

Expected behaviour and actual result.

Memory usage for zero leader should be able to recover automatically.

Memory usage increases steadily and eventually slows down queries and mutations. The growth accelerates when we increase the mutation rate, and eventually the alpha pods also start crashing.

What you wanted to do

Let the zero process recover automatically (or have another zero process become leader).

What you actually did

Rolling restart of the zero StatefulSet to recover cluster stability.

After the restart, the zero process uses much less memory (5 GB → ~500 MB). There seems to be a memory leak causing this issue.

MichelDiz · September 25, 2020, 3:34pm

This seems to be a duplicate of "Seems to be a memory leak".

MichelDiz · September 25, 2020, 4:24pm

hey @killerknv would you mind sharing your memory profile? https://dgraph.io/docs/howto/retrieving-debug-information/#memory-profile

killerknv · September 25, 2020, 5:19pm

@MichelDiz The "Seems to be a memory leak" issue mentions a memory leak in the alpha process; however, we are observing memory issues with zero.

I will be capturing the memory profile next time this issue occurs.

Current profile for the zero leader:

File: dgraph
Build ID: 9e4f0ee8831675148f4e65856b61a661f45ae84e
Type: inuse_space
Time: Sep 25, 2020 at 11:42pm (+0630)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 100
Showing nodes accounting for 648.37MB, 95.78% of 676.96MB total
Dropped 150 nodes (cum <= 3.38MB)
      flat  flat%   sum%        cum   cum%
  496.51MB 73.34% 73.34%   496.51MB 73.34%  github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Oracle).commit
   83.20MB 12.29% 85.63%    83.20MB 12.29%  github.com/dgraph-io/badger/v2/skl.newArena (inline)
   19.68MB  2.91% 88.54%    19.68MB  2.91%  github.com/dgraph-io/badger/v2.(*Txn).modify
      13MB  1.92% 90.46%       13MB  1.92%  github.com/dgraph-io/dgo/v200/protos/api.(*TxnContext).Unmarshal
       7MB  1.03% 91.50%        7MB  1.03%  github.com/golang/snappy.NewReader (inline)
    6.84MB  1.01% 92.51%     6.84MB  1.01%  github.com/dgraph-io/dgraph/protos/pb.(*MembershipState).Marshal
    4.98MB  0.74% 93.24%     4.98MB  0.74%  reflect.mapassign
    4.78MB  0.71% 93.95%     6.06MB   0.9%  github.com/dgraph-io/badger/v2/y.(*WaterMark).process.func1
    4.50MB  0.66% 94.61%     4.50MB  0.66%  github.com/dgraph-io/badger/v2.(*DB).newTransaction
    3.82MB  0.56% 95.18%     3.82MB  0.56%  bytes.makeSlice
    3.08MB  0.45% 95.63%     4.58MB  0.68%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
    0.98MB  0.14% 95.78%     9.98MB  1.47%  github.com/gogo/protobuf/proto.mergeAny

killerknv · September 25, 2020, 5:37pm

@MichelDiz I just went through the whole discussion; the memory profile seems pretty similar for zero. We will try using smaller mutations to avoid running into this issue. Let me know if I should check anything else.

mrjn · September 26, 2020, 10:37pm

@ibrahim this commit map should be moved to use a balanced BST with mmap. Let’s assign that asap.

ibrahim · September 27, 2020, 1:45pm

@ashishgoswami is working on this.

ibrahim · October 13, 2020, 3:37pm

Hi @killerknv, just an update - we’re working on it and I’ll share a binary/branch with you once we have the fix for this ready.
