killerknv
(Kesharee Nandan)
September 25, 2020, 1:15pm
1
What version of Dgraph are you using?
v20.07.0
Have you tried reproducing the issue with the latest release?
Yes
What is the hardware spec (RAM, OS)?
alpha pods: 12 × (14 GB RAM, 6 CPU)
zero pods: 3 × (5 GB RAM, 4 CPU)
shard replica count: 3
Steps to reproduce the issue.
This was observed in two scenarios:
1. Trigger a surge of mutations (insert/upsert/delete); a rough sketch of such a surge is shown below.
2. Over a period of time, memory usage keeps increasing and eventually almost fills up the available memory (no OOM).
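For the first scenario, roughly this kind of concurrent load reproduces the growth. This is only an illustrative sketch using the dgo (v200) Go client; the endpoint, predicate, and volumes here are hypothetical, not our actual load generator:

package main

import (
	"context"
	"fmt"
	"log"
	"sync"

	"github.com/dgraph-io/dgo/v200"
	"github.com/dgraph-io/dgo/v200/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Hypothetical alpha endpoint; no TLS, test cluster only.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Many small, immediately committed mutations fired concurrently
	// to simulate the surge of inserts/upserts/deletes.
	var wg sync.WaitGroup
	for w := 0; w < 50; w++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				nq := fmt.Sprintf(`_:n <name> "w%d-%d" .`, worker, i)
				mu := &api.Mutation{SetNquads: []byte(nq), CommitNow: true}
				if _, err := dg.NewTxn().Mutate(context.Background(), mu); err != nil {
					log.Printf("mutation failed: %v", err)
				}
			}
		}(w)
	}
	wg.Wait()
}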
Expected behaviour and actual result.
Memory usage on the zero leader should recover automatically.
Instead, memory usage increases steadily and eventually slows down queries/mutations. Memory growth accelerates when we increase the mutation rate, and eventually alpha pods also start crashing.
What you wanted to do
We wanted the zero process to recover automatically (or another zero process to take over as leader).
What you actually did
A rolling restart of the zero StatefulSet to recover cluster stability.
After the restart, the zero process uses far less memory (5 GB → ~500 MB). There seems to be a memory leak causing this issue.
MichelDiz
(Michel Diz)
September 25, 2020, 3:34pm
2
This seems to be a duplicate of:
Report a Dgraph Bug
What version of Dgraph are you using?
20.07.1-rc1
Have you tried reproducing the issue with the latest release?
already the latest release
What is the hardware spec (RAM, OS)?
40 GB of RAM on Debian 9
Steps to reproduce the issue (command/config used to run Dgraph).
I have a 3-node cluster on 3 machines (a zero + an alpha on each node), running with docker-compose.
I try to add some data with the GraphQL API, in my case a User model with a friendship relation to itself,
th…
MichelDiz
(Michel Diz)
September 25, 2020, 4:24pm
3
killerknv
(Kesharee Nandan)
September 25, 2020, 5:19pm
4
@MichelDiz Seems to be a memory leak
That issue mentions a memory leak with the alpha process; however, we are observing memory issues with zero.
I will be capturing the memory profile next time this issue occurs.
Current profile for the zero leader:
File: dgraph
Build ID: 9e4f0ee8831675148f4e65856b61a661f45ae84e
Type: inuse_space
Time: Sep 25, 2020 at 11:42pm (+0630)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 100
Showing nodes accounting for 648.37MB, 95.78% of 676.96MB total
Dropped 150 nodes (cum <= 3.38MB)
flat flat% sum% cum cum%
496.51MB 73.34% 73.34% 496.51MB 73.34% github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Oracle).commit
83.20MB 12.29% 85.63% 83.20MB 12.29% github.com/dgraph-io/badger/v2/skl.newArena (inline)
19.68MB 2.91% 88.54% 19.68MB 2.91% github.com/dgraph-io/badger/v2.(*Txn).modify
13MB 1.92% 90.46% 13MB 1.92% github.com/dgraph-io/dgo/v200/protos/api.(*TxnContext).Unmarshal
7MB 1.03% 91.50% 7MB 1.03% github.com/golang/snappy.NewReader (inline)
6.84MB 1.01% 92.51% 6.84MB 1.01% github.com/dgraph-io/dgraph/protos/pb.(*MembershipState).Marshal
4.98MB 0.74% 93.24% 4.98MB 0.74% reflect.mapassign
4.78MB 0.71% 93.95% 6.06MB 0.9% github.com/dgraph-io/badger/v2/y.(*WaterMark).process.func1
4.50MB 0.66% 94.61% 4.50MB 0.66% github.com/dgraph-io/badger/v2.(*DB).newTransaction
3.82MB 0.56% 95.18% 3.82MB 0.56% bytes.makeSlice
3.08MB 0.45% 95.63% 4.58MB 0.68% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
0.98MB 0.14% 95.78% 9.98MB 1.47% github.com/gogo/protobuf/proto.mergeAny
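For reference, the profile above was taken from the zero leader's pprof heap endpoint. A minimal sketch of how we pull it, assuming zero's default HTTP port 6080 and the standard Go /debug/pprof handlers (ports/paths may differ in your deployment):

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Zero's HTTP address in our setup (hypothetical; adjust per deployment).
	const heapURL = "http://localhost:6080/debug/pprof/heap"

	resp, err := http.Get(heapURL)
	if err != nil {
		log.Fatalf("fetching heap profile: %v", err)
	}
	defer resp.Body.Close()

	// Save the raw profile; inspect later with: go tool pprof dgraph zero_heap.pb.gz
	out, err := os.Create("zero_heap.pb.gz")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing profile: %v", err)
	}
}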
killerknv
(Kesharee Nandan)
September 25, 2020, 5:37pm
6
@MichelDiz I just went through the whole discussion; the memory profile seems pretty similar for zero. We will try using smaller mutations to avoid running into this issue. Let me know if I should check anything else.
mrjn
(Manish R Jain)
September 26, 2020, 10:37pm
7
@ibrahim this commit map should be moved to use a balanced BST with mmap. Let’s assign that asap.
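To illustrate the direction (this is not Dgraph's actual implementation): moving the Oracle's startTs → commitTs entries into an mmap-backed arena keeps them off the Go heap, so the GC neither scans nor retains them, and a balanced BST would then index entries inside such an arena. A minimal off-heap arena sketch in Go (anonymous mmap, fixed 16-byte entries, Linux/macOS only, sizes hypothetical):

package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"syscall"
)

// Each entry is 16 bytes: startTs (8 bytes) + commitTs (8 bytes).
const entrySize = 16

// arena holds commit entries in an anonymously mmap'd region,
// outside the Go heap, so the GC never scans or retains them.
type arena struct {
	buf []byte
	off int
}

func newArena(size int) (*arena, error) {
	buf, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return nil, err
	}
	return &arena{buf: buf}, nil
}

// put appends one (startTs, commitTs) pair to the arena.
func (a *arena) put(startTs, commitTs uint64) error {
	if a.off+entrySize > len(a.buf) {
		return errors.New("arena full")
	}
	binary.BigEndian.PutUint64(a.buf[a.off:], startTs)
	binary.BigEndian.PutUint64(a.buf[a.off+8:], commitTs)
	a.off += entrySize
	return nil
}

// release unmaps the region, returning all of it to the OS at once.
func (a *arena) release() error {
	err := syscall.Munmap(a.buf)
	a.buf, a.off = nil, 0
	return err
}

func main() {
	a, err := newArena(1 << 20) // 1 MiB arena; size is hypothetical
	if err != nil {
		panic(err)
	}
	defer a.release()

	_ = a.put(100, 105)
	_ = a.put(101, 108)
	fmt.Println("entries stored off-heap:", a.off/entrySize)
}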
ibrahim
(Ibrahim Jarif)
September 27, 2020, 1:45pm
8
@ashishgoswami is working on this.
ibrahim
(Ibrahim Jarif)
October 13, 2020, 3:37pm
12
Hi @killerknv, just an update: we're working on it, and I'll share a binary/branch with you once we have the fix for this ready.