What version of Dgraph are you using?
Have you tried reproducing the issue with the latest release?
What is the hardware spec (RAM, OS)?
alpha pods: 12 * (14 GB RAM, 6 CPU)
zero pods: 3 * (5 GB RAM, 4 CPU)
shard replica count: 3
Steps to reproduce the issue.
This was observed in two scenarios:
1. Trigger a surge of mutations (inserts/upserts/deletes).
2. Over a longer period of time, memory usage keeps increasing until it almost fills the available RAM (no OOM kill).
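For scenario 1, a minimal sketch of how such a mutation surge can be generated, using Dgraph's standard HTTP mutate endpoint on an Alpha (`/mutate?commitNow=true`, default port 8080). The Alpha address, batch size, and predicate names are illustrative, not taken from the affected cluster:

```python
import urllib.request

ALPHA = "http://localhost:8080"  # adjust for your cluster

def build_batch(start: int, size: int) -> str:
    """Build one RDF set-mutation body containing `size` triples."""
    triples = "\n".join(
        f'_:u{i} <name> "user-{i}" .' for i in range(start, start + size)
    )
    return "{ set { %s } }" % triples

def send_batch(body: str) -> None:
    """POST one mutation batch to the Alpha over HTTP."""
    req = urllib.request.Request(
        f"{ALPHA}/mutate?commitNow=true",
        data=body.encode(),
        headers={"Content-Type": "application/rdf"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises if the Alpha is unreachable

if __name__ == "__main__":
    # Surge: many batches in quick succession, during which the
    # Zero leader's memory usage was observed to climb.
    for n in range(1000):
        send_batch(build_batch(n * 100, 100))
```

Upserts and deletes can be surged the same way by swapping the request body; the Zero-leader memory growth was observed for all mutation types.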
Expected behaviour and actual result.
Memory usage of the Zero leader should recover automatically once the mutation surge ends.
Instead, memory usage increases steadily and eventually slows down queries and mutations. Memory growth accelerates when we increase the mutation rate, and eventually the Alpha pods also start crashing.
What you wanted to do
Let the Zero process recover automatically (or have another Zero process become leader).
What you actually did
Performed a rolling restart of the Zero StatefulSet to restore cluster stability.
After the restart, the Zero process uses far less memory (~5 GB -> ~500 MB). This points to a memory leak as the cause of the issue.
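Before restarting, capturing a heap profile helps attribute the leak to a specific allocation site; Dgraph is written in Go and Zero exposes the standard Go pprof endpoints on its HTTP port (6080 by default). A sketch, assuming the StatefulSet is named dgraph-zero (adjust both names for your deployment):

```shell
#!/bin/sh
# Sketch: diagnose the suspected Zero memory leak, then recover.
# Assumption: Zero's HTTP port is the default 6080.
ZERO_HTTP="${ZERO_HTTP:-http://localhost:6080}"

# 1. Capture a heap profile from the Zero leader *before* restarting,
#    so the leak can be inspected offline:
#      curl -s "$ZERO_HTTP/debug/pprof/heap" -o zero.heap
#      go tool pprof -top zero.heap

# 2. Rolling restart of the Zero StatefulSet to recover stability
#    (hypothetical StatefulSet name):
#      kubectl rollout restart statefulset/dgraph-zero

echo "heap profile endpoint: $ZERO_HTTP/debug/pprof/heap"
```

Attaching the `pprof -top` output to this report would make it much easier to pinpoint where the Zero leader's memory is being retained.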