What version of Dgraph are you using?
Have you tried reproducing the issue with the latest release?
What is the hardware spec (RAM, OS)?
alpha pods: 12 * (14G, 6 cpu)
zero pods: 3 * (5G, 4 cpu)
shard replica count: 3
Steps to reproduce the issue (command/config used to run Dgraph).
Trigger a StatefulSet update.
Expected behaviour and actual result.
Expected: the Alpha server should shut down well before terminationGracePeriodSeconds elapses.
E0923 11:31:31.956404 18 run.go:395] GRPC listener canceled: accept tcp [::]:9080: use of closed network connection
E0923 11:31:31.956460 18 run.go:414] Stopped taking more http(s) requests. Err: accept tcp [::]:8080: use of closed network connection
I0923 11:31:31.956559 18 run.go:719] GRPC and HTTP stopped.
I0923 11:31:31.956567 18 worker.go:113] Stopping group...
I0923 11:31:31.956593 18 worker.go:117] Updating RAFT state before shutting down...
E0923 11:31:31.956597 18 groups.go:812] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
E0923 11:31:31.956729 18 groups.go:913] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0923 11:31:31.959073 18 worker.go:122] Stopping node...
I0923 11:31:31.959095 18 draft.go:951] Stopping node.Run
I0923 11:31:31.959113 18 draft.go:187] Stopped all ongoing registered tasks.
I0923 11:31:31.959128 18 log.go:34] b [term 8] starts to transfer leadership to c
I0923 11:31:31.959129 18 draft.go:115] Operation completed with id: opRollup
I0923 11:31:32.959236 18 draft.go:1020] Raft node done.
I0923 11:31:32.959257 18 worker.go:125] Stopping raftwal store...
I0923 11:31:32.959265 18 worker.go:128] Stopping worker server...
Actual: the worker shutdown is triggered within a few seconds, but the process then hangs and is only terminated once terminationGracePeriodSeconds (set to 600 s) expires.
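To put a number on the gap, here is a rough sketch that measures how long the logged part of the shutdown took, using the first and last log lines from this report. The helper name `glog_time` is made up for illustration, and the parsing assumes the usual glog prefix (severity and date, then a time with microseconds) with both lines from the same day.

```python
from datetime import datetime

GRACE_PERIOD_S = 600  # terminationGracePeriodSeconds from this report


def glog_time(line: str) -> datetime:
    """Parse the time-of-day field of a glog line (hypothetical helper).

    Assumed layout: <severity><MMDD> <HH:MM:SS.micros> <thread> <file:line>] msg
    Only the time field is parsed, so both lines must be from the same day.
    """
    time_field = line.split()[1]
    return datetime.strptime(time_field, "%H:%M:%S.%f")


first = "E0923 11:31:31.956404 18 run.go:395] GRPC listener canceled: accept tcp [::]:9080: use of closed network connection"
last = "I0923 11:31:32.959265 18 worker.go:128] Stopping worker server..."

# Time covered by the shutdown messages that were actually logged.
logged_shutdown_s = (glog_time(last) - glog_time(first)).total_seconds()

# Whatever remains of the grace period is spent with no further log output.
silent_s = GRACE_PERIOD_S - logged_shutdown_s

print(f"logged shutdown steps: {logged_shutdown_s:.3f} s")
print(f"time until forced termination: about {silent_s:.0f} s")
```

The logged steps cover about one second, so nearly the whole 600 s grace period passes after "Stopping worker server..." with nothing further logged.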
What you wanted to do
Faster rollouts and recovery period.
Why that wasn’t great, with examples
We have 12 pods, and updating each pod takes 10 minutes, so a full cluster restart takes about 120 minutes.
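The arithmetic behind that figure, as a small sketch. It assumes pods are replaced one at a time and ignores startup and readiness time, so the real rollout is somewhat longer. The 5 s clean-stop figure is a hypothetical value chosen only for comparison, not a measurement.

```python
def rollout_minutes(pods: int, seconds_per_pod: float) -> float:
    """Total shutdown time for a one-pod-at-a-time rolling update, in minutes."""
    return pods * seconds_per_pod / 60


ALPHA_PODS = 12
GRACE_PERIOD_S = 600           # each pod currently waits out the full grace period
HYPOTHETICAL_CLEAN_STOP_S = 5  # assumed value if the process exited promptly

current = rollout_minutes(ALPHA_PODS, GRACE_PERIOD_S)
hoped_for = rollout_minutes(ALPHA_PODS, HYPOTHETICAL_CLEAN_STOP_S)

print(f"current rollout: {current:.0f} min")
print(f"with prompt shutdown: {hoped_for:.0f} min")
```

Under these assumptions, almost all of the 120 minutes is spent waiting for the grace period to expire rather than doing shutdown work.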