[k8s/helm] Alpha pods terminate only after terminationGracePeriodSeconds

What version of Dgraph are you using?

v20.07.0

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

alpha pods: 12 * (14 G, 6 cpu)
zero pods: 3 * (5G, 4 cpu)
shard replica count: 3

Steps to reproduce the issue (command/config used to run Dgraph).

Trigger a StatefulSet update.

Expected behaviour and actual result.

The Alpha server should stop before terminationGracePeriodSeconds elapses.

E0923 11:31:31.956404      18 run.go:395] GRPC listener canceled: accept tcp [::]:9080: use of closed network connection
E0923 11:31:31.956460      18 run.go:414] Stopped taking more http(s) requests. Err: accept tcp [::]:8080: use of closed network connection
I0923 11:31:31.956559      18 run.go:719] GRPC and HTTP stopped.
I0923 11:31:31.956567      18 worker.go:113] Stopping group...
I0923 11:31:31.956593      18 worker.go:117] Updating RAFT state before shutting down...
E0923 11:31:31.956597      18 groups.go:812] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
E0923 11:31:31.956729      18 groups.go:913] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0923 11:31:31.959073      18 worker.go:122] Stopping node...
I0923 11:31:31.959095      18 draft.go:951] Stopping node.Run
I0923 11:31:31.959113      18 draft.go:187] Stopped all ongoing registered tasks.
I0923 11:31:31.959128      18 log.go:34] b [term 8] starts to transfer leadership to c
I0923 11:31:31.959129      18 draft.go:115] Operation completed with id: opRollup
I0923 11:31:32.959236      18 draft.go:1020] Raft node done.
I0923 11:31:32.959257      18 worker.go:125] Stopping raftwal store...
I0923 11:31:32.959265      18 worker.go:128] Stopping worker server...

The worker stop is triggered within a few seconds, but the process stays stuck and is only terminated once terminationGracePeriodSeconds (set to 600 s) expires.

What you wanted to do

Faster rollouts and recovery period.

Why that wasn’t great, with examples

We have 12 pods, and updating one pod takes 10 minutes, so a full cluster restart takes 120 minutes overall.
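
To make the arithmetic explicit, here is a rough sketch of the estimate. It assumes pods are restarted strictly one at a time and ignores startup and readiness time. The function name and the 10-second "clean exit" figure are made up for illustration; only the 12 pods and the 600 s grace period come from this report.

```python
def rollout_minutes(pods: int, per_pod_seconds: float) -> float:
    """Estimate a one-pod-at-a-time rolling restart, in minutes.

    Ignores pod startup and readiness time, so this is a lower bound.
    """
    return pods * per_pod_seconds / 60


# Numbers from this report: 12 Alpha pods, each held for the full 600 s grace period.
stuck = rollout_minutes(pods=12, per_pod_seconds=600)

# Hypothetical comparison: each Alpha exits 10 s after SIGTERM (assumed figure).
clean = rollout_minutes(pods=12, per_pod_seconds=10)

print(stuck, clean)
```

With these inputs, the stuck case comes to 120 minutes. That matches what we observe and shows that termination time dominates the rollout.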

Are you using the default Helm values apart from the upsizing? Could you share the Helm chart values that differ from the defaults (with secrets redacted)? Also, what steps and commands did you run after deployment?

This is fixed in master. There were some deadlocks preventing the Alpha from getting stopped gracefully.

@mrjn Is this fix also available in v20.07.1 (https://github.com/dgraph-io/dgraph/releases/tag/v20.07.1)? This seems to be the relevant PR: https://github.com/dgraph-io/dgraph/pull/6402