[k8s/helm] Alpha pods terminate only after terminationGracePeriodSeconds

killerknv · September 23, 2020, 1:10pm

What version of Dgraph are you using?

v20.07.0

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

alpha pods: 12 * (14 G, 6 cpu)
zero pods: 3 * (5G, 4 cpu)
shard replica count: 3

Steps to reproduce the issue (command/config used to run Dgraph).

Trigger statefulset update

Expected behaviour and actual result.

Alpha server should stop before terminationGracePeriodSeconds.

E0923 11:31:31.956404      18 run.go:395] GRPC listener canceled: accept tcp [::]:9080: use of closed network connection
E0923 11:31:31.956460      18 run.go:414] Stopped taking more http(s) requests. Err: accept tcp [::]:8080: use of closed network connection
I0923 11:31:31.956559      18 run.go:719] GRPC and HTTP stopped.
I0923 11:31:31.956567      18 worker.go:113] Stopping group...
I0923 11:31:31.956593      18 worker.go:117] Updating RAFT state before shutting down...
E0923 11:31:31.956597      18 groups.go:812] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
E0923 11:31:31.956729      18 groups.go:913] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0923 11:31:31.959073      18 worker.go:122] Stopping node...
I0923 11:31:31.959095      18 draft.go:951] Stopping node.Run
I0923 11:31:31.959113      18 draft.go:187] Stopped all ongoing registered tasks.
I0923 11:31:31.959128      18 log.go:34] b [term 8] starts to transfer leadership to c
I0923 11:31:31.959129      18 draft.go:115] Operation completed with id: opRollup
I0923 11:31:32.959236      18 draft.go:1020] Raft node done.
I0923 11:31:32.959257      18 worker.go:125] Stopping raftwal store...
I0923 11:31:32.959265      18 worker.go:128] Stopping worker server...

The worker stop is triggered within a few seconds, but the process stays stuck and is only terminated once terminationGracePeriodSeconds (set to 600 s) expires.
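To put a number on "within a few seconds", here is a minimal Python sketch that reads the timestamps from the first and last log lines quoted above and computes the time between them. The regular expression assumes the glog-style prefix seen in this log (level letter, MMDD, time with microseconds, thread id, file:line); the helper names `parse_line` and `elapsed_seconds` are made up for this illustration and are not part of Dgraph.

```python
import re
from datetime import datetime

# Assumed glog-style prefix: <level><MMDD> <HH:MM:SS.uuuuuu> <thread> <file:line>] <message>
LINE_RE = re.compile(
    r"^([IWEF])(\d{4} \d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ (\S+)\] (.*)$"
)

def parse_line(line):
    """Return (level, timestamp, source, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    level, stamp, source, message = m.groups()
    # The log carries no year, so strptime falls back to its default year.
    return level, datetime.strptime(stamp, "%m%d %H:%M:%S.%f"), source, message

def elapsed_seconds(lines):
    """Seconds between the first and last parseable log lines."""
    stamps = [p[1] for p in map(parse_line, lines) if p is not None]
    return (stamps[-1] - stamps[0]).total_seconds()

# First and last lines of the shutdown log from this report.
SAMPLE = [
    "E0923 11:31:31.956404      18 run.go:395] GRPC listener canceled: "
    "accept tcp [::]:9080: use of closed network connection",
    "I0923 11:31:32.959265      18 worker.go:128] Stopping worker server...",
]

print(f"{elapsed_seconds(SAMPLE):.3f} s")  # prints "1.003 s"
```

So the logged part of the shutdown completes in about one second; the rest of the 600 s grace period passes with no further log output.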

What you wanted to do

Faster rollouts and recovery period.

Why that wasn’t great, with examples

We have 12 pods, and each pod takes 10 min to update, so a full cluster restart takes 120 min.
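The arithmetic behind that figure, as a small Python sketch. It assumes pods are replaced one at a time and counts only the time spent waiting for each pod to terminate (startup and readiness time are ignored). The 30 s value in the second call is a purely hypothetical comparison, not a measured number.

```python
def rollout_minutes(pods, seconds_per_pod):
    """Total wait for a one-pod-at-a-time rollout, in minutes."""
    return pods * seconds_per_pod / 60

# Observed: every alpha pod waits out the full 600 s grace period.
print(rollout_minutes(12, 600))  # 120.0

# Hypothetical: if each pod exited within 30 s of the termination signal.
print(rollout_minutes(12, 30))   # 6.0
```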

joaquin · September 23, 2020, 9:30pm

Are you using the default Helm values apart from the increased resource sizes? Could you share the Helm chart values that differ from the defaults (with secrets redacted)? Also, which steps or commands did you run after deployment?

mrjn · September 23, 2020, 10:06pm

This is fixed in master. There were some deadlocks preventing the Alpha from getting stopped gracefully.

killerknv · September 24, 2020, 9:13am

@mrjn Is this fix also available in Dgraph v20.07.1 (Savvy Shuri-1)? The PR "release/v20.07 - Fix(Alpha): MASA: Make Alpha Shutdown Again (#6313)" (#6402 in dgraph-io/dgraph, by jarifibrahim) seems to be the relevant one.
