We are running into a recurring problem with dgraph on GKE when using the dgraph HA deployment.
If one of the alpha pods is taken offline unexpectedly (for example because the underlying node fails), that alpha pod is never able to rejoin the cluster via Raft, even after k8s brings the pod back up.
On top of that, the health check reports healthy even though the alpha cannot serve requests in this state. The k8s service therefore keeps routing traffic to this pod, and roughly 1/3 of requests time out.
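To illustrate the mismatch, here is a minimal diagnostic sketch (not part of our deployment) that hits the affected alpha's HTTP port directly. It assumes the usual setup where alpha serves HTTP on 8080, that `/health` is effectively what the probe relies on, and that a trivial `/query` POST is a fair stand-in for real traffic; the query content type differs between Dgraph versions, so adjust as needed.

```python
import urllib.request

ALPHA = "http://localhost:8080"  # port-forward or service address of the affected alpha

def check_health() -> str:
    # GET /health -- effectively what the k8s probe sees
    with urllib.request.urlopen(f"{ALPHA}/health", timeout=5) as resp:
        return resp.read().decode()

def try_query() -> str:
    # A trivial DQL query as a stand-in for real traffic.
    # Content type is "application/dql" on recent versions,
    # "application/graphql+-" on older ones.
    req = urllib.request.Request(
        f"{ALPHA}/query",
        data=b"{ q(func: has(dgraph.type), first: 1) { uid } }",
        headers={"Content-Type": "application/dql"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    print("health:", check_health())   # reports healthy
    try:
        print("query:", try_query())
    except OSError as e:               # ...while real queries hang or fail
        print("query failed:", e)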
To replicate:
- Start a dgraph deployment using the Dgraph HA yaml from the deployment page.
- Deliberately remove one of the nodes that is hosting a dgraph alpha pod.
- K8s will attempt to reschedule the dgraph alpha pod onto one of the still-working nodes, but the alpha never reconnects; it keeps complaining about an unhealthy connection (see the sketch after this list).
- The health check will continue to report healthy even though the pod has not been able to rejoin Raft and cannot actually handle requests.
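To confirm the Raft side of the repro, you can watch zero's `/state` endpoint while this happens. This is a hedged sketch: it assumes zero's HTTP port (6080) is reachable, e.g. via `kubectl port-forward`, and the member field names (`addr`, `lastUpdate`) are taken from the `/state` payload of the version we run and may differ on yours. If the restarted alpha really never rejoins its group, its entry should stop updating while the other replicas keep advancing.

```python
import json
import time
import urllib.request

ZERO = "http://localhost:6080"  # zero's HTTP port, e.g. via kubectl port-forward

def group_members() -> dict:
    # Pull zero's /state and map each alpha's address to its lastUpdate timestamp
    with urllib.request.urlopen(f"{ZERO}/state", timeout=5) as resp:
        state = json.loads(resp.read().decode())
    members = {}
    for group in state.get("groups", {}).values():
        for member in group.get("members", {}).values():
            members[member.get("addr")] = member.get("lastUpdate")
    return members

if __name__ == "__main__":
    # Poll a few times: a member that never rejoins its raft group should show
    # a stale lastUpdate while the other replicas keep moving forward
    for _ in range(3):
        print(group_members())
        time.sleep(30)
```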