Dgraph in k8s (GKE) health check issues

We are running into a recurring problem with Dgraph on GKE using the Dgraph HA deployment.

If one of the alpha pods is taken offline unexpectedly, e.g. due to an underlying node failure, then even after k8s reschedules the pod, that alpha is never able to rejoin the cluster via Raft.

On top of that, the health check reports healthy even though the alpha is not capable of serving requests in this scenario. This causes the k8s service to keep routing requests to the broken pod, so with the default three alpha replicas roughly 1/3 of your requests time out.
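You can see this directly by probing the rescheduled pod yourself. A minimal sketch, assuming the defaults from the stock HA manifest (alpha HTTP on port 8080; the pod name is illustrative):

```sh
# Forward the broken alpha's HTTP port locally (pod name is illustrative).
kubectl port-forward pod/dgraph-alpha-1 8080:8080 &

# The per-pod health endpoint still answers 200, even though this alpha
# cannot rejoin its Raft group or serve queries.
curl -i http://localhost:8080/health
```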

To replicate (a command sketch follows the list):

  1. Start a dgraph deployment using the Dgraph HA yaml from the deployment page.
  2. Deliberately remove one of the nodes that is hosting a dgraph alpha pod.
  3. K8s will attempt to redeploy dgraph alpha onto one of the still-working nodes, but dgraph will never reconnect (it will complain indefinitely about an unhealthy connection).
  4. The health check will continue to report healthy even though the pod has not been able to rejoin the Raft group and cannot actually handle requests.
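Concretely, the reproduction looks something like this (node and pod names are illustrative, and the label selector assumes the stock HA manifest; adjust both to your deployment):

```sh
# Find which node is hosting one of the alpha pods.
kubectl get pods -l app=dgraph-alpha -o wide

# Take that node out of service (or delete the underlying VM to simulate
# a hard failure).
kubectl drain gke-mycluster-pool-1-abcd --ignore-daemonsets --delete-emptydir-data

# Watch k8s reschedule the alpha onto a surviving node ...
kubectl get pods -l app=dgraph-alpha -w

# ... and watch the rescheduled alpha loop on unhealthy-connection errors
# instead of rejoining its Raft group.
kubectl logs -f dgraph-alpha-1
```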

Interesting, so the health check does not actually verify that the alpha is a healthy member of the cluster; it merely reports that the alpha process itself is up, irrespective of its participation in the cluster. And this causes the problem, because the HA deployment in k8s uses this healthy status to determine whether or not to include the alpha in the distribution of the request load.

I don’t have much experience with the deployment of HA clusters, but isn’t there some kind of health check somewhere that reflects the cluster state and not just the individual nodes of the cluster? Wouldn’t that cluster-level health indicate which alphas are healthy and actively part of the cluster? In my opinion that is what should be used to distribute requests, not the individual health reports from the alphas.
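For what it’s worth, Dgraph does expose cluster-level views that a smarter readiness check could be built on. A sketch, assuming default ports (alpha HTTP on 8080, Zero HTTP on 6080) and the in-cluster service names from the stock manifest:

```sh
# Per-alpha health: roughly what the current probe checks.
curl -s http://dgraph-alpha-0.dgraph-alpha:8080/health

# Cluster-wide health: the ?all parameter asks the alpha to report on its
# peers as well.
curl -s 'http://dgraph-alpha-0.dgraph-alpha:8080/health?all'

# Zero's membership state: shows which alphas each Raft group actually
# contains, which is much closer to the signal routing should be based on.
curl -s http://dgraph-zero-0.dgraph-zero:6080/state
```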