We are running into a recurring problem with dgraph on GKE when using the dgraph HA deployment.
If one of the alpha pods is taken offline unexpectedly (for example because the underlying node fails), that alpha pod is never able to rejoin the cluster via Raft, even after k8s brings the pod back up.
On top of that, the health check reports healthy even though the alpha cannot serve requests in this state. The k8s service therefore keeps routing traffic to this pod, and roughly 1/3 of requests time out.
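To illustrate the mismatch, here is a minimal diagnostic sketch (not part of our deployment) that hits the affected alpha's HTTP port directly. It assumes the usual setup where alpha serves HTTP on 8080, that `/health` is effectively what the probe relies on, and that a trivial `/query` POST is a fair stand-in for real traffic; the query content type differs between Dgraph versions, so adjust as needed.

```python
import urllib.request

ALPHA = "http://localhost:8080"  # port-forward or service address of the affected alpha

def check_health() -> str:
    # GET /health -- effectively what the k8s probe sees
    with urllib.request.urlopen(f"{ALPHA}/health", timeout=5) as resp:
        return resp.read().decode()

def try_query() -> str:
    # A trivial DQL query as a stand-in for real traffic.
    # Content type is "application/dql" on recent versions,
    # "application/graphql+-" on older ones.
    req = urllib.request.Request(
        f"{ALPHA}/query",
        data=b"{ q(func: has(dgraph.type), first: 1) { uid } }",
        headers={"Content-Type": "application/dql"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    print("health:", check_health())   # reports healthy
    try:
        print("query:", try_query())
    except OSError as e:               # ...while real queries hang or fail
        print("query failed:", e)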
To replicate:
- Start a dgraph deployment using the Dgraph HA yaml from the deployment page.
- Deliberately remove one of the nodes that is hosting a dgraph alpha pod.
- K8s will attempt to reschedule the dgraph alpha pod onto one of the still-working nodes, but the alpha never reconnects; it keeps complaining about an unhealthy connection (see the sketch after this list).
- The health check will continue to report healthy even though the pod has not been able to rejoin Raft and cannot actually handle requests.
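To confirm the Raft side of the repro, you can watch zero's `/state` endpoint while this happens. This is a hedged sketch: it assumes zero's HTTP port (6080) is reachable, e.g. via `kubectl port-forward`, and the member field names (`addr`, `lastUpdate`) are taken from the `/state` payload of the version we run and may differ on yours. If the restarted alpha really never rejoins its group, its entry should stop updating while the other replicas keep advancing.

```python
import json
import time
import urllib.request

ZERO = "http://localhost:6080"  # zero's HTTP port, e.g. via kubectl port-forward

def group_members() -> dict:
    # Pull zero's /state and map each alpha's address to its lastUpdate timestamp
    with urllib.request.urlopen(f"{ZERO}/state", timeout=5) as resp:
        state = json.loads(resp.read().decode())
    members = {}
    for group in state.get("groups", {}).values():
        for member in group.get("members", {}).values():
            members[member.get("addr")] = member.get("lastUpdate")
    return members

if __name__ == "__main__":
    # Poll a few times: a member that never rejoins its raft group should show
    # a stale lastUpdate while the other replicas keep moving forward
    for _ in range(3):
        print(group_members())
        time.sleep(30)
```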