I also work with Jay, so I can share the configuration we’re running Dgraph with, if that helps.
I’m deploying version 0.0.19 of the Helm chart into the dgraph namespace, with the Helm release also named ‘dgraph’, and these changes to values.yaml:
zero.persistence.enabled=true
zero.replicaCount=5
alpha.persistence.enabled=true
alpha.replicaCount=5
alpha.extraFlags="--security whitelist=10.1.0.0:10.1.255.255"
image.tag=v21.12.0
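In case it’s useful, the install is roughly equivalent to the command below (assuming the chart repo is added under the name dgraph; values with embedded spaces may be easier to keep in a values file):

```sh
helm install dgraph dgraph/dgraph --version 0.0.19 \
  --namespace dgraph \
  --set zero.persistence.enabled=true \
  --set zero.replicaCount=5 \
  --set alpha.persistence.enabled=true \
  --set alpha.replicaCount=5 \
  --set alpha.extraFlags="--security whitelist=10.1.0.0:10.1.255.255" \
  --set image.tag=v21.12.0
```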
I verified that dgraph alpha is configured as follows within the StatefulSet, with the hostnames of all five Zero pods:
dgraph alpha --my=$(hostname -f | awk '{gsub(/\.$/,""); print $0}'):7080 --zero dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-1.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-2.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-3.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-4.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080 --security whitelist=10.1.0.0:10.1.255.255
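If anyone wants to reproduce that check, this is roughly how I pulled the rendered startup command out of the StatefulSet; the container index is an assumption about the chart’s layout:

```sh
# Print the alpha container's startup command as rendered by the chart
kubectl get statefulset dgraph-dgraph-alpha -n dgraph \
  -o jsonpath='{.spec.template.spec.containers[0].command}'
```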
From the logs, it looks to me like Alpha isn’t retrying the other Zero nodes soon enough to find out which one became the leader once dgraph-dgraph-zero-4 became unavailable.
Leader election in Zero happens at 20:39:20:
dgraph-dgraph-zero-2 dgraph-dgraph-zero I0805 20:39:20.967599 19 log.go:34] 3 became leader at term 577
The Alphas don’t see this until 20:39:24, and alpha-1 is still handed the address of dgraph-dgraph-zero-4, the node that went down:
dgraph-dgraph-alpha-1 dgraph-dgraph-alpha I0805 20:39:23.204873 19 groups.go:867] Got address of a Zero leader: dgraph-dgraph-zero-4.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080
dgraph-dgraph-alpha-2 dgraph-dgraph-alpha I0805 20:39:24.305603 18 groups.go:867] Got address of a Zero leader: dgraph-dgraph-zero-2.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080
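For completeness, the timeline above was pulled with something like this (pod names assumed from the release naming above):

```sh
# Leader-election events on each Zero
for i in 0 1 2 3 4; do
  kubectl logs -n dgraph dgraph-dgraph-zero-$i | grep "became leader"
done

# When each Alpha learned of a Zero leader
for i in 0 1 2 3 4; do
  kubectl logs -n dgraph dgraph-dgraph-alpha-$i | grep "Got address of a Zero leader"
done
```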