Continuing the discussion from Replicas across availability zones:

I am planning to implement the same thing and want to prevent the entire cluster from going down when an AZ goes down. I am a little lost in the terminology, specifically the seemingly interchangeable use of servers, nodes, alphas and zeros.

In the second diagram from @nickpoorman’s post, if he lost, say, AZ-1, would the cluster go down because there is no longer a majority of alphas to run Group1? Also, when it says the number of replicas is the number of nodes for a group, I assume that refers to the number of alphas, and that the number of zeros does not come into play for a quorum? I also assume that the number of alpha nodes must be a multiple of the number of replicas specified for the failover to work properly?

In --raft='group=…', I am assuming the group refers to the raft id (idx) provided in the zero command, so given a zero command of:

dgraph zero --my=zero0:5080 --replicas 3 --raft="idx=0"

then the alpha command would be

dgraph alpha --my=alpha1:7080 --zero=zero0:5080 --raft='group=0'

From which I infer that I need at least one zero for each group. May I have 2 zeros per group that I could distribute equally across AZs? As in:

AZ-1

dgraph zero --my=zero0:5080 --replicas 3 --raft="idx=0"

AZ-2

dgraph zero --my=zero0:5080 --replicas 3 --raft="idx=0"

So, to keep the cluster up when an AZ goes down, I would need replicas=5 spread across at least 3 AZs (ignoring latency issues for now):

AZ-1 Alphas

Group-1, Group-1, Group-2

Zero0

AZ-2 Alphas

Group-1, Group-1, Group-2, Group-2

Zero1

AZ-3 Alphas

Group-2, Group-2, Group-1

Zero2

That way, losing one AZ would never take out a majority of the Alphas in any group.
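To sanity-check the arithmetic behind that claim, here is a minimal sketch in Python. It is not Dgraph code: the `layout` dict and the `groups_without_quorum` helper are hypothetical names of mine, and it only assumes that a group stays available while more than half of its replicas are reachable (the usual Raft majority rule). Zeros are not modeled.

```python
from collections import Counter

# Alpha placement from the layout above: AZ -> group served by each Alpha there.
layout = {
    "AZ-1": ["Group-1", "Group-1", "Group-2"],
    "AZ-2": ["Group-1", "Group-1", "Group-2", "Group-2"],
    "AZ-3": ["Group-2", "Group-2", "Group-1"],
}

def groups_without_quorum(layout, lost_az):
    """Return the groups that lose their majority when lost_az is unavailable."""
    # Replicas per group across the whole cluster.
    total = Counter(g for groups in layout.values() for g in groups)
    # Replicas per group that are still reachable after the AZ is lost.
    alive = Counter(
        g for az, groups in layout.items() if az != lost_az for g in groups
    )
    # A majority of n replicas is n // 2 + 1 (3 out of 5 here).
    return sorted(g for g, n in total.items() if alive[g] < n // 2 + 1)

for az in layout:
    broken = groups_without_quorum(layout, az)
    print(az, "lost ->", broken or "all groups keep a majority")
```

With this placement, each group has 5 replicas and no AZ holds more than 2 of them, so every line reports that all groups keep a majority. The same check can be pointed at any other placement by editing `layout`.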

Is that correct? And how many zeros should I have and how should I distribute them?