Continuing the discussion from Replicas across availability zones:
I am planning to implement the same thing and want to prevent the entire cluster from going down when an AZ goes down. I am a little lost in the terminology, specifically the seemingly interchangeable use of servers, nodes, alphas and zeros.
In the second diagram from @nickpoorman’s post, if he lost, say, AZ-1, would the cluster go down because there is no longer a majority of alphas to run Group1? Also, when it says the number of replicas is the number of nodes for a group, I assume that refers to the number of Alphas, and that the number of zeros does not come into play for a quorum? I also assume that the number of Alpha nodes must be a multiple of the number of replicas specified for the failover to work properly?
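To make my assumption about quorum concrete, here is a minimal sketch. It assumes each group behaves like a standard Raft group that needs a strict majority of its replicas to stay available; the function names are my own and not part of Dgraph.

```python
def quorum(replicas: int) -> int:
    """Smallest strict majority of a group with `replicas` members."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many members can be lost while a majority still remains."""
    return replicas - quorum(replicas)


for n in (3, 5):
    print(f"replicas={n} quorum={quorum(n)} tolerated={tolerated_failures(n)}")
# prints:
# replicas=3 quorum=2 tolerated=1
# replicas=5 quorum=3 tolerated=2
```

If that assumption holds, a group stays up only while no single AZ holds more members than the tolerated number of failures.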
In raft='group=…', I am assuming that the group refers to the raft id provided in the zero command, so given a zero command of:
dgraph zero --my=zero0:5080 --replicas 3 --raft="idx=0"
then the alpha command would be
dgraph alpha --my=alpha1:7080 --zero=zero0:5080 --raft='group=0'
From which I infer that I need at least one zero for each group. May I have 2 zeros per group that I could distribute equally across AZs? As in:
AZ-1
dgraph zero --my=zero0:5080 --replicas 3 --raft="idx=0"
AZ-2
dgraph zero --my=zero0:5080 --replicas 3 --raft="idx=0"
So, to keep the cluster up when an AZ goes down, I would need replicas=5 spread across at least 3 AZs (ignoring latency issues for now):
AZ-1 Alphas
Group-1, Group-1, Group-2
Zero0
AZ-2 Alphas
Group-1, Group-1, Group-2, Group-2
Zero1
AZ-3 Alphas
Group-2, Group-2, Group-1
Zero2
That way, losing one AZ would never take out a majority of the Alphas in any group.
Is that correct? And how many zeros should I have and how should I distribute them?
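For reference, this is how I am checking the layout above. It is only a sketch of the arithmetic, not anything Dgraph provides: it assumes every group needs a strict majority of its members to stay up, and it treats the three zeros as one more group with the same rule (the label "zero" and the function name are my own).

```python
from collections import Counter

# Members per AZ, copied from the layout above. "zero" is treated as its own
# group that needs a majority (an assumption of this sketch).
layout = {
    "AZ-1": ["group-1", "group-1", "group-2", "zero"],
    "AZ-2": ["group-1", "group-1", "group-2", "group-2", "zero"],
    "AZ-3": ["group-2", "group-2", "group-1", "zero"],
}


def survives_az_loss(layout):
    """For each AZ, report whether every group keeps a majority if it is lost."""
    totals = Counter(m for members in layout.values() for m in members)
    result = {}
    for az, members in layout.items():
        lost = Counter(members)
        result[az] = all(
            totals[g] - lost[g] >= totals[g] // 2 + 1 for g in totals
        )
    return result


print(survives_az_loss(layout))
# prints: {'AZ-1': True, 'AZ-2': True, 'AZ-3': True}
```

Under those assumptions, each group has 5 members with at most 2 in any one AZ, so at least 3 remain after losing an AZ, and 2 of the 3 zeros remain as well.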