Data availability on broken quorum question

Hi! We are considering Dgraph for internal use, and it is pretty crucial for us to have read-only access to the data even when there is no quorum.

Do I understand correctly that there will be no access at all in this case?

Hi Sirkon, welcome!

We have a Read Only flag https://github.com/dgraph-io/dgraph/blob/90b2260cd613773182dc66dc4188a91e1ebf6d7c/dgraph/cmd/debug/run.go#L74

As for the last question, I'll check with the others, or look into the code and test it.

Sorry, my question was unclear. I meant that we may need read-only access even when the cluster is broken and there is no quorum.

Okay, with a single Alpha group I think the quorum issue is inevitable. But if you have several Alpha groups, it's okay to lose groups as long as one or two remain. For this you can use some load balancing (HTTP).

Speaking of a single Alpha group, quorum is important because Dgraph treats the group as a unit. If you lose too many nodes within a group, some predicates may become inaccessible, since predicates are sharded across groups.

So just set up N Alpha groups (read-only or not) behind a load balancer and you'll be okay.

Also see:
https://docs.dgraph.io/deploy/#understanding-dgraph-cluster

Consistent Replication

If the --replicas flag is set to something greater than one, Zero would assign the same group to multiple nodes. These nodes would then form a Raft group aka quorum. Every write would be consistently replicated to the quorum. To achieve consensus, it's important that the size of the quorum be an odd number. Therefore, we recommend setting --replicas to 1, 3 or 5 (not 2 or 4). This allows 0, 1, or 2 nodes serving the same group to be down, respectively, without affecting the overall health of that group.
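As a concrete sketch of that setup (hostnames, ports, and data paths are placeholders, and exact flag spellings may vary across Dgraph versions), a single group with 3 replicas could be brought up like this:

```shell
# Start one Zero, asking for 3 replicas per group (placeholder host/port).
dgraph zero --my=zero1:5080 --replicas 3

# Start three Alphas pointing at the same Zero. Zero assigns all three
# to the same group, and they form one Raft quorum.
dgraph alpha --my=alpha1:7080 --zero=zero1:5080
dgraph alpha --my=alpha2:7081 --zero=zero1:5080 -o 1
dgraph alpha --my=alpha3:7082 --zero=zero1:5080 -o 2
```

With this layout the group keeps accepting writes with one Alpha down, and loses write availability only when two of the three are down.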

  1. I think the docs are a bit confusing with regard to quorum. A quorum is the minimum number of nodes that must be available for writes to go through, which is n/2 + 1 if a group has n servers.
    https://github.com/dgraph-io/dgraph/issues/2171#issuecomment-373280838
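To make that arithmetic concrete, here is a small sketch (plain Python, nothing Dgraph-specific) of the quorum sizes behind the docs' recommendation of 1, 3, or 5 replicas:

```python
def quorum_size(n: int) -> int:
    """Minimum number of nodes that must be up for writes: floor(n/2) + 1."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """How many nodes a group of n replicas can lose and still accept writes."""
    return n - quorum_size(n)

for replicas in (1, 2, 3, 4, 5):
    print(f"replicas={replicas}: quorum={quorum_size(replicas)}, "
          f"tolerates {tolerable_failures(replicas)} failure(s)")
```

Note that 2 replicas tolerate as many failures as 1, and 4 as many as 3, which is why the docs recommend odd sizes.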

Hello @MichelDiz! Thanks for the prompt reply. Could you please clarify some points using this example: say we have a cluster of 7 nodes: 1 Zero node and 6 Alpha nodes split into 2 groups (nodes 1, 2, 3 in group 1; nodes 4, 5, 6 in group 2).

  1. Is it true that half of the total number of keys will be placed in the first group while the other half goes to the second group, so that the groups do not intersect by keys?

  2. If one node within a group goes down, are writes still possible?

  3. If two nodes within a group go down, are reads still possible?

Thank you!

https://docs.dgraph.io/design-concepts/#group
Multiple Alpha groups means the data is sharded across the cluster. Each predicate is wholly owned by a single group. Rebalancing heuristics will try to keep the data sizes among the groups balanced.

If one node goes down, then a majority (two nodes) of the group are still up and writes will still succeed. If and when the downed node comes back up it would receive a snapshot from the leader to catch up to the latest state.

When a majority is down, then transactions are not possible in the system. Transactional writes and reads will not be available, but best-effort queries—which relax the linearizable-read guarantees—would still be available.
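As a hypothetical sketch of such a best-effort read over the HTTP API (the `be=true` best-effort and `ro=true` read-only query parameters and this content type come from newer Dgraph HTTP APIs and may differ in your version; `localhost:8080` is a placeholder for an Alpha's HTTP port):

```shell
# Best-effort, read-only query against any reachable Alpha.
# This relaxes linearizable-read guarantees, so it can still be
# served when the group has lost its quorum.
curl -s "http://localhost:8080/query?be=true&ro=true" \
  -H "Content-Type: application/graphql+-" \
  -d '{ q(func: has(name)) { name } }'
```

The gRPC clients expose the same switch on read-only transactions (e.g. a best-effort flag when creating the transaction).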

@dmai @MichelDiz thank you! This is exactly what we needed: there is read-only access with a broken quorum via best-effort queries!

@dmai thanks for explanation!

@MichelDiz just one last question: you mentioned HTTP balancing above. Do you mean that we need to develop a separate service which will:

  1. Track cluster membership (operational and failed nodes).
  2. Take into account key sharding (distribution of keys between groups).
  3. Provide Dgraph-compatible API to proxy client requests to the nodes that are still alive and belong to the group that contains the key.

Thank you!

That was just a tip; maybe you do not even need it. I figured that if one Alpha group loses quorum (fewer than two available nodes), Dgraph would disable access (read/write) on those Alphas, so depending on your configuration a smart HTTP load balancer could help. But in fact the Dgraph Zero group does this rebalancing work: if all your Alpha groups are registered with the same Zero group, you will certainly be redirected to a group with quorum. This needs to be tested, though; I have never run a huge setup to test this case, but I'll do it to see how it behaves in practice. (Confirmed by Manish below.)

Each group should typically be served by at least 3 servers, if available. In the case of a machine failure, other servers serving the same group can still handle the load in that case.
https://docs.dgraph.io/design-concepts/#replication-and-server-failure

Although, I would (IMHO) recommend a load balancer. I use Traefik in my projects for external access. I find it a healthy approach to send different clients in a balanced way between the Alphas.

I made a small graphical visualization of a cluster, for others to follow.

Cheers.

Dgraph-specific membership is already handled by Dgraph. When using a load balancer it's always a good idea to do health checks to remove unhealthy endpoints from the load balancer. Dgraph provides a /health endpoint that currently does basic health checks (is the process up and a healthy Raft member). You can check transactional health by using something like the dgraph increment tool to test that a particular Alpha can service transactions.
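For example, a load balancer's health probe could be pointed at an Alpha's HTTP port (8080 and 9080 are placeholder ports, and the `dgraph increment` flag names may vary by Dgraph version):

```shell
# Basic liveness / Raft-health check of one Alpha.
curl -s http://localhost:8080/health

# Stronger check: run a single increment transaction against this Alpha,
# which verifies it can actually service transactions end to end.
dgraph increment --alpha localhost:9080
```

A probe like the first line is cheap enough to run every few seconds; the increment check is better suited to a slower, periodic verification.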

These aren’t necessary to configure from a load balancer. Every Alpha knows the predicate–group membership information. When an Alpha receives a query it will internally go out to the corresponding groups for each predicate. From the client’s perspective, any Alpha in the cluster is a valid choice.

As @dmai mentions, you just need to shoot the query off to any Alpha in the cluster. Every Alpha knows the state of the cluster (tracked by Zeros and streamed out to Alphas), so they can redirect queries internally. You could however, put a load balancer in front of the Alphas to spread your query volume across the cluster.

Thanks everyone, that was very helpful!
