Problems with node removal

I use a Dgraph cluster deployed with Helm; the Dgraph version is v20.07.2.

When 1 or 2 nodes are removed, roughly 1 in 5 to 1 in 3 queries return an error. When 3 or 4 nodes are removed, the cluster is almost unavailable.

This is the status and log of the pod:

You are using replica count 3 with 6 Alphas, which means you have two groups. Dgraph needs a quorum, at least 2 of the 3 nodes in each group, to maintain availability, so if you remove two nodes from the same group, that group will fail. By going around removing nodes beyond that, you are killing your cluster.
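A quick way to see how the Alphas are split into groups is Zero's /state endpoint. A minimal sketch, assuming Zero's HTTP port 6080 is reachable from where you run it and that `jq` is installed:

```shell
# Ask Dgraph Zero for cluster state and list each group's members.
# "groups" maps group ID -> members (keyed by Raft ID) -> address.
curl -s localhost:6080/state | jq '.groups | map_values(.members | map_values(.addr))'
```

With replica 3 and 6 Alphas you should see exactly two groups of three members each.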

Also, you are removing the nodes from the cluster setup, which means they won’t recover. Nodes are recoverable only if you don’t manually remove them from the cluster setup.
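For reference, "removing from the cluster setup" means calling Zero's /removeNode endpoint; that is the operation that makes recovery impossible. A sketch, where the group and Raft ID are made-up examples (look up the real ones in /state):

```shell
# Permanently remove the Alpha with Raft ID 3 from group 1 via Zero.
# WARNING: a removed node cannot rejoin under the same ID; a replacement
# must be started with a clean data directory and will get a new ID.
curl -s "localhost:6080/removeNode?group=1&id=3"
```

If you only stop or crash the pod without calling this, the member is still part of the group and can come back.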

Maybe you are doing or saying something else, can you clarify?

Hey @dmai, @pawan, I see something about this here. He’s running repeated queries while removing nodes to demonstrate the problem. This looks like another problem that has already come up internally.

Although he is forcing a removal, it seems that when the node is eliminated, Zero (maybe?) tries to communicate with the lost Alpha and then returns an empty query response.

I think that’s the problem, even though the circumstances are wrong. @Valdanito, you should force a crash instead of removing the node. Still, it shouldn’t happen, I think.

@Valdanito are you using Read Only and Best Effort?

I’m just using the default request method.

I’ve tried removing the /dgraph/doneinit file first and then deleting the pod, but even if only one pod is removed, the cluster becomes unstable.

I don’t want to modify the cluster setup in the helm file. I just want to verify under what conditions the dgraph HA cluster is stable.

Yeah, this looks like a bug. Since two out of three nodes in a group are always available, the queries should return a consistent response. @Valdanito, could you also try this on the v20.11-rc (https://github.com/dgraph-io/dgraph/releases/tag/release%2Fv20.11-rc1)? If it still happens for you, we can look into it.

OK, I’ll try.

I tried starting the cluster with Docker: 2 groups with 3 Alphas in each group. When I stop one Alpha in each group, the cluster still works, as you said. When a third Alpha is deleted, the cluster becomes unstable.
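For anyone reproducing this, a 2-group Docker cluster like that can be brought up with one Zero running with --replicas 3 and six Alphas. A minimal sketch, not the exact setup used above; names and the --lru_mb value are illustrative:

```shell
# One Zero with replication factor 3: six Alphas will form two groups of three.
docker network create dgraph
docker run -d --name zero --network dgraph dgraph/dgraph:v20.07.2 \
  dgraph zero --my=zero:5080 --replicas 3
for i in 1 2 3 4 5 6; do
  docker run -d --name alpha$i --network dgraph dgraph/dgraph:v20.07.2 \
    dgraph alpha --my=alpha$i:7080 --zero=zero:5080 --lru_mb 1024
done
```

Stopping containers (`docker stop alpha3`) simulates a crash; the member stays in the group and can rejoin, unlike a /removeNode.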

So when the second Alpha in a group is deleted, the cluster starts responding unpredictably as you showed above? How long does it continue to behave like that?

always

I have just tried five Alphas in each group (a Docker cluster). When I delete the fifth Alpha, the cluster becomes unstable.

The conclusion I have now is that the cluster can serve stable queries only if more than half of the Alphas in each group are healthy.
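That matches the Raft majority rule: a group of N replicas needs floor(N/2)+1 healthy members and therefore tolerates floor((N-1)/2) failures. Spelled out for the group sizes tried above:

```shell
# Raft quorum arithmetic: a group of n replicas needs n/2+1 healthy members
# (integer division), so it tolerates (n-1)/2 failures.
for n in 3 5; do
  echo "group of $n: quorum=$(( n/2 + 1 )), tolerated failures=$(( (n-1)/2 ))"
done
```

So a 3-replica group survives 1 failure and a 5-replica group survives 2, which is exactly the behavior observed.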


But in k8s, any Alpha failure brings instability. From the StatefulSet details, I guess it may be because k8s always thinks the Alpha pods are running.

First I removed four Alphas in Ratel; then I told k8s that I only need 6 Alphas now.

kubectl -n graphql scale statefulset.apps/graphql-dgraph-alpha --replicas=6

kubectl -n graphql delete pod graphql-dgraph-alpha-9
kubectl -n graphql delete pod graphql-dgraph-alpha-8
kubectl -n graphql delete pod graphql-dgraph-alpha-4
kubectl -n graphql delete pod graphql-dgraph-alpha-3

Then pods 8 and 9 do not restart, but pods 3 and 4 restart continuously and always fail.

The alpha statefulset details:

Replicas:           6 desired | 10 total
Update Strategy:    RollingUpdate
Pods Status:        8 Running / 0 Waiting / 0 Succeeded / 0 Failed

This approach does not seem to work; k8s always creates StatefulSet pods in ordinal order.
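Right, StatefulSets always scale down from the highest ordinal, so removing arbitrary members (3 and 4 here) leaves a mismatch. One way around this, sketched under the assumption that the Dgraph-side removal targets exactly the pods k8s will delete; the group/ID values are examples and must be looked up in Zero’s /state first:

```shell
# StatefulSets delete the highest ordinals on scale-down, so first remove
# the highest-numbered Alphas from the Dgraph cluster (example Raft IDs;
# check Zero's /state for the real group and id of each pod), ...
curl -s "localhost:6080/removeNode?group=1&id=9"   # example: alpha-9
curl -s "localhost:6080/removeNode?group=2&id=8"   # example: alpha-8

# ...then tell k8s to drop those same ordinals.
kubectl -n graphql scale statefulset graphql-dgraph-alpha --replicas=8
```

That way the pods Dgraph forgets and the pods k8s deletes are the same ones, instead of 3 and 4 crash-looping while 8 and 9 stay removed.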

If I delete /dgraph/doneinit as I did before and then delete the pod, the pod does not restart continuously, but the cluster is still unstable.

So it seems like there may be something wrong with your k8s config, because the behavior you are seeing is unexpected. @joaquin can help with that.

@pawan When I run the cluster with Docker, do you think this conclusion is correct: for a group of 5 Alphas, the cluster becomes unavailable after deleting 3?

For writes, a majority of the nodes needs to be up, so 3 out of 5. Best-effort reads are possible even when 3 out of 5 nodes are down, but linearizable reads require the majority of nodes in the group to be up.
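For comparison with the default requests used above, a best-effort read can be asked for over Alpha’s HTTP API with the `be=true` (and `ro=true` for read-only) query parameters. A sketch; the content type shown is the one v20.07 expects for DQL, and the query itself is just an example:

```shell
# Best-effort, read-only query against Alpha's HTTP port (8080).
# be=true lets Alpha answer from its local state without fetching the
# latest timestamp, so it can respond even when the group lacks quorum.
curl -s -H "Content-Type: application/graphql+-" \
  "localhost:8080/query?be=true&ro=true" \
  -d '{ q(func: has(name), first: 5) { uid name } }'
```

A default query, by contrast, is linearizable and will block or fail once a group loses its majority.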

Thanks for your answer.