Problems with node removal

I use a Dgraph cluster deployed with Helm; the Dgraph version is v20.07.2.

When 1 or 2 nodes are removed, roughly 1 in 5 to 1 in 3 queries return an error. When 3 or 4 nodes are removed, the cluster is almost unavailable.

This is the status and log of the pod:

You are using replica count 3 with 6 Alphas, which means you have two groups. Dgraph needs a quorum, at least 2 of the 3 nodes in each group, to maintain availability, so if you remove two nodes from the same group, that group will fail. By going around removing nodes beyond that, you are killing your cluster.
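A quick way to see how the Alphas are split into groups is Zero's /state endpoint. A minimal sketch, assuming Zero's HTTP port 6080 is reachable from where you run it and that `jq` is installed:

```shell
# Ask Dgraph Zero for cluster state and list each group's members.
# "groups" maps group ID -> members (keyed by Raft ID) -> address.
curl -s localhost:6080/state | jq '.groups | map_values(.members | map_values(.addr))'
```

With replica 3 and 6 Alphas you should see exactly two groups of three members each.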

Also, you are removing the nodes from the cluster setup, which means they won’t recover. Nodes are recoverable only if you don’t manually remove them from the cluster setup.
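For reference, "removing from the cluster setup" means calling Zero's /removeNode endpoint; that is the operation that makes recovery impossible. A sketch, where the group and Raft ID are made-up examples (look up the real ones in /state):

```shell
# Permanently remove the Alpha with Raft ID 3 from group 1 via Zero.
# WARNING: a removed node cannot rejoin under the same ID; a replacement
# must be started with a clean data directory and will get a new ID.
curl -s "localhost:6080/removeNode?group=1&id=3"
```

If you only stop or crash the pod without calling this, the member is still part of the group and can come back.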

Maybe you are doing or saying something else, can you clarify?

Hey @dmai, @pawan, I see something about this here. He’s running repeated queries while removing nodes to demonstrate the problem. This looks like another problem that has already come up internally.

Although he is forcing a removal, it seems that when the node is eliminated, Zero (maybe?) tries to communicate with the lost Alpha and then returns an empty query response.

I think that’s the problem, even though the circumstances are wrong. @Valdanito, you should force a crash instead of removing the node. Still, it shouldn’t happen, I think.

@Valdanito are you using Read Only and Best Effort?

I’m just using the default request method.

I’ve tried removing the /dgraph/doneinit file first and then deleting the pod, but even if only one pod is removed, the cluster becomes unstable.

I don’t want to modify the cluster setup in the helm file. I just want to verify under what conditions the dgraph HA cluster is stable.

Yeah, this looks like a bug. Since two out of three nodes in a group are always available, the queries should return a consistent response. @Valdanito, could you also try this on the v20.11-rc (https://github.com/dgraph-io/dgraph/releases/tag/release%2Fv20.11-rc1)? If it still happens for you, we can look into it.

OK, I’ll try.

I tried starting the cluster with Docker: 2 groups with 3 Alphas in each group. When I stop one Alpha in each group, the cluster still works, as you said. When a third Alpha is deleted, the cluster becomes unstable.
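For anyone reproducing this, a 2-group Docker cluster like that can be brought up with one Zero running with --replicas 3 and six Alphas. A minimal sketch, not the exact setup used above; names and the --lru_mb value are illustrative:

```shell
# One Zero with replication factor 3: six Alphas will form two groups of three.
docker network create dgraph
docker run -d --name zero --network dgraph dgraph/dgraph:v20.07.2 \
  dgraph zero --my=zero:5080 --replicas 3
for i in 1 2 3 4 5 6; do
  docker run -d --name alpha$i --network dgraph dgraph/dgraph:v20.07.2 \
    dgraph alpha --my=alpha$i:7080 --zero=zero:5080 --lru_mb 1024
done
```

Stopping containers (`docker stop alpha3`) simulates a crash; the member stays in the group and can rejoin, unlike a /removeNode.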

So when the second Alpha in a group is deleted, the cluster starts responding unpredictably as you showed above? How long does it continue to behave like that?

always

I have just tried five Alphas in each group (a Docker cluster). When I delete the fifth Alpha, the cluster becomes unstable.

The conclusion I have now is that the cluster can serve stable queries only if more than half of the Alphas in each group are healthy.
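That matches the Raft majority rule: a group of N replicas needs floor(N/2)+1 healthy members and therefore tolerates floor((N-1)/2) failures. Spelled out for the group sizes tried above:

```shell
# Raft quorum arithmetic: a group of n replicas needs n/2+1 healthy members
# (integer division), so it tolerates (n-1)/2 failures.
for n in 3 5; do
  echo "group of $n: quorum=$(( n/2 + 1 )), tolerated failures=$(( (n-1)/2 ))"
done
```

So a 3-replica group survives 1 failure and a 5-replica group survives 2, which is exactly the behavior observed.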


But in k8s, any Alpha failure brings instability. From the StatefulSet details, I guess it may be because k8s always thinks the Alpha pods are running.

First I removed four Alphas in Ratel; then I told k8s that I only need 6 Alphas now.

kubectl -n graphql scale statefulset.apps/graphql-dgraph-alpha --replicas=6

kubectl -n graphql delete pod graphql-dgraph-alpha-9
kubectl -n graphql delete pod graphql-dgraph-alpha-8
kubectl -n graphql delete pod graphql-dgraph-alpha-4
kubectl -n graphql delete pod graphql-dgraph-alpha-3

Then pods 8 and 9 do not restart, but pods 3 and 4 restart continuously and always fail.

The alpha statefulset details:

Replicas:           6 desired | 10 total
Update Strategy:    RollingUpdate
Pods Status:        8 Running / 0 Waiting / 0 Succeeded / 0 Failed

This approach does not seem to work; k8s always creates StatefulSet pods in ordinal order.
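Right, StatefulSets always scale down from the highest ordinal, so removing arbitrary members (3 and 4 here) leaves a mismatch. One way around this, sketched under the assumption that the Dgraph-side removal targets exactly the pods k8s will delete; the group/ID values are examples and must be looked up in Zero’s /state first:

```shell
# StatefulSets delete the highest ordinals on scale-down, so first remove
# the highest-numbered Alphas from the Dgraph cluster (example Raft IDs;
# check Zero's /state for the real group and id of each pod), ...
curl -s "localhost:6080/removeNode?group=1&id=9"   # example: alpha-9
curl -s "localhost:6080/removeNode?group=2&id=8"   # example: alpha-8

# ...then tell k8s to drop those same ordinals.
kubectl -n graphql scale statefulset graphql-dgraph-alpha --replicas=8
```

That way the pods Dgraph forgets and the pods k8s deletes are the same ones, instead of 3 and 4 crash-looping while 8 and 9 stay removed.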

If I delete /dgraph/doneinit as I did before and then delete the pod, the pod does not restart continuously, but the cluster is still unstable.

So it seems like there may be something wrong with your k8s config, because the behavior you are seeing is unexpected. @joaquin can help with that.

@pawan When I run the cluster with Docker, do you think this conclusion is correct: for a group of 5 Alphas, the cluster becomes unavailable after deleting 3?

For writes, a majority of the nodes needs to be up, so 3 out of 5. Best-effort reads are possible even when 3 out of 5 nodes are down, but linearizable reads require the majority of nodes in the group to be up.
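For comparison with the default requests used above, a best-effort read can be asked for over Alpha’s HTTP API with the `be=true` (and `ro=true` for read-only) query parameters. A sketch; the content type shown is the one v20.07 expects for DQL, and the query itself is just an example:

```shell
# Best-effort, read-only query against Alpha's HTTP port (8080).
# be=true lets Alpha answer from its local state without fetching the
# latest timestamp, so it can respond even when the group lacks quorum.
curl -s -H "Content-Type: application/graphql+-" \
  "localhost:8080/query?be=true&ro=true" \
  -d '{ q(func: has(name), first: 5) { uid name } }'
```

A default query, by contrast, is linearizable and will block or fail once a group loses its majority.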

Thanks for your answer.