I use a Dgraph cluster deployed by Helm, and the Dgraph version is v20.07.2.
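For context, a deployment with the official chart looks roughly like this; the release name, namespace, and the alpha.replicaCount key are illustrative and may differ from the chart version you are on:
helm repo add dgraph https://charts.dgraph.io
helm repo update
helm install graphql dgraph/dgraph --namespace graphql --create-namespace --set alpha.replicaCount=6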
When 1 or 2 nodes are removed, roughly 1 in 5 to 1 in 3 queries return an error. When 3 or 4 nodes are removed, the cluster is almost unavailable.
You are using replica 3 with 6 Alphas, which means you have two groups. If you remove two nodes from a group, that group will fail: Dgraph needs at least 2 of the 3 nodes in a group to keep a quorum and maintain availability. By removing nodes like this, you are killing your cluster.
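To put numbers on it: a 3-replica group needs ⌊3/2⌋ + 1 = 2 members alive to keep a quorum, so each group can survive the loss of one Alpha, but not two.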
Also, you are removing the nodes from the cluster setup, which means they won't recover. Nodes are only recoverable if you don't manually remove them from the cluster.
Maybe you are doing or describing something else; can you clarify?
Hey @dmai, @pawan, I see something relevant here. He is running repeated queries while removing nodes to demonstrate the problem. This looks like a problem that has already come up internally.
Although he is forcing a removal, it seems that when the node is removed, Zero (maybe?) keeps trying to communicate with the lost Alpha, and an empty query result is returned.
I think that is the problem, even if the circumstances are wrong. @Valdanito, you should force a crash instead of removing the node. Even so, I don't think this should happen.
@Valdanito, are you using read-only and best-effort queries?
Yeah, this looks like a bug. Since two out of three nodes in a group are still available, queries should return a consistent response. @Valdanito, could you also try this on the v20.11 release candidate (release/v20.11-rc1 · dgraph-io/dgraph · GitHub)? If it still happens for you, we can look into it.
I tried starting the cluster with Docker: 2 groups with 3 Alphas in each group. When I stop one Alpha in each group, the cluster can still be used, as you said. When the third Alpha is deleted, the cluster becomes unstable.
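Roughly, the test looked like this (a sketch; the image tag, container names, and networking details are simplified):
# Zero started with --replicas 3, so six Alphas split into two groups of three
docker run -d --name zero dgraph/dgraph:v20.11.0 dgraph zero --replicas 3
# ... six Alpha containers (alpha1..alpha6) pointed at this Zero ...
docker stop alpha1   # one Alpha from group 1: queries still fine
docker stop alpha4   # one Alpha from group 2: queries still fine
docker stop alpha2   # second Alpha from group 1: that group loses quorum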
So when the second Alpha in a group is deleted, the cluster starts responding unpredictably as you show above? How long does it continue to behave like that?
I have just tried five Alphas in each group (Docker cluster). When I delete the fifth Alpha, the cluster becomes unstable.
My conclusion so far is that the cluster can serve stable queries only if more than half of the Alphas in each group are healthy.
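In Raft terms that is just the quorum rule: a 5-replica group needs ⌊5/2⌋ + 1 = 3 healthy members, so it tolerates 2 failures but not 3.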
But in k8s, any Alpha failure brings instability. Judging from the StatefulSet details, I guess it may be because k8s still thinks the Alpha pod is running.
First I removed four Alphas in Ratel, then I told k8s that I only need 6 Alphas now:
kubectl -n graphql scale statefulset.apps/graphql-dgraph-alpha --replicas=6
kubectl -n graphql delete pod graphql-dgraph-alpha-9
kubectl -n graphql delete pod graphql-dgraph-alpha-8
kubectl -n graphql delete pod graphql-dgraph-alpha-4
kubectl -n graphql delete pod graphql-dgraph-alpha-3
After that, pods 8 and 9 do not restart, but pods 3 and 4 keep restarting and always fail.
So it seems like there may be something wrong with your k8s config, because the behavior you are seeing is unexpected. @joaquin can help with that.
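One thing worth checking is what Zero itself reports about membership. Zero exposes a /state endpoint on its HTTP port (6080 by default); the pod name below is a guess based on your Alpha pod names:
kubectl -n graphql port-forward graphql-dgraph-zero-0 6080:6080
curl -s localhost:6080/state
That shows which Alphas each group still lists as members, which should make it clearer whether Zero or the StatefulSet is out of sync.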
@pawan, when I run the cluster with Docker, do you think this conclusion is correct: for a group of 5 Alphas, the cluster becomes unavailable after deleting 3?
For writes, a majority of the nodes in the group needs to be up, so 3 out of 5. Best-effort reads are possible even if 3 out of 5 nodes are down, but linearizable reads require a majority of the nodes to be up.
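If you want to check which read path you are hitting, the HTTP /query endpoint accepts ro and be parameters for read-only and best-effort queries, if I remember the parameter names right; the predicate name below is just a placeholder for something in your schema:
curl -s -H 'Content-Type: application/graphql+-' 'localhost:8080/query?ro=true&be=true' -d '{ q(func: has(name), first: 1) { uid } }'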
In 20.11.0, I deleted two Alphas in each group (of 3 Alphas) of the Docker cluster, and the cluster is still available.
But in the k8s cluster, even if only one Alpha pod is deleted, the cluster becomes unstable.
I first remove the node in Ratel's cluster management interface, and then delete the /dgraph/doneinit file in the pod, so that the pod will not restart automatically.
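For completeness, deleting that file is just an exec into the pod (the pod name here is only an example):
kubectl -n graphql exec graphql-dgraph-alpha-3 -- rm /dgraph/doneinit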