We have a six-node cluster running v21.03.1 in AWS: three Zeros and three Alphas. The Zeros and the Alphas each sit behind their own AWS Auto Scaling group (ASG). The Zero replicas flag is set to 3, so there is only one Dgraph group.
We had an Alpha server die unexpectedly.
The ASG launched a new server to replace it. Since the server that died was not removed from the cluster cleanly using the removeNode API and the replicas flag is set to 3, the new server was launched into a second Dgraph group.
The predicates were rebalanced and some of the tablets/predicates were moved to this new server in Group 2.
According to the /state API, group 2 consisted of only one server, the new one.
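For anyone who wants to check for the same condition, here is a minimal sketch that scans a parsed /state response for groups with fewer members than the replicas setting. The JSON shape (`groups` -> group ID -> `members` -> member ID -> `addr`) is my reading of the /state output and should be treated as an assumption; the sample data and addresses are made up for illustration.

```python
def under_replicated_groups(state, replicas=3):
    """Return {group_id: [member addrs]} for groups with fewer than `replicas` members.

    `state` is the parsed JSON body of Zero's /state endpoint.
    The key names used here ("groups", "members", "addr") are assumed.
    """
    result = {}
    for gid, group in (state.get("groups") or {}).items():
        members = group.get("members") or {}
        if len(members) < replicas:
            result[gid] = sorted(m.get("addr", "") for m in members.values())
    return result


# Hypothetical sample resembling the situation described above.
sample_state = {
    "groups": {
        "1": {"members": {
            "1": {"addr": "alpha-a:7080"},
            "2": {"addr": "alpha-b:7080"},
        }},
        "2": {"members": {
            "4": {"addr": "alpha-new:7080"},
        }},
    }
}

short_groups = under_replicated_groups(sample_state, replicas=3)
```

In this made-up sample both groups are reported, because Group 1 is also one member short after the original Alpha died.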
Since Group 2 had only one server, it seemed unable to serve some queries. For example, when a certain query was sent to the Alphas via the load balancer, the response sporadically contained only the Dgraph types and no data.
To investigate, I ran the query against each of the three servers individually using curl. The new server in Group 2 did not return any results, whereas the other two servers did.
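The per-server comparison can be sketched roughly as follows. This is not what I actually ran (I used curl); it is an illustration that posts the same query to each Alpha and reports which servers disagree with the majority. The `/query` path, the `application/dql` content type, and the server URLs are assumptions, and the comparison is order-sensitive, so an unordered query could produce false positives.

```python
import json
import urllib.request
from collections import Counter


def fetch(alpha_url, query):
    """POST a DQL query to one Alpha and return the parsed JSON response.

    Assumes the Alpha HTTP endpoint is <alpha_url>/query and accepts
    Content-Type: application/dql.
    """
    req = urllib.request.Request(
        alpha_url.rstrip("/") + "/query",
        data=query.encode("utf-8"),
        headers={"Content-Type": "application/dql"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def find_outliers(responses):
    """Given {server name: parsed response}, return servers whose "data"
    differs from the most common "data" value."""
    canon = {
        name: json.dumps(resp.get("data"), sort_keys=True)
        for name, resp in responses.items()
    }
    majority, _ = Counter(canon.values()).most_common(1)[0]
    return sorted(name for name, c in canon.items() if c != majority)


# Hypothetical responses: two Alphas return data, the third returns none.
example = {
    "alpha-a": {"data": {"q": [{"uid": "0x1", "name": "x"}]}},
    "alpha-b": {"data": {"q": [{"uid": "0x1", "name": "x"}]}},
    "alpha-new": {"data": {"q": []}},
}
outliers = find_outliers(example)
```

In practice `responses` would be built with something like `{url: fetch(url, query) for url in alpha_urls}`, where `alpha_urls` lists each Alpha's address directly rather than the load balancer.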
I took the following actions:
- Moved the tablets on the Group 2 server back to Group 1 using the moveTablet API.
- Removed both the server that had died and the new Group 2 server from the cluster metadata using the removeNode API.
- Terminated the Group 2 server.
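For reference, the first two steps above amount to a series of HTTP GET requests against a Zero. The sketch below only builds the URLs. The query parameters (`tablet` and `group` for moveTablet, `id` and `group` for removeNode) are as I understand the Zero HTTP API, and the Zero address, node IDs, and tablet names are placeholders; tablet names should be copied exactly as they appear in /state.

```python
from urllib.parse import urlencode


def move_tablet_url(zero, tablet, group):
    """URL asking Zero to move one tablet (predicate) to the given group."""
    return f"{zero}/moveTablet?" + urlencode({"tablet": tablet, "group": group})


def remove_node_url(zero, node_id, group):
    """URL asking Zero to remove one node from the given group's membership."""
    return f"{zero}/removeNode?" + urlencode({"id": node_id, "group": group})


def cleanup_plan(zero, tablets, target_group, nodes_to_remove):
    """Build the ordered list of requests: move tablets first, then remove nodes.

    `nodes_to_remove` is a list of (node_id, group_id) pairs.
    """
    urls = [move_tablet_url(zero, t, target_group) for t in tablets]
    urls += [remove_node_url(zero, nid, gid) for nid, gid in nodes_to_remove]
    return urls


# Placeholder values: Zero on localhost:6080, dead Alpha id 3 in group 1,
# replacement Alpha id 4 in group 2.
plan = cleanup_plan(
    "http://localhost:6080",
    tablets=["name", "email"],
    target_group=1,
    nodes_to_remove=[(3, 1), (4, 2)],
)
```

Each URL in `plan` can then be issued with curl or `urllib.request.urlopen`, one at a time, checking /state between steps.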
I did not see a way to remove the empty Group 2 from the cluster metadata.
When the ASG launched a replacement server and it joined the cluster, it again joined Group 2, and I ran into the same issue: Dgraph moved predicates to the Group 2 server, which was then unable to respond to certain queries.
It seems like I will need to rebuild the cluster to cleanly resolve this, but if an Alpha server dies again, I’ll run into the same issue. I’m open to suggestions to avoid this.
My questions are:
- Should one of the other pre-existing servers have been added to Group 2 as a replica?
- If a Dgraph group does not have a quorum, should Dgraph rebalance predicates to its servers?
- How can an empty group be removed?
- I saw a reference from @dmai about disabling rebalancing. How is that done after a cluster is created? Is it done by restarting the Zero processes with the --rebalance_interval duration set to zero?
Thank you.