My company has been PoCing a dgraph cluster for the last couple of weeks, and I wanted to ask a few questions about some behavior we have been seeing, and if it is expected. Our cluster configuration is as follows:
HA-Setup, 3-node cluster. 1 zero and 1 alpha instance per node.
Because we are using instances with ephemeral volumes, anytime one of our nodes goes down, the disk state of that node is completely lost. This may be important to my questions.
If we lose a single node (EC2 instance crashes/restarts) and then join that same (ip address) node back to the dgraph cluster and forget to /removeNode, the cluster appears to become all kinds of confused. The restarted instance appears to become its own (singleton) cluster, and the original zero leader no longer responds to /admin requests. If we remember to /removeNode before replacing the failed node, and give the node a new IDX, it recovers gracefully. Is it expected behavior that the entire cluster would get into an unrecoverable state in this scenario?
If we lose 2 nodes (even if both nodes were not the leader), the cluster becomes unusable. We cannot query the cluster, the zero leader stops responding to /admin requests, and we are never able to recover. Is this expected behavior?
I believe a lot of this is due to the failed nodes losing their disk state, and the replacment node having to act as a new node. But, in the production checklist document it is recommended to use instances with ephemeral disks. We can work around these things, but I wanted to first ensure this was expected behavior. These behaviors make our cluster very fragile, and make it prone to complete failures.
Calling /removeNode is the right thing to do. If a machine fails in such a way where the data directories are lost, then /removeNode must be called on the existing cluster to clean up the original membership. Otherwise, it’s expected that the original instance can come back up healthy again.
A majority of the group must be up to service requests. So, with 3 Alphas in a group, at least 2 (majority of 3) must be up and running. Similarly, 2 of the 3 Zeros must be up as well. a 6-node Dgraph cluster (3 Zeros, 3 Alphas) is resilient to losing 1 Zero and/or 1 Alpha instance and the remaining cluster instances can still serve requests.
For question (1), I wanted to add further, with zero nodes, the idx has to be managed explicitly, so if the zero2 with idx=2 was /removeNode, you would add a new zero with idx=4.
For zero wal and alpha postings and wal, persistence is needed to have recoverability. The state of the groups will need to be maintained, such as membership, timestamps, etc. and you losing such state is destructive. For alphas, obviously the db data, so if a node is lost, and db data is not persisted, db data will have to be restored/replicated from other alpha nodes.
So I recommend having a disk that can be re-attached. Thus if you had alpha1 that is lost and replaced, the external disk for alpha1 can be reattached. If alpha2 was lost/replaced, you’d need to reattach external disk for alpha2, and so forth for all nodes.