Questions about expected clustering behaviour

NickBOA · October 5, 2020, 2:36pm

Hello,

My company has been PoCing a dgraph cluster for the last couple of weeks, and I wanted to ask a few questions about some behavior we have been seeing, and if it is expected. Our cluster configuration is as follows:

HA-Setup, 3-node cluster. 1 zero and 1 alpha instance per node.
AWS c5d instance types which contain an ephemeral NVME volume. This instance type is recommended in the document here: https://dgraph.io/docs/deploy/production-checklist/
Our processes’ configurations are as follows:

dgraph zero --replicas=3 --idx=1 --my=zero1:5080
dgraph zero --replicas=3 --idx=2 --my=zero2:5080 --peer=zero1:5080
dgraph zero --replicas=3 --idx=3 --my=zero3:5080 --peer=zero1:5080

dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=1 --my=alpha1:7080 --zero=zero1:7080,zero2:7080,zero3:7080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=2 --my=alpha2:7080 --zero=zero1:7080,zero2:7080,zero3:7080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=3 --my=alpha3:7080 --zero=zero1:7080,zero2:7080,zero3:7080

Because we are using instances with ephemeral volumes, anytime one of our nodes goes down, the disk state of that node is completely lost. This may be important to my questions.

Questions

If we lose a single node (EC2 instance crashes/restarts) and then join that same (ip address) node back to the dgraph cluster and forget to /removeNode, the cluster appears to become all kinds of confused. The restarted instance appears to become its own (singleton) cluster, and the original zero leader no longer responds to /admin requests. If we remember to /removeNode before replacing the failed node, and give the node a new IDX, it recovers gracefully. Is it expected behavior that the entire cluster would get into an unrecoverable state in this scenario?
If we lose 2 nodes (even if both nodes were not the leader), the cluster becomes unusable. We cannot query the cluster, the zero leader stops responding to /admin requests, and we are never able to recover. Is this expected behavior?

I believe a lot of this is due to the failed nodes losing their disk state, and the replacment node having to act as a new node. But, in the production checklist document it is recommended to use instances with ephemeral disks. We can work around these things, but I wanted to first ensure this was expected behavior. These behaviors make our cluster very fragile, and make it prone to complete failures.

Thank you so much for your time!

joaquin · October 7, 2020, 11:07pm

NickBOA:

dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=1 --my=alpha1:7080 --zero=zero1:7080,zero2:7080,zero3:7080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=2 --my=alpha2:7080 --zero=zero1:7080,zero2:7080,zero3:7080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=3 --my=alpha3:7080 --zero=zero1:7080,zero2:7080,zero3:7080

As I am reading this, something jumped out immediately regarding configuration. First, with alphas, you don’t need to specify an idx explicitly. And second, more importantly, the zeros use 5080.

dgraph alpha --lru_mb=2048 --whitelist <redacted> --my=alpha1:7080 --zero=zero1:5080,zero2:5080,zero3:5080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --my=alpha2:7080 --zero=zero1:5080,zero2:5080,zero3:5080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --my=alpha3:7080 --zero=zero1:5080,zero2:5080,zero3:5080

dmai · October 7, 2020, 11:10pm

Calling /removeNode is the right thing to do. If a machine fails in such a way where the data directories are lost, then /removeNode must be called on the existing cluster to clean up the original membership. Otherwise, it’s expected that the original instance can come back up healthy again.

A majority of the group must be up to service requests. So, with 3 Alphas in a group, at least 2 (majority of 3) must be up and running. Similarly, 2 of the 3 Zeros must be up as well. a 6-node Dgraph cluster (3 Zeros, 3 Alphas) is resilient to losing 1 Zero and/or 1 Alpha instance and the remaining cluster instances can still serve requests.

joaquin · October 7, 2020, 11:33pm

For question (1), I wanted to add further, with zero nodes, the idx has to be managed explicitly, so if the zero2 with idx=2 was /removeNode, you would add a new zero with idx=4.

For zero wal and alpha postings and wal, persistence is needed to have recoverability. The state of the groups will need to be maintained, such as membership, timestamps, etc. and you losing such state is destructive. For alphas, obviously the db data, so if a node is lost, and db data is not persisted, db data will have to be restored/replicated from other alpha nodes.

So I recommend having a disk that can be re-attached. Thus if you had alpha1 that is lost and replaced, the external disk for alpha1 can be reattached. If alpha2 was lost/replaced, you’d need to reattach external disk for alpha2, and so forth for all nodes.

NickBOA · October 8, 2020, 12:22pm

Thank you so much for the response. These answers are what I suspected. Given this, would you say that the recommendation here to use EC2 instances with ephemeral disks is incorrect?

Topic		Replies	Views
Production instance is taking entire load for cluster Users	8	693	November 21, 2019
Why can't Dgraph work well when I kill one node in the cluster Users	10	2001	December 6, 2017
[disaster recovery] Cluster unable to recover after crash during intensive writing operations Dgraph dgraph , status:accepted , area:performance	12	1267	November 17, 2020
Dgraph Alpha Node unresponsive Dgraph	10	1140	September 10, 2022
Problems with node removal Dgraph cluster	14	1290	January 7, 2021

Questions about expected clustering behaviour

Related topics