My company has been running a PoC of a Dgraph cluster for the last couple of weeks, and I wanted to ask a few questions about some behavior we have been seeing and whether it is expected. Our cluster configuration is as follows:
- HA setup: 3-node cluster, with 1 Zero and 1 Alpha instance per node.
- c5d instance types, which include an ephemeral NVMe volume. This instance type is recommended in the production checklist: https://dgraph.io/docs/deploy/production-checklist/
- Our processes’ configurations are as follows:
```
dgraph zero --replicas=3 --idx=1 --my=zero1:5080
dgraph zero --replicas=3 --idx=2 --my=zero2:5080 --peer=zero1:5080
dgraph zero --replicas=3 --idx=3 --my=zero3:5080 --peer=zero1:5080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=1 --my=alpha1:7080 --zero=zero1:5080,zero2:5080,zero3:5080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=2 --my=alpha2:7080 --zero=zero1:5080,zero2:5080,zero3:5080
dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=3 --my=alpha3:7080 --zero=zero1:5080,zero2:5080,zero3:5080
```
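(For reference, each Zero's view of cluster membership can be dumped from its /state endpoint; this assumes Zero's default HTTP port, 6080.)

```
# Dump Zero's view of the cluster; the "zeros" and "groups" keys show
# which members this Zero believes are part of the cluster.
# 6080 is Zero's default HTTP port; adjust if yours differs.
curl -s localhost:6080/state
```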
- Because we are using instances with ephemeral volumes, anytime one of our nodes goes down, the disk state of that node is completely lost. This may be important to my questions.
If we lose a single node (the EC2 instance crashes/restarts) and then join that same node (same IP address) back to the Dgraph cluster without remembering to /removeNode first, the cluster appears to become all kinds of confused. The restarted instance appears to form its own (singleton) cluster, and the original Zero leader no longer responds to /admin requests. If we do remember to /removeNode before replacing the failed node, and give the replacement a new idx, the cluster recovers gracefully (a sketch of that procedure is below). Is it expected behavior that the entire cluster gets into an unrecoverable state in this scenario?
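For context, the /removeNode recovery that does work for us looks roughly like this; the id, group, and replacement idx values below are illustrative, and 6080 is Zero's default HTTP port:

```
# Remove the dead Alpha from its group before its replacement joins
# (id and group must match the failed member; the values here are illustrative)
curl "localhost:6080/removeNode?id=2&group=1"

# Start the replacement with a fresh idx so it joins as a brand-new member
dgraph alpha --lru_mb=2048 --whitelist <redacted> --idx=4 --my=alpha2:7080 \
  --zero=zero1:5080,zero2:5080,zero3:5080
```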
If we lose 2 nodes (even if neither of them was the leader), the cluster becomes unusable: we cannot query it, the Zero leader stops responding to /admin requests, and we are never able to recover. Is this expected behavior?
I believe a lot of this comes down to the failed nodes losing their disk state, which forces each replacement to join as a brand-new node. But the production checklist document linked above recommends instances with ephemeral disks. We can work around these issues, but I wanted to first confirm that this is expected behavior, because it leaves our cluster very fragile and prone to complete failures.
Thank you so much for your time!