What we want to do
After what appears to have been a corruption issue (discussed in this topic), my team is rethinking and rebuilding the Dgraph clusters for our production and pre-production environments. With access to a limited number of servers (some bare metal and some VMs), we want to determine the best setup for both environments to maximize performance and reliability.
What we did
Before our production cluster became corrupted, we had the following setup on the six (and only six) bare metal servers available to us. All of the Zeros and Alphas ran as containers in a Docker Swarm, and the Alphas listed below were explicitly assigned to four groups on creation (with three replicas each).
Alpha 1 (Group 1)
Alpha 2 (Group 2)
Alpha 3 (Group 1)
Alpha 4 (Group 3)
Alpha 5 (Group 1)
Alpha 6 (Group 4)
Alpha 7 (Group 2)
Alpha 8 (Group 3)
Alpha 9 (Group 2)
Alpha 10 (Group 4)
Alpha 11 (Group 3)
Alpha 12 (Group 4)
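As a sanity check on the assignment above, here is a minimal Python sketch that tallies the list. The dictionary simply transcribes the Alpha-to-group pairs from this post, and the variable names are ours for illustration; nothing here comes from Dgraph itself.

```python
from collections import Counter

# Alpha number -> group number, transcribed from the list above.
ALPHA_GROUPS = {
    1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 4,
    7: 2, 8: 3, 9: 2, 10: 4, 11: 3, 12: 4,
}

# The post states the cluster ran on six bare metal servers.
BARE_METAL_SERVERS = 6

# How many Alphas (replicas) ended up in each group.
replicas_per_group = Counter(ALPHA_GROUPS.values())

# Average number of Alpha containers per physical server.
alphas_per_server = len(ALPHA_GROUPS) // BARE_METAL_SERVERS

for group, count in sorted(replicas_per_group.items()):
    print(f"Group {group}: {count} replicas")
print(f"Alphas per server: {alphas_per_server}")
```

The tally shows four groups with three replicas each, which works out to two Alpha containers on every one of the six servers.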
Our pre-production cluster also runs as a Docker Swarm, but on four VMs instead of the bare metal servers above. Groups are not specified in this setup, but replicas are set to three, so all the Alphas are placed in Group 1. This cluster is very important for the time being, as our live production application is currently set up to use the pre-production database. It looks like this:
The machine requirements in the documentation (here) specifically state that multiple Dgraph Zero or Dgraph Alpha processes should not be run on the same machine. Since our Dgraph nodes in both environments run as Docker containers, does this requirement still apply? If so, why?
As you can see from our former production setup above, we were breaking that rule by running two Alpha containers on each of our six bare metal servers. We thought that this would increase performance by splitting the predicates up into four HA groups instead of just two. With an ingest process that uses several different upserts for each piece of data, is there a performance advantage to using a setup like this? What are the disadvantages, and do they outweigh the advantages?
As more data than ever before is loaded into our pre-production system hosted on VMs, we have noticed the rate of ingestion slowing down considerably. We see the following errors and warnings over and over again in the Alpha logs, and we believe they are contributing to this slowdown. Could you please explain the probable cause of these errors?
dgraph_alpha1 | E0502 13:25:11.319806 15 node.go:519] Error while calling IsPeer rpc error: code = DeadlineExceeded desc = context deadline exceeded. Reporting 1 as unreachable.
dgraph_alpha1 | W0502 13:25:11.319910 15 node.go:424] Unable to send message to peer: 0x1. Error: while calling IsPeer 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph_alpha1 | W0502 13:25:12.285382 15 pool.go:267] CONN: No echo to alpha3:7082 for 2562047h47m16.854775807s. Cancelling connection heartbeats.
dgraph_alpha1 | I0502 13:25:13.483518 15 pool.go:327] CONN: Re-established connection with alpha3:7082.
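One detail in the pool.go warning stood out to us: the "No echo" duration of 2562047h47m16.854775807s is roughly 292 years. The Python sketch below converts that string to nanoseconds and compares it with the largest signed 64-bit integer, which is the maximum value of Go's time.Duration. The helper name go_duration_to_ns is ours, and it only handles the h/m/s units that appear in this log line. We have not checked this against the Dgraph source, so treat the interpretation as a guess.

```python
import re

NS_PER_SECOND = 1_000_000_000
MAX_INT64 = 2**63 - 1  # largest value a Go time.Duration can hold, in ns


def go_duration_to_ns(text: str) -> int:
    """Convert a Go-style duration such as '1h2m3.5s' to whole nanoseconds.

    Only the h, m and s units are supported; integer arithmetic is used
    throughout so that no precision is lost on very large values.
    """
    pattern = r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)(?:\.(\d+))?s)?"
    match = re.fullmatch(pattern, text)
    if not text or match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    whole_seconds = (
        int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds or 0)
    )
    # Pad or trim the fractional part to exactly nine digits (nanoseconds).
    fraction_ns = int((fraction or "").ljust(9, "0")[:9])
    return whole_seconds * NS_PER_SECOND + fraction_ns


logged_ns = go_duration_to_ns("2562047h47m16.854775807s")
is_max_duration = logged_ns == MAX_INT64
print(logged_ns, is_max_duration)
```

The logged value is exactly 2^63 - 1 nanoseconds. That suggests the figure is a saturated or uninitialized value (for example, a time difference taken against a never-set timestamp) rather than a real elapsed time, so the warning may say less about how long alpha3 was actually silent than it appears to.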
We’ve considered moving our pre-production cluster onto the more performant bare metal servers as well, instead of keeping it on VMs. If we did this, we would need multiple Alphas on the same underlying host machine to support the two different HA and sharded clusters. Does the one-process-per-machine requirement from our first question also apply to two Alphas on the same machine that belong to completely different clusters?
Given the limited number of servers listed above and the performance issues noted, what cluster settings would you recommend for both our production and pre-production environments? Because we’re using pre-production for our application while production is being reconfigured, it’s imperative that both be as performant and reliable as possible.
Dgraph version : v22.0.2
Dgraph codename : dgraph
Dgraph SHA-256 : a11258bf3352eff0521bc68983a5aedb9316414947719920d75f12143dd368bd
Commit SHA-1 : 55697a4
Commit timestamp : 2022-12-16 23:03:35 +0000
Branch : release/v22.0.2
Go version : go1.18.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph , visit https://discuss.dgraph.io.
For fully-managed Dgraph Cloud , visit https://dgraph.io/cloud.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2022 Dgraph Labs, Inc.