Clarification on machine requirements for Dgraph

What we want to do

After what appears to have been a corruption issue, as discussed in this topic, my team is now rethinking and reinitializing our Dgraph clusters for our production and pre-production environments. With access to a limited number of servers (some bare metal and some VMs), we want to determine the best setup for both environments to maximize performance and reliability.

What we did

Before our production cluster became corrupted, we had the following setup on the six (and only six) bare metal servers we have access to. All of the Zeros and Alphas listed below ran as containers in a Docker Swarm, and the Alphas were explicitly assigned to these four groups at creation (with three replicas each).

Server 1
Zero 1
Alpha 1 (Group 1)
Alpha 2 (Group 2)

Server 2
Zero 2
Alpha 3 (Group 1)
Alpha 4 (Group 3)

Server 3
Zero 3
Alpha 5 (Group 1)
Alpha 6 (Group 4)

Server 4
Alpha 7 (Group 2)
Alpha 8 (Group 3)

Server 5
Alpha 9 (Group 2)
Alpha 10 (Group 4)

Server 6
Alpha 11 (Group 3)
Alpha 12 (Group 4)

Our pre-production cluster also runs as a Docker Swarm, but on four VMs instead of the bare metal servers above. Groups are not specified in this setup, but the replica count is set to three, so all the Alphas are placed in Group 1. This cluster is especially important for the time being, as our live production application is currently pointed at the pre-production database. It looks like this:

VM 1
Zero 1

VM 2
Zero 2
Alpha 1

VM 3
Alpha 2

VM 4
Zero 3
Alpha 3

Our questions

  1. The machine requirements in the documentation (here) specifically state that multiple Dgraph Zero or Dgraph Alpha processes should not be run on the same machine. Since our Dgraph nodes in both environments run as Docker containers, does this requirement still apply? If so, why?

  2. As you can see from our former production setup above, we were breaking that rule by running two Alpha containers on each of our six bare metal servers. We thought that this would increase performance by splitting the predicates up into four HA groups instead of just two. With an ingest process that uses several different upserts for each piece of data, is there a performance advantage to using a setup like this? What are the disadvantages, and do they outweigh the advantages?

  3. As more data than ever is loaded into our pre-production system hosted on VMs, we’ve noticed the rate of ingestion slowing down considerably. The following errors and warnings occur over and over in the Alpha logs, which we believe are contributing to the slowdown. Could you explain their probable cause?

    dgraph_alpha1    | E0502 13:25:11.319806      15 node.go:519] Error while calling IsPeer rpc error: code = DeadlineExceeded desc = context deadline exceeded. Reporting 1 as unreachable.
    dgraph_alpha1    | W0502 13:25:11.319910      15 node.go:424] Unable to send message to peer: 0x1. Error: while calling IsPeer 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    dgraph_alpha1    | W0502 13:25:12.285382      15 pool.go:267] CONN: No echo to alpha3:7082 for 2562047h47m16.854775807s. Cancelling connection heartbeats.
    dgraph_alpha1    | I0502 13:25:13.483518      15 pool.go:327] CONN: Re-established connection with alpha3:7082.
    
  4. We’ve considered moving our pre-production cluster to also be on the more performant bare metal servers, instead of on VMs. If we did this, we would need multiple Alphas on the same underlying host machine to support the two different HA and sharded clusters. Does the requirement listed in Question #1 also apply to two Alphas on the same machine running in completely different clusters?

  5. Given the limited number of servers listed above and the performance issues noted, what cluster settings would you recommend for our production and pre-production environments? Because we’re using pre-production to serve our application while production is reconfigured, it’s imperative that both be as performant and reliable as possible.

Thank you!!!

Dgraph metadata

dgraph version
Dgraph version   : v22.0.2
Dgraph codename  : dgraph
Dgraph SHA-256   : a11258bf3352eff0521bc68983a5aedb9316414947719920d75f12143dd368bd
Commit SHA-1     : 55697a4
Commit timestamp : 2022-12-16 23:03:35 +0000
Branch           : release/v22.0.2
Go version       : go1.18.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud   , visit https://dgraph.io/cloud.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2022 Dgraph Labs, Inc.

In general, all nodes/instances should be physically separate. You can run multiple instances on the same machine, but if something happens to that machine, the cluster may not recover, because it has lost several nodes at once. Zeros should not be colocated with groups of Alphas: when a machine hosting a Zero goes down, the remaining Zeros need to elect a new leader, and if not enough Zeros are available for that election, the cluster becomes unhealthy.
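To make the majority requirement concrete: a Raft group of n members needs ⌊n/2⌋ + 1 votes to elect a leader, so a 3-Zero ensemble survives only one lost machine, and colocating two Zeros on one host would mean that single host can break quorum. A quick sketch:

```python
def quorum(n: int) -> int:
    """Votes needed for a Raft group of n members to elect a leader."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members that can be lost while a majority still remains."""
    return n - quorum(n)

for n in (1, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 members tolerate 1 failure; 5 members tolerate 2.
```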

As you mentioned, running multiple shards can indeed increase performance. However, if multiple Alphas share a machine whose resources are not dimensioned for them, those gains may not materialize. The main disadvantage of running multiple instances on one machine is contention for resources such as CPU, memory, and I/O, which can degrade performance. It also increases the risk of data loss or cluster unavailability if a hardware failure or other problem affects that machine. The disadvantages may well outweigh the advantages, especially in terms of reliability and fault tolerance, so it is generally recommended to distribute instances across different machines to ensure better resource allocation and minimize the impact of hardware failures.

The errors you mentioned seem to be related to Alpha instances going offline, possibly due to overload. To address this issue, make sure that the data is being inserted in a balanced manner across the different instances. If you force only one Alpha to handle the ingestion workload, it may become overloaded.

In Dgraph’s live loader, there’s an option to provide multiple Alpha addresses. This allows the live loader to distribute the mutations more evenly across the Alphas, which can help alleviate the problem of overloading a single instance. Ensuring proper load balancing across all instances can help improve the ingestion rate.
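For example (the file name and service addresses here are placeholders for your own):

```shell
# Spread live-load mutations across all replicas instead of a single Alpha
dgraph live \
  --files data.rdf.gz \
  --alpha alpha1:9080,alpha2:9080,alpha3:9080 \
  --zero zero1:5080
```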

Yes. You can use VMs just fine too. I’d recommend KVM, perhaps managed through Proxmox, which also offers features like snapshots. You could also use a VM orchestrator such as Vagrant.

It’s not possible to give precise recommendations from the limited information here; a thorough analysis of your infrastructure and data model would be required to tailor suggestions to your specific use case. Each deployment is unique, and each customer has different needs, so there is no one-size-fits-all recommendation.

The best approach would be to conduct performance tests and benchmarking to identify the optimal configuration for your specific requirements. Since we cannot test every possible scenario, we cannot provide a definitive answer. However, some general recommendations to improve performance and reliability include ensuring proper resource allocation, load balancing, and distribution of instances across multiple machines.
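As a cheap first check before benchmarking, each Zero serves a JSON summary of the cluster at its HTTP port (the `/state` endpoint), which you can use to confirm that groups and replicas landed where you expect. A sketch against a hypothetical, heavily trimmed payload (addresses and member IDs are made up):

```python
import json

# Hypothetical /state payload trimmed to the fields we inspect;
# in practice, fetch it from any Zero, e.g. http://zero1:6080/state.
SAMPLE_STATE = json.loads("""
{
  "groups": {
    "1": {"members": {"1": {"addr": "alpha1:7080"},
                      "2": {"addr": "alpha3:7081"},
                      "3": {"addr": "alpha5:7082"}}},
    "2": {"members": {"4": {"addr": "alpha2:7083"},
                      "5": {"addr": "alpha7:7084"},
                      "6": {"addr": "alpha9:7085"}}}
  }
}
""")

def replicas_per_group(state: dict) -> dict:
    """Count the Alpha members in each group from a Zero /state payload."""
    return {gid: len(g.get("members", {}))
            for gid, g in state.get("groups", {}).items()}

print(replicas_per_group(SAMPLE_STATE))  # {'1': 3, '2': 3}
```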
