Why doesn't Dgraph keep working when I kill one node in the cluster?

I created a cluster with Docker. The docker-compose.yml file is shown below.

#####node1#####

dzero1:
  image: dgraph/dgraph
  ports:
    - 28888:8888
    - 28889:8889
  volumes:
    - /home/ceq/dgraph-cluster/h-1:/dgraph
  command: dgraphzero -w=wz1 --my=192.168.44.160:28888 --bindall=true -idx=1 --replicas=1
dgraph_graph1:
  image: dgraph/dgraph
  ports:
    - 28080:8080
    - 29080:9080
    - 21234:12345
  volumes:
    - /home/ceq/dgraph-cluster/h-1:/dgraph
  command: dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:21234 --bindall=true --idx=1 --memory_mb=2048 -groups=1

####node2####

dzero2:
  image: dgraph/dgraph
  ports:
    - 38888:8888
    - 38889:8889
  volumes:
    - /home/ceq/dgraph-cluster/h-2:/dgraph
  command: dgraphzero -w=wz2 --peer=192.168.44.160:28888 --my=192.168.44.160:38888 --bindall=true -idx=2 --replicas=1
dgraph_graph2:
  image: dgraph/dgraph
  ports:
    - 38080:8080
    - 39080:9080
    - 31234:12345
  volumes:
    - /home/ceq/dgraph-cluster/h-2:/dgraph
  command: dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:31234 --bindall=true --idx=2 --memory_mb=2048 -groups=1

####node3####

dzero3:
  image: dgraph/dgraph
  ports:
    - 48888:8888
    - 48889:8889
  volumes:
    - /home/ceq/dgraph-cluster/h-3:/dgraph
  command: dgraphzero -w=wz3 --peer=192.168.44.160:28888 --my=192.168.44.160:48888 --bindall=true -idx=3 --replicas=1
dgraph_graph3:
  image: dgraph/dgraph
  ports:
    - 48080:8080
    - 49080:9080
    - 41234:12345
  volumes:
    - /home/ceq/dgraph-cluster/h-3:/dgraph
  command: dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:41234 --bindall=true --idx=3 --memory_mb=2048 -groups=1

It works well when I run docker-compose up, but when I kill any one of the nodes, I can't query data from the remaining ones. The error is:

    dispatchTaskOverNetwork: while retrieving connection. error: Unhealthy connection

Is there any configuration I missed? Can anyone help?

Predicate data is split up between the 3 dgraph instances. When you shut one of them down, there is no longer enough data available to complete queries.

You can get around this by using replication. E.g. use a single dgraphzero, and 3 dgraph instances. When you start dgraphzero, set --replicas=3. That way, each instance will serve the same set of predicates and you will be able to survive one of the nodes going down.
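For reference, here is a minimal docker-compose sketch of that layout: a single dzero started with --replicas=3, and your three dgraph services all pointing at it. The host IP, ports, volume paths, and remaining flags are simply carried over from your compose file, so treat them as assumptions about your environment rather than required values.

dzero:
  image: dgraph/dgraph
  ports:
    - 28888:8888
    - 28889:8889
  volumes:
    - /home/ceq/dgraph-cluster/h-1:/dgraph
  # single zero; --replicas=3 is the only change from your dzero1 entry
  command: dgraphzero -w=wz1 --my=192.168.44.160:28888 --bindall=true -idx=1 --replicas=3
dgraph_graph1:
  image: dgraph/dgraph
  ports:
    - 28080:8080
    - 29080:9080
    - 21234:12345
  volumes:
    - /home/ceq/dgraph-cluster/h-1:/dgraph
  command: dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:21234 --bindall=true --idx=1 --memory_mb=2048 -groups=1
dgraph_graph2:
  image: dgraph/dgraph
  ports:
    - 38080:8080
    - 39080:9080
    - 31234:12345
  volumes:
    - /home/ceq/dgraph-cluster/h-2:/dgraph
  command: dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:31234 --bindall=true --idx=2 --memory_mb=2048 -groups=1
dgraph_graph3:
  image: dgraph/dgraph
  ports:
    - 48080:8080
    - 49080:9080
    - 41234:12345
  volumes:
    - /home/ceq/dgraph-cluster/h-3:/dgraph
  command: dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:41234 --bindall=true --idx=3 --memory_mb=2048 -groups=1

With -groups=1 kept from your original file, all three servers serve group 1, and --replicas=3 tells zero the group's data should be fully replicated across them, so any one server can go down without queries failing.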


Thanks for your help. We have recently been considering using Dgraph to replace Neo4j in our production environment. Is there any advice you can give? Thank you.


Dgraph is ready to use in production, and it's a great fit if you want scalability and performance.

Note that we are planning to make a new v0.9 release soon, which will have some big changes to the way clients interact with dgraph. Take a look over at Major changes in v0.9.

Thanks.

I'm also a little confused about the concepts of Replication and Group:
1. Is data only replicated between nodes that belong to the same group, or can it replicate across groups?
2. How should I configure the nodes so that the cluster keeps working when some nodes fail? The failed nodes could be:

  1. part of one group
  2. all of one group

I saw a passage on the official website that says:

Replication and Server Failure
Each group should typically be served by at least 3 servers, if available. In the case of a machine failure, the other servers serving the same group can still handle the load.

Each predicate will belong to exactly 1 group. Data replication is only between nodes in the same group. There is no cross-group replication.

E.g. you could have 4 nodes, split across 2 groups (2 nodes in each group). Replication would occur within each group, so that each edge in the graph is duplicated between 2 nodes.

The passage on the website is really talking about situations where you need HA (high availability). By setting up servers in groups of 3 (i.e. 3 per group), one server going down leaves 2 remaining servers, which can likely still handle the normal load.

If HA is critical for you, then you should run at least 3 dgraphzeros, and 3 dgraphs. If you are able to have more nodes, then you could run 3 dgraphzeros and 6 dgraphs split between 2 groups.

If all nodes in one group fail, then you will experience downtime until the nodes are brought back online.
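If it helps, here is a rough sketch of what that larger HA layout (3 dgraphzeros plus 6 dgraph servers split between 2 groups) could look like, reusing the flag style from your compose file. The ports for the three extra servers are made up for illustration, and the exact group assignment may depend on your Dgraph version:

# three zeros, as in your original file, but with --replicas=3
dgraphzero -w=wz1 --my=192.168.44.160:28888 --bindall=true -idx=1 --replicas=3
dgraphzero -w=wz2 --peer=192.168.44.160:28888 --my=192.168.44.160:38888 --bindall=true -idx=2 --replicas=3
dgraphzero -w=wz3 --peer=192.168.44.160:28888 --my=192.168.44.160:48888 --bindall=true -idx=3 --replicas=3

# six dgraph servers; with --replicas=3 they form 2 groups of 3,
# e.g. servers 1-3 serving group 1 and servers 4-6 serving group 2
dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:21234 --bindall=true --idx=1 --memory_mb=2048 -groups=1
dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:31234 --bindall=true --idx=2 --memory_mb=2048 -groups=1
dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:41234 --bindall=true --idx=3 --memory_mb=2048 -groups=1
dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:51234 --bindall=true --idx=4 --memory_mb=2048 -groups=2
dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:61234 --bindall=true --idx=5 --memory_mb=2048 -groups=2
dgraph --peer=192.168.44.160:28888 --my=192.168.44.160:71234 --bindall=true --idx=6 --memory_mb=2048 -groups=2

With this layout, losing any one server in a group still leaves 2 replicas for that group's predicates, and losing one zero still leaves 2 zeros; only losing all the servers of a group at once causes downtime.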


Thank you for the great help! This had confused me for two days; now I have a clear understanding of the concepts.

What if I kill the leader of the group?
It seems the cluster cannot find a leader and keeps looping through leader elections.

@daviddhc20120601, I haven't noticed that as a problem before. If the leader dies, a new leader should be elected; if not, then that's a bug. I'll investigate this, and you can follow along here: Can a cluster survive leadership death? · Issue #1721 · dgraph-io/dgraph · GitHub

@daviddhc20120601, I haven't been able to reproduce the leadership issue, so I've closed it as "can't reproduce". If you can reproduce the issue and have some logs, I can re-open the ticket and do some further investigation.
