Replacing zero and server nodes

I’ve built an HA cluster in Docker Swarm and tried to simulate failure situations.

The cluster runs on three hosts, with one zero node and one server node on each. All services have placement constraints so they won’t move between nodes. Server IDs are set explicitly, and data is stored in named volumes. See docker-compose.yml:

version: "3.4"

services:
  zero-1:
    image: dgraph/dgraph:v1.0.2
    hostname: "zero-1"
    command: dgraph zero -o -2000 --my=zero-1:5080 --replicas 3 --idx 1
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 6080:6080
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-1

  zero-2:
    image: dgraph/dgraph:v1.0.2
    hostname: "zero-2"
    command: dgraph zero -o -2000 --my=zero-2:5080 --replicas 3 --idx 2 --peer zero-1:5080
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-2

  zero-3:
    image: dgraph/dgraph:v1.0.2
    hostname: "zero-3"
    command: dgraph zero -o -2000 --my=zero-3:5080 --replicas 3 --idx 3 --peer zero-1:5080
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-3

  server-1:
    image: dgraph/dgraph:v1.0.2
    hostname: "server-1"
    command: dgraph server --my=server-1:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 8080:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-1

  server-2:
    image: dgraph/dgraph:v1.0.2
    hostname: "server-2"
    command: dgraph server --my=server-2:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 8081:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-2

  server-3:
    image: dgraph/dgraph:v1.0.2
    hostname: "server-3"
    command: dgraph server --my=server-3:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    ports:
      - 8082:8080
    networks:
      - dgraph
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-3

  ratel:
    image: dgraph/dgraph:v1.0.2
    command: dgraph-ratel
    networks:
      - dgraph
    ports:
      - 18049:8081

networks:
  dgraph:
    external: true

volumes:
  data:

After deploying the stack, some data is added via dgraph live: a subset of 1million.rdf.gz from the tour, about 10k triples.
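(For reference, the load was done roughly like this; the exact dgraph live flags can vary between Dgraph versions, so treat this as a sketch rather than the exact command used:)

# load the RDF subset through server-1's gRPC port, with zero-1 handing out UIDs and timestamps
dgraph live -r 1million.rdf.gz -d server-1:9080 --zero zero-1:5080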
So the cluster is running and has data in it. Let’s simulate a failure of node #3 by wiping all of its data, both zero and server:

[root@swarm-manager-3 ~]# service docker stop
[root@swarm-manager-3 ~]# rm -rf /var/lib/docker/volumes/dgraph_data/*
[root@swarm-manager-3 ~]# service docker start

After bringing up new zero and server nodes with the same IDs, hostnames and --my addresses, the cluster cannot recover, because the healthy zero nodes keep trying to reconnect to the wiped node (which no longer has its Raft logs):

dgraph_zero-1.1.mswcvo8xjl3i@swarm-manager-1    | 2018/01/25 17:02:29 pool.go:167: Echo error from zero-3:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
dgraph_zero-1.1.mswcvo8xjl3i@swarm-manager-1    | 2018/01/25 17:02:32 raft.go:531: While applying proposal: Invalid address
dgraph_zero-1.1.mswcvo8xjl3i@swarm-manager-1    | 2018/01/25 17:02:36 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
dgraph_zero-1.1.mswcvo8xjl3i@swarm-manager-1    | 2018/01/25 17:02:36 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | 2018/01/25 16:44:59 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | 2018/01/25 16:44:59 raft.go:708: INFO: 3 [term: 1] received a MsgHeartbeat message with higher term from 1 [term: 4]
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | 2018/01/25 16:42:09 raft.go:708: INFO: 3 [term: 1] received a MsgHeartbeat message with higher term from 1 [term: 4]
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | 2018/01/25 15:31:48 raft.go:708: INFO: 3 [term: 1] received a MsgHeartbeat message with higher term from 1 [term: 4]
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | 2018/01/25 15:31:48 raft.go:567: INFO: 3 became follower at term 4
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | 2018/01/25 16:44:59 raft.go:567: INFO: 3 became follower at term 4
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | 2018/01/25 16:42:09 raft.go:567: INFO: 3 became follower at term 4
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | 2018/01/25 16:42:09 logger.go:121: tocommit(150) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | 2018/01/25 15:31:48 logger.go:121: tocommit(150) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | 2018/01/25 16:44:59 logger.go:121: tocommit(150) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | panic: tocommit(150) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | panic: tocommit(150) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | panic: tocommit(150) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | 
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | 
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | 
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | goroutine 155 [running]:
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | goroutine 168 [running]:
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | goroutine 154 [running]:
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | log.(*Logger).Panicf(0xc420066a50, 0x13398f0, 0x5d, 0xc42027b0c0, 0x2, 0x2)
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | log.(*Logger).Panicf(0xc420066a50, 0x13398f0, 0x5d, 0xc420116900, 0x2, 0x2)
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | log.(*Logger).Panicf(0xc420066a50, 0x13398f0, 0x5d, 0xc4202226e0, 0x2, 0x2)
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | 	/usr/local/go/src/log/log.go:219 +0xdb
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | 	/usr/local/go/src/log/log.go:219 +0xdb
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | 	/usr/local/go/src/log/log.go:219 +0xdb
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | github.com/dgraph-io/dgraph/vendor/github.com/coreos/etcd/raft.(*DefaultLogger).Panicf(0xc42028b680, 0x13398f0, 0x5d, 0xc420116900, 0x2, 0x2)
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | github.com/dgraph-io/dgraph/vendor/github.com/coreos/etcd/raft.(*DefaultLogger).Panicf(0xc420289690, 0x13398f0, 0x5d, 0xc4202226e0, 0x2, 0x2)
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | github.com/dgraph-io/dgraph/vendor/github.com/coreos/etcd/raft.(*DefaultLogger).Panicf(0xc42028b670, 0x13398f0, 0x5d, 0xc42027b0c0, 0x2, 0x2)
dgraph_zero-3.1.zyoxrp21ocjv@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/vendor/github.com/coreos/etcd/raft/logger.go:121 +0x60
dgraph_zero-3.1.zz9yqea1plhc@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/vendor/github.com/coreos/etcd/raft/logger.go:121 +0x60
dgraph_zero-3.1.zvs2jaxfc6np@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/vendor/github.com/coreos/etcd/raft/logger.go:121 +0x60

Zero-3 is still present in the list of zero nodes (and stays there even after it goes down). Output of /state:

{"counter":"2333","groups":{"1":{"members":{"1":{"id":"1","groupId":1,"addr":"server-3:7080","lastUpdate":"1516890725"},"2":{"id":"2","groupId":1,"addr":"server-2:7080","leader":true,"lastUpdate":"1516890868"},"3":{"id":"3","groupId":1,"addr":"server-1:7080"}},"tablets":{"_predicate_":{"groupId":1,"predicate":"_predicate_","space":"6737282"},"actor.film":{"groupId":1,"predicate":"actor.film","space":"154599"},"director.film":{"groupId":1,"predicate":"director.film","space":"9785"},"genre":{"groupId":1,"predicate":"genre","space":"24307"},"initial_release_date":{"groupId":1,"predicate":"initial_release_date","space":"25351"},"name":{"groupId":1,"predicate":"name","space":"6304250"},"performance.actor":{"groupId":1,"predicate":"performance.actor","space":"189068"},"performance.character":{"groupId":1,"predicate":"performance.character","space":"206256"},"performance.film":{"groupId":1,"predicate":"performance.film","space":"184814"},"starring":{"groupId":1,"predicate":"starring","space":"75475"}}}},"zeros":{"1":{"id":"1","addr":"zero-1:5080","leader":true},"2":{"id":"2","addr":"zero-2:5080"},"3":{"id":"3","addr":"zero-3:5080"}},"maxLeaseId":"1010000","maxTxnTs":"10000","maxRaftId":"2172"}

More logs: I stopped the Docker service on the host running server-3/zero-3, purged the data in the volume, removed server-3 via /removeNode, removed the whole stack (docker stack rm dgraph), added zero-4 and server-4 to the docker-compose file (placed on the same host where server-3/zero-3 used to be), started Docker on the third host and deployed the stack again. In this situation all server nodes crash constantly, even the healthy ones:

dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 2018/01/26 08:37:50 gRPC server started.  Listening on port 9080
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 2018/01/26 08:37:50 HTTP server started.  Listening on port 8080
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 2018/01/26 08:37:50 groups.go:86: Current Raft Id: 2
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 2018/01/26 08:37:50 worker.go:99: Worker listening at address: [::]:7080
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 2018/01/26 08:37:50 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 2018/01/26 08:37:50 groups.go:109: Connected to group zero. Connection state: member:<id:2 addr:"server-1:7080" > state:<counter:432 groups:<key:1 value:<members:<key:2 value:<id:2 group_id:1 addr:"server-1:7080" > > members:<key:3 value:<id:3 group_id:1 addr:"server-2:7080" leader:true last_update:1516955215 > > members:<key:32 value:<id:32 group_id:1 addr:"server-3:7080" > > tablets:<key:"_predicate_" value:<group_id:1 predicate:"_predicate_" space:6737282 > > tablets:<key:"actor.film" value:<group_id:1 predicate:"actor.film" space:154599 > > tablets:<key:"director.film" value:<group_id:1 predicate:"director.film" space:32420 > > tablets:<key:"genre" value:<group_id:1 predicate:"genre" space:35893 > > tablets:<key:"initial_release_date" value:<group_id:1 predicate:"initial_release_date" space:31594 > > tablets:<key:"name" value:<group_id:1 predicate:"name" space:9925639 > > tablets:<key:"performance.actor" value:<group_id:1 predicate:"performance.actor" space:189068 > > tablets:<key:"performance.character" value:<group_id:1 predicate:"performance.character" space:206256 > > tablets:<key:"performance.film" value:<group_id:1 predicate:"performance.film" space:184814 > > tablets:<key:"starring" value:<group_id:1 predicate:"starring" space:80342 > > > > groups:<key:2 value:<members:<key:59 value:<id:59 group_id:2 addr:"server-4:7080" > > > > zeros:<key:1 value:<id:1 addr:"zero-1:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"zero-2:5080" > > zeros:<key:3 value:<id:3 addr:"zero-4:5080" > > maxLeaseId:1010000 maxTxnTs:20000 maxRaftId:86 removed:<id:1 group_id:1 addr:"server-3:7080" last_update:1516954819 > > 
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | panic: runtime error: invalid memory address or nil pointer dereference
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1097b45]
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | goroutine 248 [running]:
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | github.com/dgraph-io/dgraph/worker.(*groupi).applyState(0xc420116000, 0xc4257c28c0)
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:245 +0x545
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | github.com/dgraph-io/dgraph/worker.StartRaftNodes(0xc4203d4010, 0x1)
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:111 +0x58a
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | created by github.com/dgraph-io/dgraph/dgraph/cmd/server.run
dgraph_server-1.1.twco76veqqbh@swarm-manager-1    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/dgraph/cmd/server/run.go:351 +0x82b

Is there a way to safely replace a zero node if it unexpectedly leaves the cluster forever? Should we back up the Raft logs to be able to restore the node? Do replacement nodes need new IDs/hostnames? Server nodes have a /removeNode endpoint, but zeros do not.

@nbnh

The /removeNode API should ideally work with Zero nodes by taking the group argument as 0. I’ll verify that it does. Still, Dgraph nodes shouldn’t crash. I’ll investigate this and get back to you.

Yeah, I can remove a zero node with group=0 (was that mentioned in the docs?), but the crash is still there.
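(For completeness, the removals go against zero’s HTTP endpoint, published as 6080 on swarm-manager-1 in the compose file above; the IDs below are placeholders for whatever /state reports for the dead members:)

# remove the dead server (its Raft id within group 1)
curl "http://swarm-manager-1:6080/removeNode?id=3&group=1"
# remove the dead zero (zeros are addressed as group 0)
curl "http://swarm-manager-1:6080/removeNode?id=3&group=0"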

Steps to reproduce: start the docker stack (using the yml from the first post), insert some data via dgraph live, index it, then shut down one host. In my case it was the host with the leader server and a non-leader zero. I am using KVM machines, so I shut them down via virsh destroy. Start the node without a network device so it won’t reconnect to the swarm automatically, remove the contents of the volume, remove the failed nodes from the cluster via /removeNode (both server and zero), then restart the node with the network device attached. Since we are still using the same stack, the server and zero are recreated with the same IDs and names as before. Zero works more or less fine, though there are some connection issues:

[root@swarm-manager-1 dgraph]# docker service logs -f dgraph_zero-3
dgraph_zero-3.1.jt698jgw0knq@swarm-manager-3    | Setting up grpc listener at: 0.0.0.0:5080
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | Setting up grpc listener at: 0.0.0.0:5080
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | Setting up http listener at: 0.0.0.0:6080
dgraph_zero-3.1.jt698jgw0knq@swarm-manager-3    | Setting up http listener at: 0.0.0.0:6080
dgraph_zero-3.1.jt698jgw0knq@swarm-manager-3    | 2018/01/29 08:03:05 node.go:258: Group 0 found 0 entries
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:02 node.go:258: Group 0 found 0 entries
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:02 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_zero-3.1.jt698jgw0knq@swarm-manager-3    | 2018/01/29 08:03:05 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_zero-3.1.jt698jgw0knq@swarm-manager-3    | 2018/01/29 08:03:05 raft.go:567: INFO: 3 became follower at term 0
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | Running Dgraph zero...
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:02 raft.go:567: INFO: 3 became follower at term 0
dgraph_zero-3.1.jt698jgw0knq@swarm-manager-3    | 2018/01/29 08:03:05 raft.go:315: INFO: newRaft 3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
dgraph_zero-3.1.jt698jgw0knq@swarm-manager-3    | 2018/01/29 08:03:05 raft.go:567: INFO: 3 became follower at term 1
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:02 raft.go:315: INFO: newRaft 3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:02 raft.go:567: INFO: 3 became follower at term 1
dgraph_zero-3.1.jt698jgw0knq@swarm-manager-3    | Running Dgraph zero...
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 raft.go:708: INFO: 3 [term: 1] received a MsgHeartbeat message with higher term from 1 [term: 2]
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 raft.go:567: INFO: 3 became follower at term 2
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 node.go:301: INFO: raft.node: 3 elected leader 1 at term 2
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 node.go:127: Setting conf state to nodes:1 
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 pool.go:118: == CONNECT ==> Setting server-3:7080
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 pool.go:118: == CONNECT ==> Setting zero-2:5080
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 node.go:127: Setting conf state to nodes:1 nodes:2 
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 node.go:127: Setting conf state to nodes:1 nodes:2 nodes:3 
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 pool.go:167: Echo error from server-2:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 pool.go:118: == CONNECT ==> Setting server-2:7080
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:12 pool.go:118: == CONNECT ==> Setting server-1:7080
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:48:22 pool.go:167: Echo error from server-2:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:52:42 oracle.go:84: purging below ts:11, len(o.commits):6, len(o.aborts):0
dgraph_zero-3.1.yl2568wir6xn@swarm-manager-3    | 2018/01/29 07:52:52 oracle.go:84: purging below ts:17, len(o.commits):5, len(o.aborts):0

The server node still has issues and crashes approximately every 30 seconds:

[root@swarm-manager-1 dgraph]# docker service logs -f dgraph_server-3
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 2018/01/29 08:00:32 gRPC server started.  Listening on port 9080
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 2018/01/29 07:58:21 gRPC server started.  Listening on port 9080
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:52 gRPC server started.  Listening on port 9080
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:52 HTTP server started.  Listening on port 8080
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 2018/01/29 08:00:32 HTTP server started.  Listening on port 8080
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 2018/01/29 07:58:21 HTTP server started.  Listening on port 8080
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 2018/01/29 07:58:21 groups.go:86: Current Raft Id: 1
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:52 groups.go:86: Current Raft Id: 0
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 2018/01/29 08:00:32 groups.go:86: Current Raft Id: 1
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 2018/01/29 08:00:32 worker.go:99: Worker listening at address: [::]:7080
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 2018/01/29 07:58:21 worker.go:99: Worker listening at address: [::]:7080
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:52 worker.go:99: Worker listening at address: [::]:7080
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:52 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 2018/01/29 08:00:32 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 2018/01/29 07:58:21 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 2018/01/29 07:58:21 groups.go:109: Connected to group zero. Connection state: member:<id:1 addr:"server-3:7080" > state:<counter:184 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"server-3:7080" > > members:<key:2 value:<id:2 group_id:1 addr:"server-1:7080" leader:true last_update:1517212541 > > members:<key:3 value:<id:3 group_id:1 addr:"server-2:7080" > > tablets:<key:"_predicate_" value:<group_id:1 predicate:"_predicate_" space:6330624 > > tablets:<key:"actor.film" value:<group_id:1 predicate:"actor.film" space:154599 > > tablets:<key:"director.film" value:<group_id:1 predicate:"director.film" space:40156 > > tablets:<key:"genre" value:<group_id:1 predicate:"genre" space:32397 > > tablets:<key:"initial_release_date" value:<group_id:1 predicate:"initial_release_date" space:41383 > > tablets:<key:"name" value:<group_id:1 predicate:"name" space:12832408 > > tablets:<key:"performance.actor" value:<group_id:1 predicate:"performance.actor" space:189068 > > tablets:<key:"performance.character" value:<group_id:1 predicate:"performance.character" space:206256 > > tablets:<key:"performance.film" value:<group_id:1 predicate:"performance.film" space:184814 > > tablets:<key:"starring" value:<group_id:1 predicate:"starring" space:36518 > > > > zeros:<key:1 value:<id:1 addr:"zero-1:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"zero-2:5080" > > maxLeaseId:1010000 maxTxnTs:10000 maxRaftId:3 removed:<id:1 group_id:1 addr:"server-3:7080" last_update:1517212098 > removed:<id:3 addr:"zero-3:5080" > > 
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:52 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 2018/01/29 08:00:32 groups.go:109: Connected to group zero. Connection state: member:<id:1 addr:"server-3:7080" > state:<counter:184 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"server-3:7080" > > members:<key:2 value:<id:2 group_id:1 addr:"server-1:7080" leader:true last_update:1517212541 > > members:<key:3 value:<id:3 group_id:1 addr:"server-2:7080" > > tablets:<key:"_predicate_" value:<group_id:1 predicate:"_predicate_" space:6330624 > > tablets:<key:"actor.film" value:<group_id:1 predicate:"actor.film" space:154599 > > tablets:<key:"director.film" value:<group_id:1 predicate:"director.film" space:40156 > > tablets:<key:"genre" value:<group_id:1 predicate:"genre" space:32397 > > tablets:<key:"initial_release_date" value:<group_id:1 predicate:"initial_release_date" space:41383 > > tablets:<key:"name" value:<group_id:1 predicate:"name" space:12832408 > > tablets:<key:"performance.actor" value:<group_id:1 predicate:"performance.actor" space:189068 > > tablets:<key:"performance.character" value:<group_id:1 predicate:"performance.character" space:206256 > > tablets:<key:"performance.film" value:<group_id:1 predicate:"performance.film" space:184814 > > tablets:<key:"starring" value:<group_id:1 predicate:"starring" space:36518 > > > > zeros:<key:1 value:<id:1 addr:"zero-1:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"zero-2:5080" > > maxLeaseId:1010000 maxTxnTs:10000 maxRaftId:3 removed:<id:1 group_id:1 addr:"server-3:7080" last_update:1517212098 > removed:<id:3 addr:"zero-3:5080" > > 
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | panic: runtime error: invalid memory address or nil pointer dereference
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | panic: runtime error: invalid memory address or nil pointer dereference
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:52 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:52 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1097b45]
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1097b45]
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:53 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | goroutine 197 [running]:
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | goroutine 175 [running]:
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:53 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:54 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | github.com/dgraph-io/dgraph/worker.(*groupi).applyState(0xc4200dc000, 0xc420065360)
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | github.com/dgraph-io/dgraph/worker.(*groupi).applyState(0xc4203de100, 0xc4203d4b90)
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:245 +0x545
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:56 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:245 +0x545
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | github.com/dgraph-io/dgraph/worker.StartRaftNodes(0xc4257b4040, 0x1)
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | github.com/dgraph-io/dgraph/worker.StartRaftNodes(0xc42000e028, 0x1)
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:09:59 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:10:05 groups.go:102: Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:111 +0x58a
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:111 +0x58a
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | created by github.com/dgraph-io/dgraph/dgraph/cmd/server.run
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 2018/01/29 08:10:18 Unable to join cluster via dgraphzero
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | created by github.com/dgraph-io/dgraph/dgraph/cmd/server.run
dgraph_server-3.1.ywqennysbgcc@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/dgraph/cmd/server/run.go:351 +0x82b
dgraph_server-3.1.xi5ekbsdzp2p@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/dgraph/cmd/server/run.go:351 +0x82b
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | github.com/dgraph-io/dgraph/x.Fatalf
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/x/error.go:103
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | github.com/dgraph-io/dgraph/worker.StartRaftNodes
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:107
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | runtime.goexit
dgraph_server-3.1.vks6pchojzd0@swarm-manager-3    | 	/usr/local/go/src/runtime/asm_amd64.s:2337

Please note that the output interleaves logs from more than one container instance at once (why?), and the log entries are not in order. I don’t see a reason why that is happening.

I’ve tried a master build (commit dgraph-io/dgraph@74bbd98, “Fix nil pointer exception on restart after removing a peer”), which somewhat fixes the problem. Both types of nodes are removed correctly, and blank nodes are able to join the cluster.

However, the cluster has issues with a hard reboot of a node (without removing it from the cluster or removing any data, just virsh destroy). After the crashed server starts back up, it keeps trying to start elections, and it takes way too long. Crashed server:

[root@swarm-manager-1 _data]# docker service logs -f dgraph_server-3
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 gRPC server started.  Listening on port 9080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 HTTP server started.  Listening on port 8080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 groups.go:86: Current Raft Id: 2
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 worker.go:99: Worker listening at address: [::]:7080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 groups.go:109: Connected to group zero. Connection state: member:<id:2 addr:"server-3:7080" > state:<counter:168 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"server-2:7080" leader:true last_update:1517218389 > > members:<key:2 value:<id:2 group_id:1 addr:"server-3:7080" > > members:<key:3 value:<id:3 group_id:1 addr:"server-1:7080" > > tablets:<key:"_predicate_" value:<group_id:1 predicate:"_predicate_" space:6737282 > > tablets:<key:"actor.film" value:<group_id:1 predicate:"actor.film" space:154599 > > tablets:<key:"director.film" value:<group_id:1 predicate:"director.film" space:21934 > > tablets:<key:"genre" value:<group_id:1 predicate:"genre" space:32397 > > tablets:<key:"initial_release_date" value:<group_id:1 predicate:"initial_release_date" space:29272 > > tablets:<key:"name" value:<group_id:1 predicate:"name" space:8016290 > > tablets:<key:"performance.actor" value:<group_id:1 predicate:"performance.actor" space:189068 > > tablets:<key:"performance.character" value:<group_id:1 predicate:"performance.character" space:206256 > > tablets:<key:"performance.film" value:<group_id:1 predicate:"performance.film" space:184814 > > tablets:<key:"starring" value:<group_id:1 predicate:"starring" space:80342 > > > > zeros:<key:1 value:<id:1 addr:"zero-1:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"zero-2:5080" > > zeros:<key:3 value:<id:3 addr:"zero-3:5080" > > maxLeaseId:1040000 maxTxnTs:10000 maxRaftId:3 > 
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting server-2:7080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting server-1:7080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting zero-2:5080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting zero-3:5080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 draft.go:139: Node ID: 2 with GroupID: 1
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 node.go:231: Found Snapshot, Metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:27 Term:5 XXX_unrecognized:[]}
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 node.go:246: Found hardstate: {Term:232 Vote:2 Commit:27 XXX_unrecognized:[]}
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 node.go:258: Group 1 found 0 entries
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 draft.go:657: Restarting node for group: 1
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 raft.go:567: INFO: 2 became follower at term 232
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:14 raft.go:315: INFO: newRaft 2 [peers: [1,2,3], term: 232, commit: 27, applied: 27, lastindex: 27, lastterm: 5]
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:17 raft.go:749: INFO: 2 is starting a new election at term 232
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:17 raft.go:580: INFO: 2 became candidate at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:17 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:17 raft.go:651: INFO: 2 [logterm: 5, index: 27] sent MsgVote request to 3 at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:17 raft.go:651: INFO: 2 [logterm: 5, index: 27] sent MsgVote request to 1 at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:19 raft.go:749: INFO: 2 is starting a new election at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:19 raft.go:580: INFO: 2 became candidate at term 234
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:19 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 234
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:19 raft.go:651: INFO: 2 [logterm: 5, index: 27] sent MsgVote request to 3 at term 234
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3    | 2018/01/29 09:48:19 raft.go:651: INFO: 2 [logterm: 5, index: 27] sent MsgVote request to 1 at term 234
...
...
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:33 groups.go:669: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:34 groups.go:669: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:34 groups.go:453: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:35 groups.go:669: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:35 groups.go:453: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:36 raft.go:749: INFO: 2 is starting a new election at term 596
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:36 raft.go:580: INFO: 2 became candidate at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:36 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:36 raft.go:651: INFO: 2 [logterm: 298, index: 50] sent MsgVote request to 1 at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:36 raft.go:651: INFO: 2 [logterm: 298, index: 50] sent MsgVote request to 3 at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:36 groups.go:453: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:39 raft.go:772: INFO: 2 [logterm: 298, index: 50, vote: 2] rejected MsgVote from 3 [logterm: 298, index: 50] at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:39 raft.go:567: INFO: 2 became follower at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:39 node.go:301: INFO: raft.node: 2 elected leader 3 at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3    | 2018/01/29 10:08:45 raft.go:731: INFO: 2 [term: 597] ignored a MsgVote message with lower term from 1 [term: 299]

Server that stayed alive:

dgraph_server-1.1.maeljubio2qv@swarm-manager-1    | 2018/01/29 10:08:24 raft.go:692: INFO: 3 [logterm: 298, index: 50, vote: 3] ignored MsgVote from 2 [logterm: 298, index: 50] at term 298: lease is not expired (remaining ticks: 9)
dgraph_server-1.1.maeljubio2qv@swarm-manager-1    | 2018/01/29 10:08:27 raft.go:692: INFO: 3 [logterm: 298, index: 50, vote: 3] ignored MsgVote from 2 [logterm: 298, index: 50] at term 298: lease is not expired (remaining ticks: 30)
dgraph_server-1.1.maeljubio2qv@swarm-manager-1    | 2018/01/29 10:08:30 raft.go:692: INFO: 3 [logterm: 298, index: 50, vote: 3] ignored MsgVote from 2 [logterm: 298, index: 50] at term 298: lease is not expired (remaining ticks: 11)
dgraph_server-1.1.maeljubio2qv@swarm-manager-1    | 2018/01/29 10:08:32 raft.go:692: INFO: 3 [logterm: 298, index: 50, vote: 3] ignored MsgVote from 2 [logterm: 298, index: 50] at term 298: lease is not expired (remaining ticks: 99)

dgraph_server_health_status on the debug/vars endpoint says 1 during the rejected votes, and the same after the election. Is there a proper way to keep track of a server’s state in terms of cluster membership (and data synchronization, if possible)?
The number of remaining ticks in the log of the surviving server looks almost random - is that how it is supposed to be? Is there a way to decrease the time before a crashed node can rejoin (decrease the lease)? It took about 15 minutes to get back into the cluster both times I hard-rebooted the node.
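(For reference, this is roughly how I’ve been reading that metric; the hostname and port come from the compose file above:)

# server expvar metrics, including dgraph_server_health_status
curl -s http://swarm-manager-1:8080/debug/vars | grep dgraph_server_health_status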

I am investigating this and will update you once I have something. Did you have a look at http://<zero_ip>:<zero_http_port>/state for information about the cluster state?

I have been having a hard time trying to reproduce this. I have been restarting the docker-machine using docker-machine restart swarm-manager-<idx>. Every time, leader election happens within 20 seconds. Here is the docker-compose.yml that I have been using. My docker machines are running locally using the virtualbox driver.

version: "3"
networks:
  dgraph:
    external: true
services:
  zero-1:
    image: dgraph/dgraph:master
    command: dgraph zero --my=zero-1:5080 --replicas 3 --idx 1
    volumes:
      - data:/dgraph
    networks:
      dgraph:
    ports:
      - 5080:5080
      - 6080:6080
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-1

  zero-2:
    image: dgraph/dgraph:master
    command: dgraph zero --my=zero-2:5080 --replicas 3 --idx 2 --peer zero-1:5080
    volumes:
      - data:/dgraph
    networks:
      dgraph:
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-2

  zero-3:
    image: dgraph/dgraph:master
    command: dgraph zero --my=zero-3:5080 --replicas 3 --idx 3 --peer zero-1:5080
    volumes:
      - data:/dgraph
    networks:
      dgraph:
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-3

  server-1:
    image: dgraph/dgraph:master
    command: dgraph server --my=server-1:7080 --memory_mb=4096 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    networks:
      dgraph:
    ports:
      - 9080:9080
      - 8080:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-1

  server-2:
    image: dgraph/dgraph:master
    command: dgraph server --my=server-2:7080 --memory_mb=4096 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    networks:
      dgraph:
    ports:
      - 9081:9080
      - 8081:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-2

  server-3:
    image: dgraph/dgraph:master
    command: dgraph server --my=server-3:7080 --memory_mb=4096 --zero=zero-1:5080 --export=/dgraph/export
    networks:
      dgraph:
    volumes:
      - data:/dgraph
    ports:
      - 9082:9080
      - 8082:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-3

  ratel:
    image: dgraph/dgraph:master
    command: dgraph-ratel
    networks:
      dgraph:
    ports:
      - 18049:8081

volumes:
  data:

Reproducible with docker-machine. Tried with your docker-compose file and docker-machines as well.
docker-machine restart shuts down the virtual machine gracefully, stopping the docker service and the dgraph server - a leader is elected in about 20 seconds, just as you said. To reproduce the issue I am getting, you need to reset the host - I did it by opening the VirtualBox GUI, right-clicking on the host and choosing ‘reset’. This simulates a sudden power loss, with no clean shutdown.
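(The same hard reset can also be done from the command line, assuming the VM name matches the docker-machine name:)

# hard-reset the VM - equivalent to pulling the power, no clean shutdown
VBoxManage controlvm swarm-manager-3 reset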

About the state: all nodes are present, there are no missing nodes in the state list. I can provide the state before and after the simulated power-off. Is there something I need to pay attention to? Do you need the output of debug/vars?

Thanks, I will try this.

You are right, nodes aren’t removed from the state automatically. Can you please create an issue on GitHub so we can work on improving the cluster monitoring metrics?

Cluster monitoring metrics · Issue #2069 · dgraph-io/dgraph - an issue about the metrics. Removing disconnected nodes from the list is not the best solution; maybe move them to a failed group or something like that?


Thanks. I’d suggest filing another issue about the restart after power loss as well and linking to this thread. It’s just easier to track issues on GitHub; they easily get lost here.

Server node cannot rejoin the cluster after host power loss · Issue #2080 · dgraph-io/dgraph - the reboot issue.

