Unable to run Dgraph in a multi-node Kubernetes cluster

jzhu077 · January 21, 2018, 10:07pm

Environment:
The Kubernetes cluster consists of 1 master and 4 worker nodes. Each worker node has 8 CPUs and 30 GB of RAM.

Steps to reproduce:
kubectl create -f dgraph-ha.yaml

Expected result:
The Dgraph cluster is up and running.

Actual result:

kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
dgraph-0                        2/2       Running   0          15m
dgraph-1                        2/2       Running   0          14m
dgraph-2                        2/2       Running   0          14m

kubectl logs dgraph-0 server
++ hostname -f
+ dgraph server --my=dgraph-0.dgraph.default.svc.cluster.local:7080 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080
2018/01/21 21:49:23 groups.go:86: Current Raft Id: 0
2018/01/21 21:49:23 gRPC server started.  Listening on port 9080
2018/01/21 21:49:23 HTTP server started.  Listening on port 8080
2018/01/21 21:49:23 worker.go:99: Worker listening at address: [::]:7080
2018/01/21 21:49:23 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:5080
2018/01/21 21:49:26 groups.go:109: Connected to group zero. Connection state: member:<id:1 group_id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:7080" > state:<counter:4 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:7080" > > > > zeros:<key:1 value:<id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:5080" > > maxRaftId:1 > 
2018/01/21 21:49:26 draft.go:139: Node ID: 1 with GroupID: 1
2018/01/21 21:49:26 node.go:258: Group 1 found 0 entries
2018/01/21 21:49:26 draft.go:670: New Node for group: 1
2018/01/21 21:49:26 raft.go:567: INFO: 1 became follower at term 0
2018/01/21 21:49:26 raft.go:315: INFO: newRaft 1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018/01/21 21:49:26 raft.go:567: INFO: 1 became follower at term 1
2018/01/21 21:49:26 groups.go:292: Asking if I can serve tablet for: _predicate_
2018/01/21 21:49:26 node.go:127: Setting conf state to nodes:1 
2018/01/21 21:49:26 raft.go:749: INFO: 1 is starting a new election at term 1
2018/01/21 21:49:26 raft.go:580: INFO: 1 became candidate at term 2
2018/01/21 21:49:26 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 2
2018/01/21 21:49:26 raft.go:621: INFO: 1 became leader at term 2
2018/01/21 21:49:26 node.go:301: INFO: raft.node: 1 elected leader 1 at term 2
2018/01/21 21:49:26 mutation.go:155: Done schema update predicate:"_predicate_" value_type:STRING list:true 
2018/01/21 21:49:49 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:49:52 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:49:53 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:49:53 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:49:54 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:49:54 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:49:55 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:49:55 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:49:56 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:49:56 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:49:57 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:49:57 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:49:58 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:49:58 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:49:59 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:49:59 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:50:00 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:00 node.go:127: Setting conf state to nodes:1 nodes:2 
2018/01/21 21:50:00 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:50:00 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:50:07 pool.go:167: Echo error from dgraph-2.dgraph.default.svc.cluster.local:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 pool.go:167: Echo error from dgraph-2.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:07 node.go:127: Setting conf state to nodes:1 nodes:2 nodes:3 
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
more repetitive logs ...

kubectl logs dgraph-0 zero
++ hostname
+ [[ dgraph-0 =~ -([0-9]+)$ ]]
+ ordinal=0
+ idx=1
+ [[ 0 -eq 0 ]]
++ hostname -f
+ dgraph zero -o -2000 --replicas 3 --my=dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 1
Setting up grpc listener at: 0.0.0.0:5080
Setting up http listener at: 0.0.0.0:6080
2018/01/21 21:49:23 node.go:258: Group 0 found 0 entries
2018/01/21 21:49:23 raft.go:567: INFO: 1 became follower at term 0
2018/01/21 21:49:23 raft.go:315: INFO: newRaft 1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018/01/21 21:49:23 raft.go:567: INFO: 1 became follower at term 1
Running Dgraph zero...
2018/01/21 21:49:23 node.go:127: Setting conf state to nodes:1 
2018/01/21 21:49:23 zero.go:322: Got connection request: addr:"dgraph-0.dgraph.default.svc.cluster.local:7080" 
2018/01/21 21:49:23 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7080
2018/01/21 21:49:26 raft.go:749: INFO: 1 is starting a new election at term 1
2018/01/21 21:49:26 raft.go:580: INFO: 1 became candidate at term 2
2018/01/21 21:49:26 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 2
2018/01/21 21:49:26 raft.go:621: INFO: 1 became leader at term 2
2018/01/21 21:49:26 node.go:301: INFO: raft.node: 1 elected leader 1 at term 2
2018/01/21 21:49:26 zero.go:419: Connected
2018/01/21 21:49:49 pool.go:167: Echo error from dgraph-1.dgraph.default.svc.cluster.local:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:49:49 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:5080
2018/01/21 21:49:49 node.go:127: Setting conf state to nodes:1 nodes:2 
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 zero.go:322: Got connection request: addr:"dgraph-1.dgraph.default.svc.cluster.local:7080" 
2018/01/21 21:49:49 pool.go:167: Echo error from dgraph-1.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:49:49 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:7080
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:49 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:50 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:51 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:52 raft.go:793: WARN: 1 stepped down to follower since quorum is not active
2018/01/21 21:49:52 raft.go:567: INFO: 1 became follower at term 2
2018/01/21 21:49:52 node.go:307: INFO: raft.node: 1 lost leader 1 at term 2
2018/01/21 21:49:52 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/01/21 21:49:53 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/01/21 21:49:54 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/01/21 21:49:54 raft.go:749: INFO: 1 is starting a new election at term 2
2018/01/21 21:49:54 raft.go:580: INFO: 1 became candidate at term 3
2018/01/21 21:49:54 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 3
2018/01/21 21:49:54 raft.go:651: INFO: 1 [logterm: 2, index: 10] sent MsgVote request to 2 at term 3
2018/01/21 21:49:54 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:49:57 raft.go:749: INFO: 1 is starting a new election at term 3
2018/01/21 21:49:57 raft.go:580: INFO: 1 became candidate at term 4
2018/01/21 21:49:57 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 4
2018/01/21 21:49:57 raft.go:651: INFO: 1 [logterm: 2, index: 10] sent MsgVote request to 2 at term 4
2018/01/21 21:49:57 node.go:322: No healthy connection found to node Id: 2, err: Unhealthy connection
2018/01/21 21:50:00 raft.go:749: INFO: 1 is starting a new election at term 4
2018/01/21 21:50:00 raft.go:580: INFO: 1 became candidate at term 5
2018/01/21 21:50:00 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 5
2018/01/21 21:50:00 raft.go:651: INFO: 1 [logterm: 2, index: 10] sent MsgVote request to 2 at term 5
2018/01/21 21:50:00 raft.go:664: INFO: 1 received MsgVoteResp from 2 at term 5
2018/01/21 21:50:00 raft.go:1013: INFO: 1 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections
2018/01/21 21:50:00 raft.go:621: INFO: 1 became leader at term 5
2018/01/21 21:50:00 node.go:301: INFO: raft.node: 1 elected leader 1 at term 5
2018/01/21 21:50:00 zero.go:419: Connected
2018/01/21 21:50:07 pool.go:167: Echo error from dgraph-2.dgraph.default.svc.cluster.local:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 node.go:127: Setting conf state to nodes:1 nodes:2 nodes:3 
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 zero.go:322: Got connection request: addr:"dgraph-2.dgraph.default.svc.cluster.local:7080" 
2018/01/21 21:50:07 pool.go:167: Echo error from dgraph-2.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:07 zero.go:419: Connected
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
2018/01/21 21:50:07 node.go:322: No healthy connection found to node Id: 3, err: Unhealthy connection
more repetitive logs ...
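
For anyone reading the zero log above: the step-down at 21:49:52 is consistent with Raft's majority rule. Once the conf state becomes nodes:1 nodes:2, two votes are needed (the log itself prints [quorum:2]), and node 2 is not reachable yet. A minimal sketch of that arithmetic, assuming the usual majority formula floor(n/2) + 1; the quorum function name is only illustrative and is not part of Dgraph:

```shell
# Majority quorum for a Raft group with n voting members: floor(n/2) + 1.
# Illustrative helper only; not taken from Dgraph or etcd/raft.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Membership sizes seen in the log: nodes:1, then nodes:1 nodes:2, then three nodes.
for n in 1 2 3; do
  echo "members=$n quorum=$(quorum "$n")"
done
```

With one member the node can elect itself, which matches the election at term 2. With two members a single reachable node is below quorum, which matches "stepped down to follower since quorum is not active".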

jzhu077 · January 21, 2018, 10:07pm

More logs from the other zero and server containers:

kubectl logs dgraph-1 zero
++ hostname
+ [[ dgraph-1 =~ -([0-9]+)$ ]]
+ ordinal=1
+ idx=2
+ [[ 1 -eq 0 ]]
++ hostname -f
+ dgraph zero -o -2000 --replicas 3 --my=dgraph-1.dgraph.default.svc.cluster.local:5080 --peer dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 2
Setting up grpc listener at: 0.0.0.0:5080
Setting up http listener at: 0.0.0.0:6080
2018/01/21 21:49:49 node.go:258: Group 0 found 0 entries
2018/01/21 21:49:49 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:5080
Running Dgraph zero...
2018/01/21 21:49:49 raft.go:567: INFO: 2 became follower at term 0
2018/01/21 21:49:49 raft.go:315: INFO: newRaft 2 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018/01/21 21:49:49 raft.go:567: INFO: 2 became follower at term 1
2018/01/21 21:50:00 raft.go:708: INFO: 2 [term: 1] received a MsgVote message with higher term from 1 [term: 5]
2018/01/21 21:50:00 raft.go:567: INFO: 2 became follower at term 5
2018/01/21 21:50:00 raft.go:763: INFO: 2 [logterm: 0, index: 0, vote: 0] cast MsgVote for 1 [logterm: 2, index: 10] at term 5
2018/01/21 21:50:00 node.go:301: INFO: raft.node: 2 elected leader 1 at term 5
2018/01/21 21:50:00 node.go:127: Setting conf state to nodes:1 
2018/01/21 21:50:00 node.go:127: Setting conf state to nodes:1 nodes:2 
2018/01/21 21:50:00 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:00 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:07 pool.go:167: Echo error from dgraph-2.dgraph.default.svc.cluster.local:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 node.go:127: Setting conf state to nodes:1 nodes:2 nodes:3 
2018/01/21 21:50:07 pool.go:167: Echo error from dgraph-2.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:7080

kubectl logs dgraph-1 server
++ hostname -f
+ dgraph server --my=dgraph-1.dgraph.default.svc.cluster.local:7080 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080
2018/01/21 21:49:49 gRPC server started.  Listening on port 9080
2018/01/21 21:49:49 HTTP server started.  Listening on port 8080
2018/01/21 21:49:49 groups.go:86: Current Raft Id: 0
2018/01/21 21:49:49 worker.go:99: Worker listening at address: [::]:7080
2018/01/21 21:49:49 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:00 groups.go:109: Connected to group zero. Connection state: member:<id:2 group_id:1 addr:"dgraph-1.dgraph.default.svc.cluster.local:7080" > state:<counter:14 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:7080" leader:true last_update:1516571366 > > members:<key:2 value:<id:2 group_id:1 addr:"dgraph-1.dgraph.default.svc.cluster.local:7080" > > tablets:<key:"_predicate_" value:<group_id:1 predicate:"_predicate_" > > > > zeros:<key:1 value:<id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"dgraph-1.dgraph.default.svc.cluster.local:5080" > > maxRaftId:2 > 
2018/01/21 21:50:00 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:00 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:00 draft.go:139: Node ID: 2 with GroupID: 1
2018/01/21 21:50:00 node.go:258: Group 1 found 0 entries
2018/01/21 21:50:00 draft.go:670: New Node for group: 1
2018/01/21 21:50:00 draft.go:640: Calling JoinCluster
2018/01/21 21:50:00 draft.go:648: Done with JoinCluster call
2018/01/21 21:50:00 raft.go:567: INFO: 2 became follower at term 0
2018/01/21 21:50:00 raft.go:315: INFO: newRaft 2 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018/01/21 21:50:00 raft.go:567: INFO: 2 became follower at term 1
2018/01/21 21:50:00 raft.go:708: INFO: 2 [term: 1] received a MsgHeartbeat message with higher term from 1 [term: 2]
2018/01/21 21:50:00 raft.go:567: INFO: 2 became follower at term 2
2018/01/21 21:50:00 node.go:301: INFO: raft.node: 2 elected leader 1 at term 2
2018/01/21 21:50:00 node.go:127: Setting conf state to nodes:1 
2018/01/21 21:50:00 node.go:127: Setting conf state to nodes:1 nodes:2 
2018/01/21 21:50:00 mutation.go:155: Done schema update predicate:"_predicate_" value_type:STRING list:true 
2018/01/21 21:50:07 groups.go:665: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/21 21:50:07 pool.go:167: Echo error from dgraph-2.dgraph.default.svc.cluster.local:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 pool.go:167: Echo error from dgraph-2.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:07 node.go:127: Setting conf state to nodes:1 nodes:2 nodes:3

kubectl logs dgraph-2 zero
++ hostname
+ [[ dgraph-2 =~ -([0-9]+)$ ]]
+ ordinal=2
+ idx=3
+ [[ 2 -eq 0 ]]
++ hostname -f
+ dgraph zero -o -2000 --replicas 3 --my=dgraph-2.dgraph.default.svc.cluster.local:5080 --peer dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 3
Setting up grpc listener at: 0.0.0.0:5080
Setting up http listener at: 0.0.0.0:6080
2018/01/21 21:50:06 node.go:258: Group 0 found 0 entries
2018/01/21 21:50:06 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 raft.go:567: INFO: 3 became follower at term 0
2018/01/21 21:50:07 raft.go:315: INFO: newRaft 3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018/01/21 21:50:07 raft.go:567: INFO: 3 became follower at term 1
Running Dgraph zero...
2018/01/21 21:50:07 raft.go:1070: INFO: 3 no leader at term 1; dropping index reading msg
2018/01/21 21:50:07 zero.go:465: Error while creating proposals in stream Unknown cluster member
2018/01/21 21:50:17 raft.go:708: INFO: 3 [term: 1] received a MsgHeartbeat message with higher term from 1 [term: 5]
2018/01/21 21:50:17 raft.go:567: INFO: 3 became follower at term 5
2018/01/21 21:50:17 node.go:301: INFO: raft.node: 3 elected leader 1 at term 5
2018/01/21 21:50:17 node.go:127: Setting conf state to nodes:1 
2018/01/21 21:50:17 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:17 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:17 node.go:127: Setting conf state to nodes:1 nodes:2 
2018/01/21 21:50:17 node.go:127: Setting conf state to nodes:1 nodes:2 nodes:3 
2018/01/21 21:50:17 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:17 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:7080

kubectl logs dgraph-2 server
++ hostname -f
+ dgraph server --my=dgraph-2.dgraph.default.svc.cluster.local:7080 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 groups.go:86: Current Raft Id: 0
2018/01/21 21:50:07 gRPC server started.  Listening on port 9080
2018/01/21 21:50:07 HTTP server started.  Listening on port 8080
2018/01/21 21:50:07 worker.go:99: Worker listening at address: [::]:7080
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 groups.go:109: Connected to group zero. Connection state: member:<id:3 group_id:1 addr:"dgraph-2.dgraph.default.svc.cluster.local:7080" > state:<counter:21 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:7080" leader:true last_update:1516571366 > > members:<key:2 value:<id:2 group_id:1 addr:"dgraph-1.dgraph.default.svc.cluster.local:7080" > > members:<key:3 value:<id:3 group_id:1 addr:"dgraph-2.dgraph.default.svc.cluster.local:7080" > > tablets:<key:"_predicate_" value:<group_id:1 predicate:"_predicate_" > > > > zeros:<key:1 value:<id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"dgraph-1.dgraph.default.svc.cluster.local:5080" > > zeros:<key:3 value:<id:3 addr:"dgraph-2.dgraph.default.svc.cluster.local:5080" > > maxRaftId:3 > 
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 draft.go:139: Node ID: 3 with GroupID: 1
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:7080
2018/01/21 21:50:07 pool.go:118: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:5080
2018/01/21 21:50:07 node.go:258: Group 1 found 0 entries
2018/01/21 21:50:07 draft.go:670: New Node for group: 1
2018/01/21 21:50:07 draft.go:640: Calling JoinCluster
2018/01/21 21:50:07 draft.go:648: Done with JoinCluster call
2018/01/21 21:50:07 raft.go:567: INFO: 3 became follower at term 0
2018/01/21 21:50:07 raft.go:315: INFO: newRaft 3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018/01/21 21:50:07 raft.go:567: INFO: 3 became follower at term 1
2018/01/21 21:50:08 groups.go:449: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/21 21:50:17 raft.go:708: INFO: 3 [term: 1] received a MsgHeartbeat message with higher term from 1 [term: 2]
2018/01/21 21:50:17 raft.go:567: INFO: 3 became follower at term 2
2018/01/21 21:50:17 node.go:301: INFO: raft.node: 3 elected leader 1 at term 2
2018/01/21 21:50:17 node.go:127: Setting conf state to nodes:1 
2018/01/21 21:50:17 node.go:127: Setting conf state to nodes:1 nodes:2 
2018/01/21 21:50:17 node.go:127: Setting conf state to nodes:1 nodes:2 nodes:3 
2018/01/21 21:50:17 mutation.go:155: Done schema update predicate:"_predicate_" value_type:STRING list:true

jzhu077 · January 21, 2018, 10:10pm

Sorry, this is solved. I just needed to wait about 15 minutes for everything to set up properly. That was much longer than I expected, but it works.

pawan · January 21, 2018, 11:17pm

15 minutes is a very long time. It should be up in under 30 seconds. I will try this on a cluster configured like yours and see what is going on.

jzhu077 · January 21, 2018, 11:40pm

cheers.

When I tried to set up a more complex cluster (1 zero and 5 servers per node), I ran into the following errors:

++ hostname
+ [[ dgraph-0 =~ -([0-9]+)$ ]]
+ ordinal=0
+ idx=1
+ [[ 0 -eq 0 ]]
++ hostname -f
+ dgraph zero -o -2000 --replicas 3 --my=dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 1
Setting up grpc listener at: 0.0.0.0:5080
Setting up http listener at: 0.0.0.0:6080
2018/01/21 23:34:40 node.go:258: Group 0 found 0 entries
2018/01/21 23:34:40 raft.go:567: INFO: 1 became follower at term 0
2018/01/21 23:34:40 raft.go:315: INFO: newRaft 1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018/01/21 23:34:40 raft.go:567: INFO: 1 became follower at term 1
Running Dgraph zero...
2018/01/21 23:34:40 node.go:127: Setting conf state to nodes:1 
2018/01/21 23:34:41 zero.go:322: Got connection request: id:11 addr:"dgraph-0.dgraph.default.svc.cluster.local:7091" 
2018/01/21 23:34:41 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7091
2018/01/21 23:34:41 zero.go:322: Got connection request: id:12 addr:"dgraph-0.dgraph.default.svc.cluster.local:7092" 
2018/01/21 23:34:41 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7092
2018/01/21 23:34:41 zero.go:322: Got connection request: id:13 addr:"dgraph-0.dgraph.default.svc.cluster.local:7093" 
2018/01/21 23:34:41 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7093
2018/01/21 23:34:41 zero.go:322: Got connection request: id:14 addr:"dgraph-0.dgraph.default.svc.cluster.local:7094" 
2018/01/21 23:34:41 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7094
2018/01/21 23:34:41 zero.go:322: Got connection request: id:15 addr:"dgraph-0.dgraph.default.svc.cluster.local:7095" 
2018/01/21 23:34:41 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7095
2018/01/21 23:34:43 raft.go:749: INFO: 1 is starting a new election at term 1
2018/01/21 23:34:43 raft.go:580: INFO: 1 became candidate at term 2
2018/01/21 23:34:43 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 2
2018/01/21 23:34:43 raft.go:621: INFO: 1 became leader at term 2
2018/01/21 23:34:43 node.go:301: INFO: raft.node: 1 elected leader 1 at term 2
2018/01/21 23:34:43 raft.go:531: While applying proposal: Invalid group proposal
2018/01/21 23:34:43 zero.go:412: Connected
2018/01/21 23:34:43 zero.go:419: Connected
2018/01/21 23:34:43 zero.go:419: Connected
2018/01/21 23:34:43 zero.go:419: Connected
2018/01/21 23:34:43 raft.go:531: While applying proposal: Invalid group proposal
2018/01/21 23:34:43 zero.go:412: Connected
2018/01/21 23:34:43 zero.go:322: Got connection request: id:14 addr:"dgraph-0.dgraph.default.svc.cluster.local:7094" 
2018/01/21 23:34:43 zero.go:322: Got connection request: id:15 addr:"dgraph-0.dgraph.default.svc.cluster.local:7095" 
2018/01/21 23:34:43 zero.go:419: Connected
2018/01/21 23:34:43 zero.go:419: Connected
2018/01/21 23:34:43 zero.go:322: Got connection request: id:11 addr:"dgraph-0.dgraph.default.svc.cluster.local:7091" 
2018/01/21 23:34:43 zero.go:419: Connected
2018/01/21 23:34:43 zero.go:322: Got connection request: id:12 addr:"dgraph-0.dgraph.default.svc.cluster.local:7092" 
2018/01/21 23:34:43 zero.go:419: Connected
2018/01/21 23:34:44 zero.go:322: Got connection request: id:13 addr:"dgraph-0.dgraph.default.svc.cluster.local:7093" 
2018/01/21 23:34:44 zero.go:419: Connected
2018/01/21 23:34:44 zero.go:322: Got connection request: id:14 addr:"dgraph-0.dgraph.default.svc.cluster.local:7094" 
2018/01/21 23:34:44 zero.go:419: Connected
2018/01/21 23:34:44 zero.go:322: Got connection request: id:15 addr:"dgraph-0.dgraph.default.svc.cluster.local:7095" 
2018/01/21 23:34:44 zero.go:419: Connected
2018/01/21 23:34:50 oracle.go:381: Error while fetching minTs from group 1, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:34:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:34:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:34:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:34:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:34:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:00 zero.go:322: Got connection request: id:11 addr:"dgraph-0.dgraph.default.svc.cluster.local:7091" 
2018/01/21 23:35:00 zero.go:419: Connected
2018/01/21 23:35:00 zero.go:322: Got connection request: id:12 addr:"dgraph-0.dgraph.default.svc.cluster.local:7092" 
2018/01/21 23:35:00 zero.go:419: Connected
2018/01/21 23:35:00 zero.go:322: Got connection request: id:13 addr:"dgraph-0.dgraph.default.svc.cluster.local:7093" 
2018/01/21 23:35:00 zero.go:419: Connected
2018/01/21 23:35:00 zero.go:322: Got connection request: id:14 addr:"dgraph-0.dgraph.default.svc.cluster.local:7094" 
2018/01/21 23:35:00 zero.go:419: Connected
2018/01/21 23:35:00 zero.go:322: Got connection request: id:15 addr:"dgraph-0.dgraph.default.svc.cluster.local:7095" 
2018/01/21 23:35:00 zero.go:419: Connected
2018/01/21 23:35:00 oracle.go:375: No healthy connection found to leader of group 1
2018/01/21 23:35:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:10 oracle.go:375: No healthy connection found to leader of group 1
2018/01/21 23:35:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:20 oracle.go:375: No healthy connection found to leader of group 1
2018/01/21 23:35:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:30 zero.go:322: Got connection request: id:11 addr:"dgraph-0.dgraph.default.svc.cluster.local:7091" 
2018/01/21 23:35:30 zero.go:419: Connected
2018/01/21 23:35:30 zero.go:322: Got connection request: id:12 addr:"dgraph-0.dgraph.default.svc.cluster.local:7092" 
2018/01/21 23:35:30 zero.go:419: Connected
2018/01/21 23:35:30 zero.go:322: Got connection request: id:13 addr:"dgraph-0.dgraph.default.svc.cluster.local:7093" 
2018/01/21 23:35:30 zero.go:419: Connected
2018/01/21 23:35:30 zero.go:322: Got connection request: id:14 addr:"dgraph-0.dgraph.default.svc.cluster.local:7094" 
2018/01/21 23:35:30 zero.go:419: Connected
2018/01/21 23:35:30 oracle.go:375: No healthy connection found to leader of group 2
2018/01/21 23:35:30 zero.go:322: Got connection request: id:15 addr:"dgraph-0.dgraph.default.svc.cluster.local:7095" 
2018/01/21 23:35:30 zero.go:419: Connected
2018/01/21 23:35:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:40 oracle.go:375: No healthy connection found to leader of group 1
2018/01/21 23:35:41 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:41 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:41 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:41 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:41 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:50 oracle.go:375: No healthy connection found to leader of group 1
2018/01/21 23:35:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:35:51 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:00 oracle.go:375: No healthy connection found to leader of group 1
2018/01/21 23:36:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:01 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:10 oracle.go:375: No healthy connection found to leader of group 2
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 zero.go:322: Got connection request: id:11 addr:"dgraph-0.dgraph.default.svc.cluster.local:7091" 
2018/01/21 23:36:11 zero.go:419: Connected
2018/01/21 23:36:11 zero.go:322: Got connection request: id:12 addr:"dgraph-0.dgraph.default.svc.cluster.local:7092" 
2018/01/21 23:36:11 zero.go:419: Connected
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 zero.go:322: Got connection request: id:13 addr:"dgraph-0.dgraph.default.svc.cluster.local:7093" 
2018/01/21 23:36:11 zero.go:419: Connected
2018/01/21 23:36:11 zero.go:322: Got connection request: id:14 addr:"dgraph-0.dgraph.default.svc.cluster.local:7094" 
2018/01/21 23:36:11 zero.go:419: Connected
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 zero.go:322: Got connection request: id:15 addr:"dgraph-0.dgraph.default.svc.cluster.local:7095" 
2018/01/21 23:36:11 zero.go:419: Connected
2018/01/21 23:36:20 oracle.go:375: No healthy connection found to leader of group 2
2018/01/21 23:36:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:21 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:30 oracle.go:375: No healthy connection found to leader of group 1
2018/01/21 23:36:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:31 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

kubectl logs dgraph-0 server-2
+ serverSize=5
+ offset=12
+ port=7092
++ hostname
+ [[ dgraph-0 =~ -([0-9]+)$ ]]
+ ordinal=0
+ ordinal=1
+ idx=12
+ num=2
++ hostname -f
+ dgraph server --my=dgraph-0.dgraph.default.svc.cluster.local:7092 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 12 --port_offset 12 --postings p2 --wal w2
2018/01/21 23:36:11 groups.go:86: Current Raft Id: 12
2018/01/21 23:36:11 gRPC server started.  Listening on port 9092
2018/01/21 23:36:11 HTTP server started.  Listening on port 8092
2018/01/21 23:36:11 worker.go:99: Worker listening at address: [::]:7092
2018/01/21 23:36:11 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:5080
2018/01/21 23:36:11 groups.go:109: Connected to group zero. Connection state: member:<id:12 addr:"dgraph-0.dgraph.default.svc.cluster.local:7092" > state:<counter:11 groups:<key:1 value:<members:<key:11 value:<id:11 group_id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:7091" > > members:<key:12 value:<id:12 group_id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:7092" > > members:<key:13 value:<id:13 group_id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:7093" > > > > groups:<key:2 value:<members:<key:14 value:<id:14 group_id:2 addr:"dgraph-0.dgraph.default.svc.cluster.local:7094" > > members:<key:15 value:<id:15 group_id:2 addr:"dgraph-0.dgraph.default.svc.cluster.local:7095" > > > > zeros:<key:1 value:<id:1 addr:"dgraph-0.dgraph.default.svc.cluster.local:5080" leader:true > > > 
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7094
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7091
2018/01/21 23:36:11 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7095
2018/01/21 23:36:11 pool.go:167: Echo error from dgraph-0.dgraph.default.svc.cluster.local:7093. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/21 23:36:11 pool.go:118: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:7093
2018/01/21 23:36:11 draft.go:139: Node ID: 12 with GroupID: 1
2018/01/21 23:36:11 node.go:258: Group 1 found 0 entries
2018/01/21 23:36:11 draft.go:670: New Node for group: 1
2018/01/21 23:36:11 Cannot retrieve snapshot from peer 13, no connection.  Error: Unhealthy connection

github.com/dgraph-io/dgraph/x.Fatalf
	/home/pawan/go/src/github.com/dgraph-io/dgraph/x/error.go:103
github.com/dgraph-io/dgraph/worker.(*node).retrieveSnapshot
	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/draft.go:421
github.com/dgraph-io/dgraph/worker.(*node).InitAndStartNode
	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/draft.go:675
github.com/dgraph-io/dgraph/worker.StartRaftNodes
	/home/pawan/go/src/github.com/dgraph-io/dgraph/worker/groups.go:120
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:2337

Are you able to point out what could be wrong with this setup?
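
The zero log above repeats the same "Echo error" line for every server port, which makes it hard to see which peers are affected and how often. A minimal sketch for condensing such a log is shown here; the function name `echo_error_summary` is hypothetical, and it only assumes the line format seen in the logs above ("Echo error from <address>. Err: ...").

```shell
#!/bin/sh
# Hypothetical helper: read a Dgraph log on stdin and print one line per
# peer address, followed by a tab and the number of "Echo error" lines
# that mention that address.
echo_error_summary() (
  grep -o 'Echo error from [^ ]*' \
    | sed -e 's/^Echo error from //' -e 's/\.$//' \
    | sort | uniq -c \
    | awk '{ print $2 "\t" $1 }'
)
```

It could be fed from a pod log, for example `kubectl logs dgraph-0 zero | echo_error_summary`. This only summarizes what is already in the log and does not diagnose the cause.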

jzhu077 · January 21, 2018, 11:41pm

The endpoint commands:
zero: dgraph zero -o -2000 --replicas 3 --my=dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 1

server-1: dgraph server --my=dgraph-0.dgraph.default.svc.cluster.local:7091 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 11 --port_offset 11 --postings p1 --wal w1

server-2: dgraph server --my=dgraph-0.dgraph.default.svc.cluster.local:7092 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 12 --port_offset 12 --postings p2 --wal w2

server-3: dgraph server --my=dgraph-0.dgraph.default.svc.cluster.local:7093 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 13 --port_offset 13 --postings p3 --wal w3

server-4: dgraph server --my=dgraph-0.dgraph.default.svc.cluster.local:7094 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 14 --port_offset 14 --postings p4 --wal w4

server-5: dgraph server --my=dgraph-0.dgraph.default.svc.cluster.local:7095 --memory_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080 --idx 15 --port_offset 15 --postings p5 --wal w5
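
For reference, the per-server flags in the commands above follow from the pod ordinal and the server number. The sketch below reproduces that arithmetic as it appears in the server-1 and server-2 container scripts in the YAML. The function name `server_flags` is hypothetical, and the generalization to servers 3 to 5 (adding the server number minus one to the ordinal) is an assumption that matches the dgraph-0 commands listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the derived flags for one server container.
# Usage: server_flags <pod_ordinal> <server_number 1..5>
server_flags() (
  pod_ordinal=$1
  server=$2
  server_size=5                        # servers per pod (serverSize in the scripts)
  offset=$((10 + $server))             # 11 for server-1, 12 for server-2, ...
  port=$((7080 + $offset))             # internal worker port used in --my
  ordinal=$(($pod_ordinal * $server_size + $server - 1))
  idx=$(($ordinal + 11))               # Raft ID passed as --idx
  num=$(($ordinal + 1))                # suffix for the p<num> and w<num> directories
  echo "--my=dgraph-${pod_ordinal}.dgraph.default.svc.cluster.local:${port} --idx ${idx} --port_offset ${offset} --postings p${num} --wal w${num}"
)
```

Under these assumptions, `server_flags 0 2` reproduces the server-2 flags shown above, while pod dgraph-1 would get --idx 16 through 20 with the same port offsets 11 through 15.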

The yaml file that I used to create the dgraph cluster:

apiVersion: v1
kind: Service
metadata:
  name: dgraph-public
  labels:
    app: dgraph
spec:
  type: LoadBalancer
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  - port: 6080
    targetPort: 6080
    name: zero-http
  - port: 8091
    targetPort: 8091
    name: server-1-http
  - port: 9091
    targetPort: 9091
    name: server-1-grpc
  - port: 9092
    targetPort: 9092
    name: server-2-grpc
  - port: 9093
    targetPort: 9093
    name: server-3-grpc
  - port: 9094
    targetPort: 9094
    name: server-4-grpc
  - port: 9095
    targetPort: 9095
    name: server-5-grpc
  
  - port: 8081
    targetPort: 8081
    name: ratel-http
  selector:
    app: dgraph
---
# This is a headless service, which is necessary for discovery in a StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph
  labels:
    app: dgraph
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: grpc

  clusterIP: None
  selector:
    app: dgraph
---
# This StatefulSet runs 3 replicas of Zero and Server.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph
spec:
  serviceName: "dgraph"
  replicas: 3
  template:
    metadata:
      labels:
        app: dgraph
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph
              topologyKey: kubernetes.io/hostname
      containers:
      - name: zero
        image: dgraph/dgraph:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5080
          name: intra-node
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            idx=$(($ordinal + 1))
            if [[ $ordinal -eq 0 ]]; then
              dgraph zero -o -2000 --replicas 3 --my=$(hostname -f):5080 --idx $idx
            else
              dgraph zero -o -2000 --replicas 3 --my=$(hostname -f):5080 --peer dgraph-0.dgraph.default.svc.cluster.local:5080 --idx $idx
            fi
      - name: server-1
        image: dgraph/dgraph:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            serverSize=5
            offset=11
            port=$(($offset + 7080))
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            ordinal=$(($ordinal * $serverSize))
            idx=$(($ordinal + 11))
            num=$(($ordinal + 1))
            dgraph server --my=$(hostname -f):$port \
              --memory_mb 2048 \
              --zero dgraph-0.dgraph.default.svc.cluster.local:5080 \
              --idx $idx \
              --port_offset $offset \
              --postings p$num \
              --wal w$num

      - name: server-2
        image: dgraph/dgraph:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            serverSize=5
            offset=12
            port=$(($offset + 7080))
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            ordinal=$(($ordinal * $serverSize + 1))
            idx=$(($ordinal + 11))
            num=$(($ordinal + 1))
            dgraph server --my=$(hostname -f):$port \
              --memory_mb 2048 \
              --zero dgraph-0.dgraph.default.svc.cluster.local:5080 \
              --idx $idx \
              --port_offset $offset \
              --postings p$num \
              --wal w$num
      - name: server-3
        image: dgraph/dgraph:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            serverSize=5
            offset=13
            port=$(($offset + 7080))
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            ordinal=$(($ordinal * $serverSize + 2))
            idx=$(($ordinal + 11))
            num=$(($ordinal + 1))
            dgraph server --my=$(hostname -f):$port \
              --memory_mb 2048 \
              --zero dgraph-0.dgraph.default.svc.cluster.local:5080 \
              --idx $idx \
              --port_offset $offset \
              --postings p$num \
              --wal w$num
      - name: server-4
        image: dgraph/dgraph:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            serverSize=5
            offset=14
            port=$(($offset + 7080))
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            ordinal=$(($ordinal * $serverSize + 3))
            idx=$(($ordinal + 11))
            num=$(($ordinal + 1))
            dgraph server --my=$(hostname -f):$port \
              --memory_mb 2048 \
              --zero dgraph-0.dgraph.default.svc.cluster.local:5080 \
              --idx $idx \
              --port_offset $offset \
              --postings p$num \
              --wal w$num
      - name: server-5
        image: dgraph/dgraph:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            serverSize=5
            offset=15
            port=$(($offset + 7080))
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            ordinal=$(($ordinal * $serverSize + 4))
            idx=$(($ordinal + 11))
            num=$(($ordinal + 1))
            dgraph server --my=$(hostname -f):$port \
              --memory_mb 2048 \
              --zero dgraph-0.dgraph.default.svc.cluster.local:5080 \
              --idx $idx \
              --port_offset $offset \
              --postings p$num \
              --wal w$num
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 1001Gi
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: dgraph-ratel
  labels:
    app: dgraph
spec:
  selector:
    matchLabels:
      app: dgraph
  template:
    metadata:
      labels:
        app: dgraph
    spec:
      containers:
      - name: ratel
        image: dgraph/dgraph:latest
        ports:
        - containerPort: 8081
        command:
          - dgraph-ratel

jzhu077 · January 22, 2018, 1:00am

It is strange. The cluster setup does not work if I run the Kubernetes deployment file as it is. However, if I comment out the setup steps for the other 4 servers (i.e. deploy only 1 zero and 1 server), it works. After that, I gradually uncommented the server setup steps one server at a time and applied the update each time (i.e. 1 zero + 2 servers, then 1 zero + 3 servers, and so on). That way I was able to deploy a dgraph cluster with 1 zero and 5 servers in a pod on every node in my Kubernetes cluster.

Any clue?

Also, assuming the above issue is solved, if I try to deploy a dgraph cluster with 3 zeros and 15 servers in 5 groups (i.e. replicas=3), how can I implement this in a Kubernetes cluster? I noticed that Kubernetes deploys 1 pod at a time. In my case, the 5 servers in the first pod automatically form group 1 (servers 1, 2 and 3) and part of group 2 (servers 4 and 5) on the first node.
This is not ideal: if the node (VM instance) running the pod crashes, I lose all 3 copies of my data. Do you have a suggestion for a better design?
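
For reference, the per-container arithmetic in the manifest above can be sketched as one small bash function. This is only an illustration of how the ordinal, --idx, directory suffix and port are derived. The function name server_ids and its 0-based slot argument are hypothetical and not part of the manifest.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the idx/num/port arithmetic of the server-1..server-5
# containers. "slot" is a hypothetical 0-based container index
# (server-1 -> 0, ..., server-5 -> 4).
server_ids() {
  local pod_ordinal=$1 slot=$2
  local server_size=5                                    # servers per pod
  local offset=$(( 11 + slot ))                          # value passed to --port_offset
  local ordinal=$(( pod_ordinal * server_size + slot ))  # cluster-wide server ordinal
  local idx=$(( ordinal + 11 ))                          # value passed to --idx
  local num=$(( ordinal + 1 ))                           # suffix of the p$num / w$num dirs
  local port=$(( 7080 + offset ))                        # port advertised in --my
  echo "idx=$idx num=$num port=$port"
}

# The five servers of pod dgraph-0 get consecutive ids and share one node.
for slot in 0 1 2 3 4; do
  server_ids 0 "$slot"
done
```

Servers 1 to 5 all come from pod dgraph-0, which is consistent with the observation that a complete replica group can end up on a single node.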

Cheers

pawan · January 22, 2018, 2:50am

I am going to try to reproduce this and see what is happening.

I think this is related to the above issue. Somehow the first server is not ready to answer requests when the other replicas boot up and try to get the latest snapshot. When you do it gradually, it's up.

I think what you want to do here is have different services for Dgraph Server and Zero. The Zero service can just have 3 replicas whereas Dgraph Server service can have 15 replicas. In that case, pods would be randomly distributed on all nodes. I will update the dgraph-ha.yaml to have this structure instead of running Zero and Server as part of the same pod as this can be extended easily.

pawan · January 23, 2018, 5:01am

Hey @jzhu077

The snapshot error was a bug which I have fixed and pushed to the dgraph/dgraph:test image. It will soon also be available on the dgraph/dgraph:master image.

Here is a config that I tested. It lets me run 3 Dgraph Zeros and 15 Dgraph Servers with 5x replication. The cluster is up in < 2 mins. Give it a try. Also, not sure if you know, but you can check the groups and the nodes in each group by going to http://zero_ip:6080/state. It should show three groups with 5 nodes in each group.

apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-public
  labels:
    app: dgraph-server
spec:
  type: LoadBalancer
  ports:
  - port: 9080
    targetPort: 9080
    name: server-grpc
  - port: 8080
    targetPort: 8080
    name: server-http
  selector:
    app: dgraph-server
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero-public
  labels:
    app: dgraph-zero
spec:
  type: LoadBalancer
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  - port: 6080
    targetPort: 6080
    name: zero-http
  selector:
    app: dgraph-zero
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-ratel
  labels:
    app: dgraph-ratel
spec:
  type: LoadBalancer
  ports:
  - port: 8081
    targetPort: 8081
    name: ratel-http
  selector:
    app: dgraph-ratel
---
# This is a headless service which is necessary for discovery for a dgraph-zero StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero
  labels:
    app: dgraph-zero
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc

  clusterIP: None
  selector:
    app: dgraph-zero
---
# This is a headless service which is necessary for discovery for a dgraph-server StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server
  labels:
    app: dgraph-server
spec:
  ports:
  - port: 7080
    targetPort: 7080
    name: grpc

  clusterIP: None
  selector:
    app: dgraph-server
---
# This StatefulSet runs 3 Dgraph Zeros.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph-zero
spec:
  serviceName: "dgraph-zero"
  replicas: 3
  template:
    metadata:
      labels:
        app: dgraph-zero
    spec:
      containers:
      - name: zero
        image: dgraph/dgraph:test
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5080
          name: intra-node
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            idx=$(($ordinal + 1))
            if [[ $ordinal -eq 0 ]]; then
              dgraph zero -o -2000 --my=$(hostname -f):5080 --idx $idx --replicas 5
            else
              dgraph zero -o -2000 --my=$(hostname -f):5080 --peer dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080 --idx $idx --replicas 5
            fi
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 5Gi

---
# This StatefulSet runs 15 replicas of Dgraph Server.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph-server
spec:
  serviceName: "dgraph-server"
  replicas: 15
  template:
    metadata:
      labels:
        app: dgraph-server
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-server
              topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: dgraph/dgraph:test
        imagePullPolicy: IfNotPresent
        ports:
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph server --my=$(hostname -f):7080 --memory_mb 2048 --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 5Gi
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: dgraph-ratel
  labels:
    app: dgraph-ratel
spec:
  selector:
    matchLabels:
      app: dgraph-ratel
  template:
    metadata:
      labels:
        app: dgraph-ratel
    spec:
      containers:
      - name: ratel
        image: dgraph/dgraph:test
        ports:
        - containerPort: 8081
        command:
          - dgraph-ratel

jzhu077 · January 24, 2018, 9:46pm

Great, that works. The only thing I needed to change was to create an extra headless service for dgraph-server:

apiVersion: v1
kind: Service
metadata:
  name: dgraph-server
  labels:
    app: dgraph-server
spec:
  ports:
  - port: 9090
    targetPort: 9090
    name: server-grpc

  clusterIP: None
  selector:
    app: dgraph-server

Otherwise I will hit a connection problem:

2018/01/24 22:16:52 pool.go:118: == CONNECT ==> Setting dgraph-server-14.dgraph-server.default.svc.cluster.local:7090
2018/01/24 22:16:53 pool.go:167: Echo error from dgraph-zero-3.dgraph-zero.default.svc.cluster.local:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

Since the StatefulSet automatically provisions the persistent volume, do you know a good way to back up the dgraph database and recover from the backup if something goes wrong?

pawan · January 24, 2018, 10:24pm

That’s strange. We already have a headless service for dgraph-server in the file that I shared. Does the error persist or go away?

I’ll have to read up on how to retrieve files from the persistent volume after taking an export.

jzhu077 · January 24, 2018, 10:34pm

I think it got overwritten somehow. It was deployed properly this time after I removed all related services.

That will be awesome, cheers.

sboorlagadda · January 24, 2018, 10:36pm

The manifest you shared above has a headless service exposing only 7080 and not 9080. There needs to be a headless service behind the load balancer for it to route the traffic. So we either have to add a new service exposing 9080 or change the existing one to expose all three (7080, 8080, 9080). Does that make sense?

sboorlagadda · January 24, 2018, 10:38pm

I think your initial error was showing port 9090 and not 9080. So that's probably why it was not able to connect.

pawan · January 24, 2018, 10:42pm

Are you sure about this? From what I understand, the headless service is used to create DNS addresses that other servers can talk to, so we only need them for the internal grpc port. External ports are taken care of by the public service.
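
As a rough illustration of the DNS addresses a headless service gives StatefulSet pods, the sketch below builds the hostname pattern seen in the logs and manifests in this thread. The helper name pod_fqdn is hypothetical, and it assumes the cluster domain is cluster.local, as it is in the logs here.

```shell
#!/usr/bin/env bash
# Sketch: build the DNS name a StatefulSet pod gets through its headless service.
# Assumes the cluster domain "cluster.local" (the one that appears in this thread).
pod_fqdn() {
  local statefulset=$1 ordinal=$2 service=$3 namespace=${4:-default}
  echo "${statefulset}-${ordinal}.${service}.${namespace}.svc.cluster.local"
}

pod_fqdn dgraph-zero 0 dgraph-zero        # dgraph-zero-0.dgraph-zero.default.svc.cluster.local
pod_fqdn dgraph-server 14 dgraph-server   # dgraph-server-14.dgraph-server.default.svc.cluster.local
```

These are the names passed to --my, --peer and --zero in the manifests, each followed by the internal port.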

jzhu077 · January 24, 2018, 10:57pm

I was using the server with a port offset of 10, so the ports become 7090, 8090 and 9090. I agree with @pawan on this,

but one thing that I don’t understand is why a headless service on 9090 (9080) instead of 7090 (7080) would also work in this case.
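
To make the offset arithmetic explicit, here is a minimal sketch that shifts the three default server ports seen in the startup log (7080 for the internal worker, 8080 for HTTP, 9080 for gRPC) by a given offset. The function name dgraph_server_ports and the output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: apply a port offset to the default Dgraph server ports.
dgraph_server_ports() {
  local offset=${1:-0}
  echo "internal=$(( 7080 + offset )) http=$(( 8080 + offset )) grpc=$(( 9080 + offset ))"
}

dgraph_server_ports 10   # internal=7090 http=8090 grpc=9090
```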

sboorlagadda · January 24, 2018, 11:13pm

You are right. I missed the public one. Thanks.

pawan · January 24, 2018, 11:22pm

From what I have been reading, it seems that a port needs to be defined and it can even be a dummy port (that is why 9090 always works). A fix was apparently merged but I am not sure it works.

References

github.com/kubernetes/kubernetes

Endpoints controller demands ports for headless services

opened 03:05PM - 15 Sep 16 UTC

closed 10:16PM - 29 Jun 17 UTC

thockin

sig/network help wanted

API validation explicitly allows headless services to not have ports, but we seem to have not updated endpoints controller to know this: https://github.com/kubernetes/kubernetes/blob/master/pkg/api/validation/validation.go#L2390 vs https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpoint/endpoints_controller.go#L389 Superficially this seems like an easy fix. May be a candidate for 1.4.x @sebgoa

sboorlagadda · January 28, 2018, 5:52am

I think the reason why it worked is because this internal headless service is never used.

Servers discover other servers through zero, and each server talks to its peers using just hostnames, which are resolved through DNS. Also, servers are not aware of this not-headless service, and all internal peer-to-peer gRPC messaging happens through IPs resolved from hostnames.

jzhu077 · January 29, 2018, 10:21pm

I am now seeing unexpected behaviour: running the same query repeatedly does not give the same output.

possible outputs:

1. : readTs: 83 less than minTs: 50673 for key: "\x00\x00\a___kind\x02\x02foo.bar"
2. : rpc error: code = Unknown desc = Got error: Schema not defined for predicate: name. while running: name:"eq" args:"foo.bar"
3. : rpc error: code = Unavailable desc = transport is closing
4. : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection

Outputs 3 and 4 are clearly related to some dgraph server pods that indefinitely log "groups.go:449: Unable to sync memberships. Error: rpc error: code = Unknown desc = Unknown cluster member".
Even allowing for 3 and 4, I would expect to see a consistent error message.
Do you have any clue to the cause of this behaviour?
