Zero cannot complete the cluster join process when the disk write speed is slow

Moved from GitHub dgraph/3471

Posted by chuerxiao:

If you suspect this could be a bug, follow the template.

  • What version of Dgraph are you using?
    v1.0.14

  • Have you tried reproducing the issue with latest release?
    Yes

  • What is the hardware spec (RAM, OS)?
    Machine: 3 VMs
    OS: CentOS 7.3
    Memory: 8GB
    Volume: Cloud disk

  • Steps to reproduce the issue (command/config used to run Dgraph).

The ‘/dgraph/data/’ volume using cloud disk



./dgraph zero --log_dir /dgraph/log -w /dgraph/data/zw -v=2 --telemetry=false --bindall=true --replicas 3 --idx 5

./dgraph zero --log_dir /dgraph/log -w /dgraph/data/zw -v=2 --telemetry=false --bindall=true --replicas 3 --idx 7 --peer 120.0.0.5:5080

./dgraph zero --log_dir /dgraph/log -w /dgraph/data/zw -v=2 --telemetry=false --bindall=true --replicas 3 --idx 6 --peer 120.0.0.5:5080
  • Expected behaviour and actual result.

E0509 16:17:38.145847      19 storage.go:213] While seekEntry. Error: Unable to find raft entry
I0509 16:17:38.237642      19 node.go:153] Setting raft.Config to: &{ID:5 peers:[] learners:[] ElectionTick:100 HeartbeatTick:1 Storage:0xc0000ac810 Applied:0 MaxSizePerMsg:1048576 MaxCommittedSizePerReady:0 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1fef9d0 DisableProposalForwarding:false}
I0509 16:17:38.238131      19 node.go:297] Group 0 found 1 entries
I0509 16:17:38.238195      19 storage.go:210] Setting first index: 1
I0509 16:17:38.238296      19 node.go:85] 5 became follower at term 0
I0509 16:17:38.238391      19 node.go:85] newRaft 5 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
I0509 16:17:38.238410      19 node.go:85] 5 became follower at term 1
I0509 16:17:38.238931      19 run.go:327] Running Dgraph Zero...
E0509 16:17:38.242050      19 raft.go:499] While proposing CID: Not Zero leader. Aborting proposal: cid:"0db02037-ec9c-4f32-a87c-8813ef128bc4" . Retrying...
I0509 16:17:38.288747      19 node.go:175] Setting conf state to nodes:5
I0509 16:17:38.288850      19 raft.go:669] Done applying conf change at 0x5
I0509 16:17:39.240739      19 node.go:85] 5 no leader at term 1; dropping index reading msg
I0509 16:17:40.519811      19 node.go:85] 5 is starting a new election at term 1
I0509 16:17:40.519851      19 node.go:85] 5 became pre-candidate at term 1
I0509 16:17:40.519875      19 node.go:85] 5 received MsgPreVoteResp from 5 at term 1
I0509 16:17:40.519913      19 node.go:85] 5 became candidate at term 2
I0509 16:17:40.519929      19 node.go:85] 5 received MsgVoteResp from 5 at term 2
I0509 16:17:40.519952      19 node.go:85] 5 became leader at term 2
I0509 16:17:40.519970      19 node.go:85] raft.node: 5 elected leader 5 at term 2
I0509 16:17:40.520052      19 raft.go:641] I've become the leader, updating leases.
I0509 16:17:40.520075      19 assign.go:42] Updated Lease id: 1. Txn Ts: 1
W0509 16:17:41.240772      19 node.go:636] [0x5] Read index context timed out
I0509 16:17:41.364297      19 raft.go:493] CID set for cluster: 98fc639a-70b2-42e3-a707-206fac4b3a6e
W0509 16:17:41.364313      19 raft.go:697] Raft.Ready took too long to process: 122ms. Most likely due to slow disk: 122ms. Num entries: 1. MustSync: true
I0509 16:17:49.357248      19 pool.go:138] CONNECTED to 120.0.0.7:5080
I0509 16:17:49.357274      19 node.go:553] Current confstate at 0x5: nodes:5
I0509 16:17:49.357392      19 node.go:552] Trying to add 0x7 to cluster. Addr: 120.0.0.7:5080
I0509 16:17:49.357545      19 node.go:553] Current confstate at 0x5: nodes:5
W0509 16:17:52.240768      19 node.go:636] [0x5] Read index context timed out
I0509 16:17:52.357801      19 node.go:552] Trying to add 0x7 to cluster. Addr: 120.0.0.7:5080
I0509 16:17:52.357839      19 node.go:553] Current confstate at 0x5: nodes:5
I0509 16:17:52.358016      19 node.go:85] propose conf Type:EntryConfChange Data:"\010\231\367\276\336\247\304\210\216\256\001\020\000\030\007\"\031\t\007\000\000\000\000\000\000\000\032\016120.0.0.7:5080"  ignored since pending unapplied configuration [index 4, applied 3]

At this point, node 7 has joined the cluster. However, node 5 does not get the status of raft return in time.

I0509 16:17:52.366177      19 node.go:175] Setting conf state to nodes:5 nodes:7
I0509 16:17:52.366262      19 raft.go:669] Done applying conf change at 0x5

W0509 16:17:52.366390      19 raft.go:697] Raft.Ready took too long to process: 3.009s. Most likely due to slow disk: 3.008s. Num entries: 1. MustSync: true
W0509 16:17:52.764421      19 raft.go:697] Raft.Ready took too long to process: 398ms. Most likely due to slow disk: 398ms. Num entries: 1. MustSync: true
W0509 16:17:52.780039      19 node.go:387] Unable to send message to peer: 0x7. Error: EOF
W0509 16:17:54.366953      19 node.go:636] [0x5] Read index context timed out
I0509 16:17:55.358073      19 node.go:552] Trying to add 0x7 to cluster. Addr: 120.0.0.7:5080
I0509 16:17:55.358109      19 node.go:553] Current confstate at 0x5: nodes:5 nodes:7
W0509 16:17:55.366911      19 raft.go:552] Zero node: 0x5 unable to reach quorum. Error: context deadline exceeded
W0509 16:17:56.367270      19 node.go:636] [0x5] Read index context timed out
I0509 16:17:58.358325      19 node.go:552] Trying to add 0x7 to cluster. Addr: 120.0.0.7:5080
I0509 16:17:58.358365      19 node.go:553] Current confstate at 0x5: nodes:5 nodes:7
W0509 16:17:58.367168      19 raft.go:552] Zero node: 0x5 unable to reach quorum. Error: context deadline exceeded
W0509 16:17:58.367439      19 node.go:636] [0x5] Read index context timed out
E0509 16:18:00.367562      19 node.go:621] Error while trying to call ReadIndex: context deadline exceeded
E0509 16:18:00.367608      19 node.go:674] [0x5] While trying to do lin read index: context deadline exceeded
I0509 16:18:01.358741      19 node.go:552] Trying to add 0x7 to cluster. Addr: 120.0.0.7:5080
I0509 16:18:01.358809      19 node.go:553] Current confstate at 0x5: nodes:5 nodes:7
W0509 16:18:01.367429      19 raft.go:552] Zero node: 0x5 unable to reach quorum. Error: context deadline exceeded
E0509 16:18:02.367867      19 node.go:621] Error while trying to call ReadIndex: context deadline exceeded
E0509 16:18:02.367911      19 node.go:674] [0x5] While trying to do lin read index: context deadline exceeded
I0509 16:18:04.359103      19 node.go:552] Trying to add 0x7 to cluster. Addr: 120.0.0.7:5080
I0509 16:18:04.359147      19 node.go:553] Current confstate at 0x5: nodes:5 nodes:7
W0509 16:18:04.367745      19 raft.go:552] Zero node: 0x5 unable to reach quorum. Error: context deadline exceeded
E0509 16:18:04.368115      19 node.go:621] Error while trying to call ReadIndex: context deadline exceeded
E0509 16:18:04.368147      19 node.go:674] [0x5] While trying to do lin read index: context deadline exceeded
I0509 16:18:05.019075      19 node.go:85] propose conf Type:EntryConfChange Data:"\010\335\350\223\224\315\261\271\330\r\020\000\030\007\"\031\t\007\000\000\000\000\000\000\000\032\016120.0.0.7:5080"  ignored since pending unapplied configuration [index 6, applied 5]
W0509 16:18:05.019139      19 raft.go:697] Raft.Ready took too long to process: 9.661s. Most likely due to slow disk: 9.661s. Num entries: 1. MustSync: true
W0509 16:18:06.368486      19 node.go:636] [0x5] Read index context timed out
W0509 16:18:06.625403      19 raft.go:697] Raft.Ready took too long to process: 1.606s. Most likely due to slow disk: 1.606s. Num entries: 1. MustSync: true
W0509 16:18:06.640065      19 node.go:387] Unable to send message to peer: 0x7. Error: EOF
I0509 16:18:07.359326      19 node.go:552] Trying to add 0x7 to cluster. Addr: 120.0.0.7:5080
I0509 16:18:10.844486      19 node.go:563] Current confstate at 0x5: nodes:5 nodes:7
I0509 16:18:10.844596      19 node.go:85] propose conf Type:EntryConfChange Data:"\010\203\313\315\266\372\221\342s\020\000\030\007\"\031\t\007\000\000\000\000\000\000\000\032\016120.0.0.7:5080"  ignored since pending unapplied configuration [index 6, applied 5]

chuerxiao commented :

Due to network reasons, sometimes the cloud disk has a slow read and write speed. The function AddToCluster may need to get the confState in the loop.

diff --git a/conn/node.go b/conn/node.go
index 7ed3317..d866c83 100644
--- a/conn/node.go
+++ b/conn/node.go
@@ -530,7 +530,7 @@ func (n *Node) proposeConfChange(ctx context.Context, pb raftpb.ConfChange) erro
                return errInternalRetry
        }
 }

 func (n *Node) AddToCluster(ctx context.Context, pid uint64) error {
        addr, ok := n.Peer(pid)
        x.AssertTruef(ok, "Unable to find conn pool for peer: %#x", pid)
@@ -548,11 +548,27 @@ func (n *Node) AddToCluster(ctx context.Context, pid uint64) error {
                Context: rcBytes,
        }
        err = errInternalRetry
+
        for err == errInternalRetry {
+               confState := n.ConfState()
+               for _, node := range confState.Nodes {
+                       if pid== node{
+                               glog.Warningf("Node %#x has joined the cluster\n",n.Id)
+                               err = nil
+                               break
+                       }
+               }
                glog.Infof("Trying to add %#x to cluster. Addr: %v\n", pid, addr)
-               glog.Infof("Current confstate at %#x: %+v\n", n.Id, n.ConfState())
+               glog.Infof("Current confstate at %#x: %+v\n", n.Id, confState)
                err = n.proposeConfChange(ctx, cc)
        }
+
        return err
 }

danielmai commented :

It sounds like the performance of this cloud disk or the network connecting these VMs isn’t that good. A Dgraph cluster can run with modest system requirements. e.g., A single AWS EC2 t2.small instance with default gp2 8GB EBS disk can run a six-node Dgraph cluster (3 Zeros, 3 Alphas).

chuerxiao commented :

Occasionally, the network has some delays. When a node joins the cluster during this time, it will enter the infinite loop of joining the cluster. The network is restored later, but this infinite loop will not end.

campoy commented :

Hey @danielmai,

Is this something we should try to fix? @chuerxiao can you confirm this is still reproducible?