Alpha Raft.Ready took too long to process

What version of Dgraph are you using?

v21.12.0-2-gd93ac1a2d

Have you tried reproducing the issue with the latest release?

Not yet.

What is the hardware spec (RAM, OS)?

Docker compose on macOS 11.6

Steps to reproduce the issue (command/config used to run Dgraph).

Run a cluster with 1 zero and 3 alphas with this docker-compose.yml file:

version: "3.2"
networks:
  dgraph:
services:
  zero:
    image: comake/dgraph:latest
    volumes:
      - zero-data-volume:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == aws01
    command: dgraph zero --my=zero:5080 --replicas 3
  alpha1:
    image: comake/dgraph:latest
    hostname: "alpha1"
    volumes:
      - alpha1-data-volume:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - dgraph
    command: dgraph alpha --my=alpha1:7080 --zero=zero:5080 --expand-edge=true --security whitelist=172.16.0.0:172.30.0.0
  alpha2:
    image: comake/dgraph:latest
    hostname: "alpha2"
    volumes:
      - alpha2-data-volume:/dgraph
    ports
      - 8081:8081
      - 9081:9081
    networks:
      - dgraph
    command: dgraph alpha --my=alpha2:7081 --zero=zero:5080 --expand-edge=true --security whitelist=172.16.0.0:172.30.0.0 -o 1
  alpha3:
    image: comake/dgraph:latest
    hostname: "alpha3"
    volumes:
      - alpha3-data-volume:/dgraph
    ports:
      - 8082:8082
      - 9082:9082
    networks:
      - dgraph
    command: dgraph alpha --my=alpha3:7082 --zero=zero:5080 --expand-edge=true --security whitelist=172.16.0.0:172.30.0.0 -o 2
volumes:
  zero-data-volume:
  alpha1-data-volume:
  alpha2-data-volume:
  alpha3-data-volume:

Expected behaviour and actual result.

I expected dgraph to work normally.

Dgraph alphas repeatedly got stuck at Raft.Ready took too long to process for over 1 second. In one instance it took over 3 seconds.

dgraph-zero-1    | I0322 20:52:32.076535      14 raft.go:823] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0
dgraph-alpha1-1  | I0322 20:52:36.489580      12 draft.go:805] Creating snapshot at Index: 134, ReadTs: 10080
dgraph-alpha2-1  | I0322 20:52:36.489918      13 draft.go:805] Creating snapshot at Index: 134, ReadTs: 10080
dgraph-alpha3-1  | I0322 20:52:36.491595      13 draft.go:805] Creating snapshot at Index: 134, ReadTs: 10080
dgraph-alpha1-1  | I0322 20:52:43.741823      12 groups.go:493] Serving tablet for: 0-http://www.w3.org/ns/solid/terms#privateTypeIndex
dgraph-alpha1-1  | I0322 20:52:44.037664      12 log.go:34] Handover found 266 B data in current memtable. Pushing to flushChan.
dgraph-alpha2-1  | I0322 20:52:44.041503      13 log.go:34] Handover found 262 B data in current memtable. Pushing to flushChan.
dgraph-alpha3-1  | I0322 20:52:44.053374      13 log.go:34] Handover found 282 B data in current memtable. Pushing to flushChan.
dgraph-alpha1-1  | I0322 20:52:48.012686      12 log.go:34] 3 is starting a new election at term 3
dgraph-alpha1-1  | I0322 20:52:48.012742      12 log.go:34] 3 became pre-candidate at term 3
dgraph-alpha3-1  | I0322 20:52:48.223946      13 log.go:34] 1 is starting a new election at term 3
dgraph-alpha3-1  | I0322 20:52:48.224031      13 log.go:34] 1 became pre-candidate at term 3
dgraph-alpha3-1  | I0322 20:52:48.224176      13 log.go:34] 1 received MsgPreVoteResp from 1 at term 3
dgraph-alpha1-1  | I0322 20:52:48.418419      12 log.go:34] 3 received MsgPreVoteResp from 3 at term 3
dgraph-alpha3-1  | I0322 20:52:48.418859      13 log.go:34] 1 [logterm: 3, index: 181] sent MsgPreVote request to 3 at term 3
dgraph-alpha3-1  | I0322 20:52:48.444719      13 log.go:34] 1 [logterm: 3, index: 181] sent MsgPreVote request to 2 at term 3
dgraph-alpha3-1  | I0322 20:52:48.444766      13 log.go:34] raft.node: 1 lost leader 2 at term 3
dgraph-alpha1-1  | I0322 20:52:48.436991      12 log.go:34] 3 [logterm: 3, index: 181] sent MsgPreVote request to 1 at term 3
dgraph-alpha1-1  | I0322 20:52:48.444719      12 log.go:34] 3 [logterm: 3, index: 181] sent MsgPreVote request to 2 at term 3
dgraph-alpha1-1  | I0322 20:52:48.444755      12 log.go:34] raft.node: 3 lost leader 2 at term 3
dgraph-alpha3-1  | I0322 20:52:48.471103      13 log.go:34] 1 [logterm: 3, index: 181, vote: 2] cast MsgPreVote for 3 [logterm: 3, index: 181] at term 3
dgraph-alpha1-1  | I0322 20:52:48.490098      12 log.go:34] 3 [logterm: 3, index: 181, vote: 2] cast MsgPreVote for 1 [logterm: 3, index: 181] at term 3
dgraph-alpha1-1  | I0322 20:52:48.577424      12 log.go:34] 3 received MsgPreVoteResp from 1 at term 3
dgraph-alpha1-1  | I0322 20:52:48.577775      12 log.go:34] 3 [quorum:2] has received 2 MsgPreVoteResp votes and 0 vote rejections
dgraph-zero-1    | I0322 20:52:48.574144      14 zero.go:511] Got connection request: cluster_info_only:true 
dgraph-zero-1    | I0322 20:52:48.595320      14 zero.go:536] Connected: cluster_info_only:true 
dgraph-alpha3-1  | I0322 20:52:48.595214      13 log.go:34] 1 received MsgPreVoteResp from 3 at term 3
dgraph-alpha3-1  | I0322 20:52:48.595267      13 log.go:34] 1 [quorum:2] has received 2 MsgPreVoteResp votes and 0 vote rejections
dgraph-alpha3-1  | I0322 20:52:48.604807      13 log.go:34] 1 became candidate at term 4
dgraph-alpha3-1  | I0322 20:52:48.604859      13 log.go:34] 1 received MsgVoteResp from 1 at term 4
dgraph-alpha3-1  | I0322 20:52:48.604907      13 log.go:34] 1 [logterm: 3, index: 181] sent MsgVote request to 3 at term 4
dgraph-alpha3-1  | I0322 20:52:48.604952      13 log.go:34] 1 [logterm: 3, index: 181] sent MsgVote request to 2 at term 4
dgraph-alpha1-1  | E0322 20:52:48.595492      12 node.go:517] Error while calling IsPeer rpc error: code = DeadlineExceeded desc = context deadline exceeded. Reporting 2 as unreachable.
dgraph-alpha1-1  | I0322 20:52:48.626178      12 log.go:34] 3 became candidate at term 4
dgraph-alpha1-1  | I0322 20:52:48.626271      12 log.go:34] 3 received MsgVoteResp from 3 at term 4
dgraph-alpha1-1  | I0322 20:52:48.626318      12 log.go:34] 3 [logterm: 3, index: 181] sent MsgVote request to 2 at term 4
dgraph-alpha1-1  | I0322 20:52:48.629450      12 log.go:34] 3 [logterm: 3, index: 181] sent MsgVote request to 1 at term 4
dgraph-alpha1-1  | I0322 20:52:48.637545      12 log.go:34] 3 [logterm: 3, index: 181, vote: 3] rejected MsgVote from 1 [logterm: 3, index: 181] at term 4
dgraph-alpha3-1  | I0322 20:52:48.645391      13 log.go:34] 1 [logterm: 3, index: 181, vote: 1] rejected MsgVote from 3 [logterm: 3, index: 181] at term 4
dgraph-alpha3-1  | I0322 20:52:48.645485      13 log.go:34] 1 received MsgVoteResp rejection from 3 at term 4
dgraph-alpha3-1  | I0322 20:52:48.645520      13 log.go:34] 1 [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections
dgraph-alpha1-1  | W0322 20:52:48.644950      12 node.go:422] Unable to send message to peer: 0x2. Error: while calling IsPeer 2: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph-alpha1-1  | I0322 20:52:48.650438      12 log.go:34] 3 received MsgVoteResp rejection from 1 at term 4
dgraph-alpha1-1  | I0322 20:52:48.665445      12 log.go:34] 3 [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections
dgraph-alpha2-1  | W0322 20:52:48.909776      13 draft.go:1781] Raft.Ready took too long to process: Timer Total: 3.184s. Breakdown: [{disk 0s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
dgraph-alpha2-1  | I0322 20:52:49.075562      13 log.go:34] 2 [logterm: 3, index: 181, vote: 2] cast MsgPreVote for 1 [logterm: 3, index: 181] at term 3
dgraph-alpha2-1  | I0322 20:52:49.077132      13 log.go:34] 2 [term: 3] received a MsgVote message with higher term from 1 [term: 4]
dgraph-alpha2-1  | I0322 20:52:49.100430      13 log.go:34] 2 became follower at term 4
dgraph-alpha2-1  | I0322 20:52:49.100479      13 log.go:34] 2 [logterm: 3, index: 181, vote: 0] cast MsgVote for 1 [logterm: 3, index: 181] at term 4
dgraph-alpha2-1  | I0322 20:52:49.100535      13 log.go:34] raft.node: 2 lost leader 2 at term 4
dgraph-alpha1-1  | I0322 20:52:49.153830      12 log.go:34] 3 became follower at term 4
dgraph-alpha1-1  | I0322 20:52:49.154202      12 log.go:34] raft.node: 3 elected leader 1 at term 4
dgraph-alpha3-1  | I0322 20:52:49.150840      13 log.go:34] 1 received MsgVoteResp from 2 at term 4
dgraph-alpha3-1  | I0322 20:52:49.150961      13 log.go:34] 1 [quorum:2] has received 2 MsgVoteResp votes and 1 vote rejections
dgraph-alpha3-1  | I0322 20:52:49.151056      13 log.go:34] 1 became leader at term 4
dgraph-alpha3-1  | I0322 20:52:49.151649      13 log.go:34] raft.node: 1 elected leader 1 at term 4
dgraph-alpha2-1  | I0322 20:52:49.215061      13 log.go:34] raft.node: 2 elected leader 1 at term 4
dgraph-alpha2-1  | E0322 20:52:49.236851      13 node.go:517] Error while calling IsPeer rpc error: code = DeadlineExceeded desc = context deadline exceeded. Reporting 3 as unreachable.
dgraph-alpha2-1  | W0322 20:52:49.332143      13 node.go:422] Unable to send message to peer: 0x3. Error: while calling IsPeer 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph-alpha3-1  | I0322 20:52:49.367207      13 groups.go:953] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
dgraph-alpha3-1  | I0322 20:52:49.367444      13 groups.go:965] Got Zero leader: zero:5080
dgraph-alpha2-1  | E0322 20:52:49.571987      13 groups.go:1060] No longer the leader of group 1. Exiting
dgraph-alpha2-1  | E0322 20:52:49.605072      13 groups.go:997] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
dgraph-alpha3-1  | W0322 20:52:51.259433      13 draft.go:1781] Raft.Ready took too long to process: Timer Total: 1.035s. Breakdown: [{disk 1.035s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
dgraph-alpha1-1  | I0322 20:52:51.963806      12 groups.go:493] Serving tablet for: 0-http://www.w3.org/2011/http-headers#content-length
dgraph-alpha1-1  | I0322 20:52:52.315266      12 log.go:34] Handover found 264 B data in current memtable. Pushing to flushChan.
dgraph-alpha3-1  | I0322 20:52:52.822984      13 log.go:34] Handover found 264 B data in current memtable. Pushing to flushChan.
dgraph-alpha2-1  | I0322 20:52:52.836830      13 log.go:34] Handover found 264 B data in current memtable. Pushing to flushChan.
dgraph-alpha3-1  | I0322 20:52:53.232370      13 log.go:34] [1] [E] LOG Compact 0->6 (4, 1 -> 1 tables with 1 splits). [00039 00040 00041 00042 . 00038 .] -> [00045 .], took 6.539s
dgraph-alpha1-1  | W0322 20:52:55.502509      12 draft.go:1781] Raft.Ready took too long to process: Timer Total: 2.32s. Breakdown: [{proposals 1.844s} {advance 4ms} {disk 0s}] Num entries: 0. MustSync: false
dgraph-alpha2-1  | W0322 20:52:57.328892      13 draft.go:1781] Raft.Ready took too long to process: Timer Total: 1.004s. Breakdown: [{disk 1.004s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
dgraph-alpha1-1  | I0322 20:52:58.387178      12 log.go:34] [0] [E] LOG Compact 0->6 (5, 1 -> 1 tables with 1 splits). [00046 00047 00048 00049 00051 . 00045 .] -> [00052 .], took 2.778s
dgraph-alpha1-1  | I0322 20:53:13.911955      12 log.go:34] 3 is starting a new election at term 4
dgraph-alpha1-1  | I0322 20:53:13.912157      12 log.go:34] 3 became pre-candidate at term 4
dgraph-alpha1-1  | I0322 20:53:13.922116      12 log.go:34] 3 received MsgPreVoteResp from 3 at term 4
dgraph-alpha1-1  | I0322 20:53:13.926920      12 log.go:34] 3 [logterm: 4, index: 197] sent MsgPreVote request to 1 at term 4
dgraph-alpha1-1  | I0322 20:53:13.927050      12 log.go:34] 3 [logterm: 4, index: 197] sent MsgPreVote request to 2 at term 4
dgraph-alpha1-1  | I0322 20:53:13.927145      12 log.go:34] raft.node: 3 lost leader 1 at term 4
dgraph-alpha3-1  | I0322 20:53:13.911984      13 groups.go:1021] Zero leadership changed. Renewing oracle delta stream.
dgraph-alpha3-1  | E0322 20:53:13.925799      13 node.go:517] Error while calling IsPeer rpc error: code = DeadlineExceeded desc = context deadline exceeded. Reporting 3 as unreachable.
dgraph-alpha3-1  | W0322 20:53:13.926299      13 draft.go:1781] Raft.Ready took too long to process: Timer Total: 3.865s. Breakdown: [{disk 0s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
dgraph-alpha3-1  | E0322 20:53:13.930818      13 node.go:517] Error while calling IsPeer rpc error: code = DeadlineExceeded desc = context deadline exceeded. Reporting 2 as unreachable.
dgraph-alpha3-1  | E0322 20:53:13.931688      13 groups.go:997] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
dgraph-alpha3-1  | W0322 20:53:14.135077      13 node.go:422] Unable to send message to peer: 0x2. Error: while calling IsPeer 2: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph-alpha3-1  | W0322 20:53:14.136072      13 node.go:422] Unable to send message to peer: 0x3. Error: while calling IsPeer 3: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph-zero-1    | I0322 20:53:14.254650      14 zero.go:511] Got connection request: cluster_info_only:true 
dgraph-zero-1    | I0322 20:53:14.429961      14 zero.go:536] Connected: cluster_info_only:true 
dgraph-alpha3-1  | I0322 20:53:14.451064      13 groups.go:953] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
dgraph-alpha3-1  | I0322 20:53:14.459814      13 groups.go:965] Got Zero leader: zero:5080
dgraph-alpha3-1  | I0322 20:53:14.647648      13 log.go:34] 1 [logterm: 4, index: 197, vote: 1] cast MsgPreVote for 3 [logterm: 4, index: 197] at term 4
dgraph-alpha1-1  | I0322 20:53:15.381403      12 log.go:34] 3 became follower at term 4
dgraph-alpha1-1  | I0322 20:53:15.381489      12 log.go:34] raft.node: 3 elected leader 1 at term 4
dgraph-alpha2-1  | I0322 20:53:15.377043      13 log.go:34] 2 [logterm: 4, index: 198, vote: 1] rejected MsgPreVote from 3 [logterm: 4, index: 197] at term 4

The problem seems to go away if I remove alpha2 and alpha3 and the --replicas=3 from the zero.

Hmm this may actually be a problem with my image. Works fine with ebdm/unigraph-dgraph:latest instead of comake/dgraph:latest. Thought my image was the same but I guess not.

I guess it’s just a disk issue since it works with only one alpha.

Any one know how to debug disk usage?

Facing the same issue… But with only 1 Alpha and 1 Zero, both on the same machine. They loose connections and are unavailable for some time.

This one here brings down our whole infrastructure multiple times per day. Please help!

Unfortunately, Dgraph seems dead? @Dgraph_Admin

@adlerfaulkner Interested in this too! Did you find anything?

Look at my disk usage:

W0517 14:54:23.891146 17 draft.go:1781] Raft.Ready took too long to process: Timer Total: 34.896s. Breakdown: [{disk 34.896s}

I’ve got a nvme SSD. Not sure whats the issue here.