What version of Dgraph are you using?
Dgraph Version
$ dgraph version
[Decoder]: Using assembly version of decoder
Page Size: 4096
Dgraph version : v20.11.0
Dgraph codename : tchalla
Dgraph SHA-256 : 8acb886b24556691d7d74929817a4ac7d9db76bb8b77de00f44650931a16b6ac
Commit SHA-1 : c4245ad55
Commit timestamp : 2020-12-16 15:55:40 +0530
Branch : HEAD
Go version : go1.15.5
jemalloc enabled : true
Have you tried reproducing the issue with the latest release?
No.
What is the hardware spec (RAM, OS)?
Kubernetes (GKE): 3 Alphas in 1 group, 11 cores and 30 GB RAM each, 512 Gi SSDs.
Steps to reproduce the issue (command/config used to run Dgraph).
A normal Helm install from the chart.
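For reference, the install was roughly along the lines below (release and namespace names are placeholders, and the chart values keys are from memory, so they may not match the chart verbatim):
$ helm repo add dgraph https://charts.dgraph.io
$ helm install dgraph dgraph/dgraph \
    --namespace dgraph \
    --set alpha.replicaCount=3 \
    --set alpha.persistence.size=512Gi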
I have a cluster (1 group) with a Raft member that has fallen far behind the rest of the Alpha group:
I found in the logs that the group leader is constantly sending it snapshots:
I0316 20:48:47.277307 19 log.go:34] Sending Snapshot Sent data of size 18 GiB
I0316 20:48:47.792421 19 snapshot.go:252] Streaming done. Waiting for ACK...
I0316 20:48:49.096330 19 snapshot.go:257] Received ACK with done: true
I0316 20:48:49.096360 19 snapshot.go:298] Stream snapshot: OK
I0316 20:48:50.153753 19 snapshot.go:292] Got StreamSnapshot request: context:<id:2 group:1 addr:"ADDR_REDACTED:7080" > index:94158883 read_ts:9278223
I0316 20:48:50.170605 19 snapshot.go:201] Waiting to reach timestamp: 92782239
I0316 20:48:50.573906 19 log.go:34] Sending Snapshot Sending batch of size: 17 MB.
...(and a whole new snapshot is sent)
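To confirm the loop, I have been counting the completed streams in the leader's logs (pod name redacted):
$ kubectl logs ALPHA_LEADER_POD_REDACTED | grep -c "Stream snapshot: OK"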
From the receiver's side:
I0316 20:48:49.095144 19 snapshot.go:119] Snapshot writes DONE. Sending ACK
I0316 20:48:49.095286 19 snapshot.go:126] Populated snapshot with data size: 18 GiB
I0316 20:48:49.109647 19 draft.go:1175] ---> Retrieve snapshot: OK.
I0316 20:48:49.109785 19 draft.go:1181] ---> SNAPSHOT: {Context:id:1 group:1 addr:"ADDR_REDACTED:7080" Index:94135563 ReadTs:92753749 Done:false Sinc
I0316 20:48:49.109988 19 draft.go:124] Operation completed with id: opSnapshot
W0316 20:48:49.110080 19 draft.go:1313] Raft.Ready took too long to process: Timer Total: 10m55.373s. Breakdown: [{disk 10m55.373s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
I0316 20:48:50.146996 19 log.go:34] 2 [commit: 94135563, lastindex: 94135563, lastterm: 101] starts to restore snapshot [index: 94158883, term: 101]
I0316 20:48:50.147263 19 log.go:34] log [committed=94135563, applied=94135563, unstable.offset=94135564, len(unstable.Entries)=0] starts to restore snapshot [index: 94158883, term: 101]
I0316 20:48:50.147419 19 log.go:34] 2 restored progress of 1 [next = 94158884, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0316 20:48:50.147465 19 log.go:34] 2 restored progress of 2 [next = 94158884, match = 94158883, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0316 20:48:50.147474 19 log.go:34] 2 restored progress of 3 [next = 94158884, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0316 20:48:50.147500 19 log.go:34] 2 [commit: 94158883] restored snapshot [index: 94158883, term: 101]
I0316 20:48:50.147671 19 draft.go:1151] Drain applyCh by reaching 93890631 before retrieving snapshot
I0316 20:48:50.147693 19 draft.go:1048] Drained 0 proposals
I0316 20:48:50.147727 19 draft.go:1169] ---> SNAPSHOT: {Context:id:1 group:1 addr:"ADDR_REDACTED:7080" Index:94158883 ReadTs:92782239 Done:false Sinc
I0316 20:48:50.147784 19 draft.go:180] Operation started with id: opSnapshot
This has been happening constantly since just after the divergence shown in the image above. The only warning message is Raft.Ready took too long to process, but that seems to be nothing more than a log message triggered by the snapshot taking longer than 500ms to process. Can anyone explain the semantics behind this? Even better would be advice on how to get my cluster to converge.
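In case it helps with triage, I can also share output from the HTTP endpoints below (addresses redacted). My understanding, which may be off, is that /health?all on an Alpha lists each group member along with its ongoing operations (such as opSnapshot), and that /state on Zero shows group membership and the group's snapshot timestamp:
$ curl -s http://ALPHA_REDACTED:8080/health?all
$ curl -s http://ZERO_REDACTED:6080/state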