Alpha group diverged, continually sends snapshots

What version of Dgraph are you using?

Dgraph Version
$ dgraph version
 
[Decoder]: Using assembly version of decoder
Page Size: 4096

Dgraph version   : v20.11.0
Dgraph codename  : tchalla
Dgraph SHA-256   : 8acb886b24556691d7d74929817a4ac7d9db76bb8b77de00f44650931a16b6ac
Commit SHA-1     : c4245ad55
Commit timestamp : 2020-12-16 15:55:40 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

Have you tried reproducing the issue with the latest release?

no

What is the hardware spec (RAM, OS)?

Kubernetes (GKE): 3 alphas, 1 group - 11 cores, 30 GB RAM each, 512 Gi SSDs.

Steps to reproduce the issue (command/config used to run Dgraph).

A normal helm install from the chart.
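
For reference, the install was nothing more exotic than the standard chart install, roughly along these lines (the release name is a placeholder, and any values overrides for resources/storage would go in a custom values file):

# Standard Dgraph chart install; release name is a placeholder.
helm repo add dgraph https://charts.dgraph.io
helm repo update
helm install my-release dgraph/dgraph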

I have a cluster (1 group) with a Raft member that was far behind the rest of the alpha group:

I found in the logs that the group leader is trying to send it snapshots constantly:

I0316 20:48:47.277307      19 log.go:34] Sending Snapshot Sent data of size 18 GiB
I0316 20:48:47.792421      19 snapshot.go:252] Streaming done. Waiting for ACK...
I0316 20:48:49.096330      19 snapshot.go:257] Received ACK with done: true
I0316 20:48:49.096360      19 snapshot.go:298] Stream snapshot: OK
I0316 20:48:50.153753      19 snapshot.go:292] Got StreamSnapshot request: context:<id:2 group:1 addr:"ADDR_REDACTED:7080" > index:94158883 read_ts:9278223
I0316 20:48:50.170605      19 snapshot.go:201] Waiting to reach timestamp: 92782239
I0316 20:48:50.573906      19 log.go:34] Sending Snapshot Sending batch of size: 17 MB.
...(and a whole new snapshot is sent)

And from the receiver's side:

I0316 20:48:49.095144      19 snapshot.go:119] Snapshot writes DONE. Sending ACK
I0316 20:48:49.095286      19 snapshot.go:126] Populated snapshot with data size: 18 GiB
I0316 20:48:49.109647      19 draft.go:1175] ---> Retrieve snapshot: OK.
I0316 20:48:49.109785      19 draft.go:1181] ---> SNAPSHOT: {Context:id:1 group:1 addr:"ADDR_REDACTED:7080"  Index:94135563 ReadTs:92753749 Done:false Sinc
I0316 20:48:49.109988      19 draft.go:124] Operation completed with id: opSnapshot
W0316 20:48:49.110080      19 draft.go:1313] Raft.Ready took too long to process: Timer Total: 10m55.373s. Breakdown: [{disk 10m55.373s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
I0316 20:48:50.146996      19 log.go:34] 2 [commit: 94135563, lastindex: 94135563, lastterm: 101] starts to restore snapshot [index: 94158883, term: 101]
I0316 20:48:50.147263      19 log.go:34] log [committed=94135563, applied=94135563, unstable.offset=94135564, len(unstable.Entries)=0] starts to restore snapshot [index: 94158883, term: 101]
I0316 20:48:50.147419      19 log.go:34] 2 restored progress of 1 [next = 94158884, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0316 20:48:50.147465      19 log.go:34] 2 restored progress of 2 [next = 94158884, match = 94158883, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0316 20:48:50.147474      19 log.go:34] 2 restored progress of 3 [next = 94158884, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0316 20:48:50.147500      19 log.go:34] 2 [commit: 94158883] restored snapshot [index: 94158883, term: 101]
I0316 20:48:50.147671      19 draft.go:1151] Drain applyCh by reaching 93890631 before retrieving snapshot
I0316 20:48:50.147693      19 draft.go:1048] Drained 0 proposals
I0316 20:48:50.147727      19 draft.go:1169] ---> SNAPSHOT: {Context:id:1 group:1 addr:"ADDR_REDACTED:7080"  Index:94158883 ReadTs:92782239 Done:false Sinc
I0316 20:48:50.147784      19 draft.go:180] Operation started with id: opSnapshot

This has been happening constantly since just after the divergence shown in the image above. The only warning message is Raft.Ready took too long to process, but that seems to be nothing more than a log message triggered by the snapshot taking longer than 500ms. Can anyone explain the semantics behind this? Even better would be advice on how to successfully converge my cluster.

What happens when you kill alpha-1 and restart it?

As it is shutting down, I get a lot of this in the logs for alpha-1:

W0317 19:18:33.926056      19 raft_server.go:239] Error while raft.Step from 0x1: raft: stopped. Closing RaftMessage stream.

I assume the snapshot will continue to fail after the restart; I'll let you know.

Um, kinda weird - the logs completely stop after the table loads (and the membership message from Zero), ending with this:

...membership message from Zero, it's rreeeallly long....
I0317 20:38:27.218024      19 pool.go:162] CONNECTING to ADDR_REDACTED:7080
I0317 20:38:27.453693      19 groups.go:159] Server is ready
I0317 20:38:27.453804      19 access_ee.go:389] ResetAcl closed
I0317 20:38:27.453813      19 access_ee.go:311] RefreshAcls closed
I0317 20:38:28.025124      19 log.go:34] raft.node: 2 elected leader 1 at term 101
...end of logs, it doesn't say anything else

See here - the Prometheus stats are not even being polled for it:


(If you curl the Prometheus stats, the Raft applied index metric does not even show up - as if it has not been registered yet.)
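
Concretely, the check I mean looks like this (assuming the alpha's default HTTP port 8080; the exact metric name may differ slightly by version):

# Pull the alpha's Prometheus endpoint and look for the Raft applied
# index metric - on this node the grep comes back empty.
curl -s localhost:8080/debug/prometheus_metrics | grep raft_applied_index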

I don't see either of the other peers rolling up a snapshot for it, either.

That node seems to be back to getting a constant stream of snapshots.

I0318 16:32:45.590937      19 draft.go:1169] ---> SNAPSHOT: {Context:id:1 group:1 addr:"ADDR_REDACTED:7080"  Index:97963566 ReadTs:98317853 Done:false SinceTs:98269910 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}. Group 1 from node id 0x1
I0318 16:32:45.591033      19 draft.go:180] Operation started with id: opSnapshot
I0318 16:32:47.537808      19 log.go:34] [2] [E] LOG Compact 5->6 (1, 1 -> 1 tables with 1 splits). [2370752 . 2370797 .] -> [2370804 .], took 2.266s
I0318 16:32:47.665330      19 log.go:34] [1] [E] LOG Compact 4->5 (1, 1 -> 1 tables with 1 splits). [2370759 . 2370722 .] -> [2370803 .], took 7.502s
I0318 16:32:49.909059      19 log.go:34] [2] [E] LOG Compact 5->6 (1, 1 -> 1 tables with 1 splits). [2370745 . 2370804 .] -> [2370812 .], took 2.371s
I0318 16:32:52.123233      19 log.go:34] [2] [E] LOG Compact 5->6 (1, 1 -> 1 tables with 1 splits). [2370602 . 2370812 .] -> [2370821 .], took 2.214s
I0318 16:32:52.681479      19 log.go:34] [1] [E] LOG Compact 5->6 (1, 1 -> 2 tables with 1 splits). [2364164 . 2349888 .] -> [2370823 2370824 .], took 2.139s
--
I0318 16:46:05.381449      19 log.go:34] [1] [E] LOG Compact 5->6 (1, 4 -> 4 tables with 2 splits). [2379826 . 2381349 2381198 2320302 2318146 .] -> [2381405 2381409 2381406 2381407 .], took 2.191s
I0318 16:46:07.902136      19 log.go:34] [3] [E] LOG Compact 4->5 (1, 1 -> 1 tables with 1 splits). [2381393 . 2381372 .] -> [2381445 .], took 4.263s
I0318 16:46:08.312027      19 log.go:34] [2] [E] LOG Compact 2->3 (1, 5 -> 6 tables with 2 splits). [2381408 . 2381425 2381367 2381369 2381365 2381368 .] -> [2381430 2381434 2381436 2381431 2381438 2381446 .], took 2.219s
I0318 16:46:10.033231      19 log.go:34] [0] [E] LOG Compact 5->6 (1, 11 -> 6 tables with 4 splits). [2375833 . 2368990 2368991 2320314 2318150 2297071 2285123 2285096 2278902 2236704 2197393 2378678 .] -> [2381455 2381456 2381441 2381440 2381453 2381454 .], took 3.23s
I0318 16:46:11.125481      19 log.go:34] [2] [E] LOG Compact 3->4 (1, 0 -> 1 tables with 1 splits). [2381446 . .] -> [2381471 .], took 2.813s
I0318 16:46:12.138428      19 snapshot.go:119] Snapshot writes DONE. Sending ACK
I0318 16:46:12.138550      19 snapshot.go:126] Populated snapshot with data size: 22 GiB
I0318 16:46:12.155827      19 draft.go:1175] ---> Retrieve snapshot: OK.
I0318 16:46:12.155874      19 draft.go:1181] ---> SNAPSHOT: {Context:id:1 group:1 addr:"ADDR_REDACTED:7080"  Index:97963566 ReadTs:98317853 Done:false SinceTs:98269910 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}. Group 1. DONE.
I0318 16:46:12.155978      19 draft.go:124] Operation completed with id: opSnapshot
W0318 16:46:12.156021      19 draft.go:1313] Raft.Ready took too long to process: Timer Total: 13m26.565s. Breakdown: [{disk 13m26.565s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
I0318 16:46:12.931363      19 log.go:34] [0] [E] LOG Compact 0->2 (5, 0 -> 5 tables with 1 splits). [2381396 2381404 2381428 2381444 2381452 . .] -> [2381459 2381461 2381464 2381470 2381481 .], took 2.898s
I0318 16:46:13.229421      19 log.go:34] 2 [commit: 97963566, lastindex: 97963566, lastterm: 103] starts to restore snapshot [index: 97995088, term: 103]
I0318 16:46:13.229447      19 log.go:34] log [committed=97963566, applied=97963566, unstable.offset=97963567, len(unstable.Entries)=0] starts to restore snapshot [index: 97995088, term: 103]
I0318 16:46:13.229470      19 log.go:34] 2 restored progress of 1 [next = 97995089, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0318 16:46:13.229479      19 log.go:34] 2 restored progress of 2 [next = 97995089, match = 97995088, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0318 16:46:13.229484      19 log.go:34] 2 restored progress of 3 [next = 97995089, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I0318 16:46:13.229491      19 log.go:34] 2 [commit: 97995088] restored snapshot [index: 97995088, term: 103]
I0318 16:46:13.229522      19 draft.go:1151] Drain applyCh by reaching 0 before retrieving snapshot
I0318 16:46:13.229527      19 draft.go:1048] Drained 0 proposals

Anything else I can do here, or is that peer in a bad state indefinitely? If I were on hardware I would remove the node and add it back with a new Raft index… but in Kubernetes the StatefulSet ordinal is used as the Raft index, and that would cause headaches if I removed one and it were blacklisted…
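
(To spell out what I mean by removing the node: something along the lines of Zero's /removeNode endpoint. The sketch below assumes Zero's default HTTP port 6080 and that this peer is id 2 in group 1, and the removed member's data directories would have to be wiped before it rejoins under a new ID - not something I've actually run here.)

# Hypothetical removal of this peer via Zero (defaults assumed:
# Zero HTTP on 6080, peer id 2, group 1).
curl "localhost:6080/removeNode?group=1&id=2"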

@ibrahim I’m a bit lost here. Can you help?

It seems like you're receiving a snapshot, and while the current snapshot is being received the leader keeps making progress, so it has to send the next snapshot as soon as the first one finishes.

@dmai what can we do here to prevent snapshots from being sent continuously?

TL;DR

It fixed itself over the weekend on alpha-1, and now alpha-0 is getting non-stop snapshot streams from the leader.

Some very confusing stuff happened over the weekend - I checked the Prometheus stats this morning, and the peer that was stuck behind… well, check out its applied index over time:

It caught up a lot, then didn't apply anything else for 7h, then caught up again (via snapshot, I'm sure) and now seems to be fully caught up.

However… now a different server is stuck at one Raft index (looking at all 3 over the same timespan):

The leader (alpha-2) has now sent 33 snapshots to alpha-0 in the last 6h.

The way to fix this is to increase the snapshot interval. The flag, as I see it on master, is --raft="snapshot-after=10000". You can increase this to 100K, 200K, or so. Basically, what you want is for the time it takes to send a snapshot to be less than the time it takes the leader to create a new one.
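
For example, something like this on the alpha (the value is only an example, and the exact spelling varies by release - this is the raft superflag from master, so check dgraph alpha --help on your version, since older releases expose it as a standalone flag):

# Example value only - combine with the alpha flags you already run with.
dgraph alpha --raft="snapshot-after=200000"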

Ah, gotcha - I will set that on our cluster. The logs say it takes 10-15m to send a snapshot, so I'll try setting it to an hour or so.

Thanks @mrjn - I calculated a Raft entry count that takes roughly 1h to get through, and the diverged server caught right up. I landed on 190k as the value - 19x the default. This seems like an important value to tune for a cluster that receives constant inserts; glad I know about it now.
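
The back-of-the-envelope math, for anyone doing the same (the entries/sec figure is just what falls out of my own numbers above, not something measured directly):

# ~190k Raft entries spread over the ~1h I wanted between snapshots
# works out to roughly 53 entries/sec on this cluster; in general:
#   snapshot-after ≈ entries_per_sec * desired_seconds_between_snapshots
entries_per_sec=53
desired_seconds=3600
echo $(( entries_per_sec * desired_seconds ))   # 190800, i.e. ~190k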


We'll be changing this from a number-based flag to a time-duration-based flag. That way, it will be easier to reason about.

Perfect - I was actually prepared to put in a duration and realized at the last minute that it was a Raft entry count, so that would feel better.
