Panic: tocommit(289) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?

We have a lot of events coming to Sentry with the following stack trace:

panic: tocommit(289) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?

goroutine 177 [running]:
log.Panicf(0x198155c, 0x5d, 0xc0005c6f60, 0x2, 0x2)
/usr/local/go/src/log/log.go:358 +0xc0
github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x260a270, 0x198155c, 0x5d, 0xc0005c6f60, 0x2, 0x2)
/tmp/go/src/github.com/dgraph-io/dgraph/x/log.go:40 +0x53
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0000d6150, 0x121)
/tmp/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/log.go:203 +0x131
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc0000bc000, 0x8, 0x2, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1324 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc0000bc000, 0x8, 0x2, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1269 +0x459
go.etcd.io/etcd/raft.(*raft).Step(0xc0000bc000, 0x8, 0x2, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)

I was able to reproduce it in the following way:

  • Run a Docker cluster with 1 Zero and 3 Alpha nodes (docker-compose.yml generated using the compose tool).
  • Run the live loader with the 1 million data set.
  • Delete the w directory of any follower.
  • Stop that follower (whose w directory has been deleted) using docker stop.
  • Start the same node using docker start.
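The steps above can be sketched as shell commands. Service names, ports, the data file name, and the WAL mount path are assumptions about a typical setup generated by the compose tool; adjust them to match your generated docker-compose.yml:

```shell
# Assumed: compose file defines 1 Zero (zero1) and 3 Alphas (alpha1..alpha3),
# with alpha2's write-ahead log mounted at ./data/alpha2/w.
docker-compose up -d

# Load the 1 million RDF data set (file name and ports are assumptions).
dgraph live -f 1million.rdf.gz -a localhost:9182 -z localhost:5180

rm -rf ./data/alpha2/w    # delete the follower's WAL out from under it
docker stop alpha2
docker start alpha2       # follower panics on the leader's next heartbeat
```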

The follower crashes immediately. Full logs are here:

alpha2    | [Sentry] 2020/04/21 16:10:55 Integration installed: ContextifyFrames
alpha2    | [Sentry] 2020/04/21 16:10:55 Integration installed: Environment
alpha2    | [Sentry] 2020/04/21 16:10:55 Integration installed: Modules
alpha2    | [Sentry] 2020/04/21 16:10:55 Integration installed: IgnoreErrors
alpha2    | [Decoder]: Using assembly version of decoder
alpha2    | [Sentry] 2020/04/21 16:10:55 Integration installed: ContextifyFrames
alpha2    | [Sentry] 2020/04/21 16:10:55 Integration installed: Environment
alpha2    | [Sentry] 2020/04/21 16:10:55 Integration installed: Modules
alpha2    | [Sentry] 2020/04/21 16:10:55 Integration installed: IgnoreErrors
alpha2    | I0421 16:10:55.768913      16 init.go:99] 
alpha2    | 
alpha2    | Dgraph version   : v20.03.0
alpha2    | Dgraph SHA-256   : 2e27c7655f26025ded8b481bb1c25d86b50fc01a34f77006fa8531e8ac8c0951
alpha2    | Commit SHA-1     : 147c8df9d
alpha2    | Commit timestamp : 2020-03-30 17:28:31 -0700
alpha2    | Branch           : HEAD
alpha2    | Go version       : go1.14.1
alpha2    | 
alpha2    | For Dgraph official documentation, visit https://docs.dgraph.io.
alpha2    | For discussions about Dgraph     , visit https://discuss.dgraph.io.
alpha2    | To say hi to the community       , visit https://dgraph.slack.com.
alpha2    | 
alpha2    | Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
alpha2    | Copyright 2015-2020 Dgraph Labs, Inc.
alpha2    | 
alpha2    | 
alpha2    | I0421 16:10:55.768934      16 run.go:606] x.Config: {PortOffset:102 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
alpha2    | I0421 16:10:55.768951      16 run.go:607] x.WorkerConfig: {ExportPath:export NumPendingProposals:256 Tracing:1 MyAddr:alpha2:7182 ZeroAddr:zero1:5180 RaftId:2 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2020-04-21 16:10:55.438493208 +0000 UTC m=+0.007668250 LudicrousMode:false}
alpha2    | I0421 16:10:55.768978      16 run.go:608] worker.Config: {PostingDir:p BadgerTables:mmap BadgerVlog:mmap BadgerKeyFile: WALDir:w MutationsMode:0 AuthToken: AllottedMemory:1024 HmacSecret:[] AccessJwtTtl:0s RefreshJwtTtl:0s AclRefreshInterval:0s}
alpha2    | I0421 16:10:55.771004      16 server_state.go:74] Setting Badger table load option: mmap
alpha2    | I0421 16:10:55.771019      16 server_state.go:86] Setting Badger value log load option: mmap
alpha2    | I0421 16:10:55.771029      16 server_state.go:131] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x280a270 Compression:0 EventLogging:true InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:10485760 MaxBfCacheSize:0 LoadBloomsOnOpen:true NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:1 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
alpha2    | I0421 16:10:55.776914      16 log.go:34] All 0 tables opened in 0s
alpha2    | I0421 16:10:55.778836      16 server_state.go:74] Setting Badger table load option: mmap
alpha2    | I0421 16:10:55.778844      16 server_state.go:86] Setting Badger value log load option: mmap
alpha2    | I0421 16:10:55.778848      16 server_state.go:154] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x280a270 Compression:0 EventLogging:true InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:1073741824 MaxBfCacheSize:0 LoadBloomsOnOpen:true NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:1 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
alpha2    | I0421 16:10:55.786771      16 log.go:34] All 1 tables opened in 0s
alpha2    | I0421 16:10:55.787615      16 log.go:34] Replaying file id: 0 at offset: 927
alpha2    | I0421 16:10:58.278243      16 log.go:34] Replay took: 2.490608726s
alpha2    | I0421 16:10:58.278440      16 log.go:34] Replaying file id: 1 at offset: 0
alpha2    | I0421 16:11:00.534367      16 log.go:34] Replay took: 2.255911723s
alpha2    | I0421 16:11:00.534463      16 log.go:34] Replaying file id: 2 at offset: 0
alpha2    | I0421 16:11:01.449896      16 log.go:34] Replay took: 915.419148ms
alpha2    | I0421 16:11:01.492524      16 groups.go:104] Current Raft Id: 0x2
alpha2    | I0421 16:11:01.492547      16 groups.go:627] No healthy Zero leader found. Trying to find a Zero leader...
alpha2    | I0421 16:11:01.492849      16 worker.go:96] Worker listening at address: [::]:7182
alpha2    | I0421 16:11:01.493626      16 run.go:477] Bringing up GraphQL HTTP API at 0.0.0.0:8182/graphql
alpha2    | I0421 16:11:01.493641      16 run.go:478] Bringing up GraphQL HTTP admin API at 0.0.0.0:8182/admin
alpha2    | I0421 16:11:01.493663      16 run.go:509] gRPC server started.  Listening on port 9182
alpha2    | I0421 16:11:01.493673      16 run.go:510] HTTP server started.  Listening on port 8182
alpha2    | I0421 16:11:01.493696      16 server.go:99] Starting telemetry data collection for alpha...
alpha2    | I0421 16:11:01.611482      16 pool.go:160] CONNECTING to zero1:5180
zero1     | I0421 16:11:01.612912      16 zero.go:417] Got connection request: cluster_info_only:true 
zero1     | I0421 16:11:01.613145      16 zero.go:435] Connected: cluster_info_only:true 
alpha2    | I0421 16:11:01.613761      16 groups.go:664] Found connection to leader: zero1:5180
zero1     | I0421 16:11:01.614001      16 zero.go:417] Got connection request: id:2 addr:"alpha2:7182" 
zero1     | I0421 16:11:01.614205      16 zero.go:544] Connected: id:2 addr:"alpha2:7182" 
alpha2    | I0421 16:11:01.614478      16 groups.go:131] Connected to group zero. Assigned group: 0
alpha2    | I0421 16:11:01.614487      16 groups.go:133] Raft Id after connection to Zero: 0x2
alpha2    | I0421 16:11:01.614520      16 pool.go:160] CONNECTING to alpha3:7183
alpha2    | I0421 16:11:01.614546      16 pool.go:160] CONNECTING to alpha1:7180
alpha2    | E0421 16:11:01.614759      16 storage.go:264] While seekEntry. Error: Unable to find raft entry
alpha2    | I0421 16:11:01.614780      16 pool.go:160] CONNECTING to alpha2:7182
alpha2    | I0421 16:11:01.615016      16 draft.go:175] Node ID: 0x2 with GroupID: 1
alpha2    | I0421 16:11:01.615068      16 node.go:145] Setting raft.Config to: &{ID:2 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc02ba7f5f0 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x280a270 DisableProposalForwarding:false}
alpha2    | I0421 16:11:01.615561      16 node.go:323] Group 1 found 1 entries
alpha2    | I0421 16:11:01.615576      16 draft.go:1467] Calling IsPeer
alpha2    | I0421 16:11:01.616056      16 draft.go:1472] Done with IsPeer call
alpha2    | I0421 16:11:01.616064      16 draft.go:1506] Restarting node for group: 1
alpha2    | I0421 16:11:01.616087      16 storage.go:261] Setting first index: 1
alpha2    | I0421 16:11:01.616117      16 log.go:34] 2 became follower at term 0
alpha2    | I0421 16:11:01.616161      16 log.go:34] newRaft 2 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
alpha2    | I0421 16:11:01.616188      16 draft.go:1526] Restart node complete
alpha2    | I0421 16:11:01.616200      16 draft.go:136] Operation started with id: opRollup
alpha2    | I0421 16:11:01.616220      16 groups.go:151] Server is ready
alpha2    | I0421 16:11:01.616303      16 draft.go:925] Found Raft progress: 0
alpha2    | I0421 16:11:01.616304      16 groups.go:775] Got address of a Zero leader: zero1:5180
alpha2    | I0421 16:11:01.616347      16 groups.go:788] Starting a new membership stream receive from zero1:5180.
alpha2    | I0421 16:11:01.617050      16 groups.go:805] Received first state update from Zero: counter:1136 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"alpha1:7180" > > members:<key:2 value:<id:2 group_id:1 addr:"alpha2:7182" > > members:<key:3 value:<id:3 group_id:1 addr:"alpha3:7183" leader:true last_update:1587485022 > > tablets:<key:"actor.film" value:<group_id:1 predicate:"actor.film" > > tablets:<key:"country" value:<group_id:1 predicate:"country" > > tablets:<key:"cut.note" value:<group_id:1 predicate:"cut.note" > > tablets:<key:"dgraph.graphql.schema" value:<group_id:1 predicate:"dgraph.graphql.schema" > > tablets:<key:"dgraph.type" value:<group_id:1 predicate:"dgraph.type" > > tablets:<key:"director.film" value:<group_id:1 predicate:"director.film" > > tablets:<key:"email" value:<group_id:1 predicate:"email" > > tablets:<key:"genre" value:<group_id:1 predicate:"genre" > > tablets:<key:"initial_release_date" value:<group_id:1 predicate:"initial_release_date" > > tablets:<key:"loc" value:<group_id:1 predicate:"loc" > > tablets:<key:"name" value:<group_id:1 predicate:"name" > > tablets:<key:"performance.actor" value:<group_id:1 predicate:"performance.actor" > > tablets:<key:"performance.character" value:<group_id:1 predicate:"performance.character" > > tablets:<key:"performance.character_note" value:<group_id:1 predicate:"performance.character_note" > > tablets:<key:"performance.film" value:<group_id:1 predicate:"performance.film" > > tablets:<key:"rated" value:<group_id:1 predicate:"rated" > > tablets:<key:"rating" value:<group_id:1 predicate:"rating" > > tablets:<key:"starring" value:<group_id:1 predicate:"starring" > > tablets:<key:"tagline" value:<group_id:1 predicate:"tagline" > > checksum:3938978339060400535 > > zeros:<key:1 value:<id:1 addr:"zero1:5180" leader:true > > maxLeaseId:18446055125927780484 maxTxnTs:10000 cid:"e238b93b-2cce-4a01-9559-3d97fd0fa929" license:<maxNodes:18446744073709551615 expiryTs:1590077024 enabled:true > 
alpha3    | W0421 16:11:01.916891      14 node.go:417] Unable to send message to peer: 0x2. Error: Unhealthy connection
alpha2    | I0421 16:11:02.616526      16 groups.go:183] Done informing Zero about the 19 tablets I have
alpha2    | I0421 16:11:02.616641      16 groups.go:216] Skipping initial schema upsert for predicate dgraph.type
alpha2    | I0421 16:11:02.616691      16 groups.go:216] Skipping initial schema upsert for predicate dgraph.graphql.schema
alpha2    | I0421 16:11:02.918568      16 log.go:34] 2 [term: 0] received a MsgHeartbeat message with higher term from 3 [term: 2]
alpha2    | I0421 16:11:02.918618      16 log.go:34] 2 became follower at term 2
alpha2    | 2020/04/21 16:11:02 tocommit(1626) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
alpha2    | panic: tocommit(1626) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
alpha2    | 
alpha2    | goroutine 310 [running]:
alpha2    | log.Panicf(0x197d97c, 0x5d, 0xc000447320, 0x2, 0x2)
alpha2    | 	/home/ashish/go/src/log/log.go:358 +0xc0
alpha2    | github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x280a270, 0x197d97c, 0x5d, 0xc000447320, 0x2, 0x2)
alpha2    | 	/home/ashish/projects/src/github.com/dgraph-io/dgraph/x/log.go:40 +0x53
alpha2    | go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc000162150, 0x65a)
alpha2    | 	/home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/log.go:203 +0x131
alpha2    | go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc032fe6000, 0x8, 0x2, 0x3, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
alpha2    | 	/home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1324 +0x54
alpha2    | go.etcd.io/etcd/raft.stepFollower(0xc032fe6000, 0x8, 0x2, 0x3, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
alpha2    | 	/home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1269 +0x459
alpha2    | go.etcd.io/etcd/raft.(*raft).Step(0xc032fe6000, 0x8, 0x2, 0x3, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
alpha2    | 	/home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:971 +0x138d
alpha2    | go.etcd.io/etcd/raft.(*node).run(0xc056eae780, 0xc032fe6000)
alpha2    | 	/home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:357 +0x10c8
alpha2    | created by go.etcd.io/etcd/raft.RestartNode
alpha2    | 	/home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:246 +0x31b
alpha2    | [Sentry] 2020/04/21 16:11:02 ModuleIntegration wasn't able to extract modules: module integration failed
alpha2    | [Sentry] 2020/04/21 16:11:02 Sending fatal event [996089576f5d485a86176f0e8c11debe] to sentry.io project: 1805390
alpha2    | [Sentry] 2020/04/21 16:11:04 Buffer flushed successfully.
alpha2 exited with code 1

Reason for crash:

  • After the restart, the follower initialises its w directory and the rest of its Raft state with default values (last index = 0, commit index = 0, etc.).
  • It rejoins the same Alpha group as a follower after receiving a MsgHeartbeat with a higher term.
  • The leader sends it periodic MsgHeartbeat messages carrying its commit index (the last commit index from before the w directory was deleted).
  • The follower sees that this commit index is larger than its own last index (0). etcd/raft treats that as evidence that the Raft log was corrupted, truncated, or lost, and panics.
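The check that fires is in etcd's raftLog.commitTo (raft/log.go:203 in the trace above). A minimal sketch of the same logic is below; the real implementation panics through its logger, whereas this sketch returns an error so the condition is easy to exercise, and the struct here is a simplification of etcd's raftLog:

```go
package main

import "fmt"

// raftLog is a simplified stand-in for etcd's raftLog: it tracks only the
// commit index and the index of the last entry actually present in the log.
type raftLog struct {
	committed uint64
	lastIndex uint64
}

// commitTo mirrors the safety check in etcd/raft: a follower may only
// advance its commit index up to entries it actually has. If the leader
// reports a commit index beyond lastIndex, the follower's log must have
// been corrupted, truncated, or lost.
func (l *raftLog) commitTo(tocommit uint64) error {
	if l.committed < tocommit {
		if l.lastIndex < tocommit {
			return fmt.Errorf(
				"tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex)
		}
		l.committed = tocommit
	}
	return nil
}

func main() {
	// Fresh WAL after deleting w: lastIndex is 0, but the leader's
	// heartbeat carries commit index 1626 (as in the logs above).
	l := &raftLog{committed: 0, lastIndex: 0}
	if err := l.commitTo(1626); err != nil {
		fmt.Println(err)
	}
}
```

With a wiped WAL the check always trips on the very first heartbeat, which is why the follower crashes immediately after rejoining.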

@LGalatin, @Shekar: one thing @mrjn suggested was that we should document all of these frequently occurring errors in one place for our users, similar to what we have done in Badger (https://github.com/dgraph-io/badger#frequently-asked-questions).


There is an off-by-one difference here: lastIndex is 0 in your reproduction.

That is not quite the same as the reported issue, where lastIndex was 1. The question remains: how do you end up in a situation where lastIndex is 1?