We have lot of events coming to sentry with following stacktrace:
panic: tocommit(289) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?
goroutine 177 [running]:
log.Panicf(0x198155c, 0x5d, 0xc0005c6f60, 0x2, 0x2)
/usr/local/go/src/log/log.go:358 +0xc0
http://github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x260a270, 0x198155c, 0x5d, 0xc0005c6f60, 0x2, 0x2)
/tmp/go/src/github.com/dgraph-io/dgraph/x/log.go:40 +0x53
http://go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0000d6150, 0x121)
/tmp/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/log.go:203 +0x131
http://go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc0000bc000, 0x8, 0x2, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1324 +0x54
http://go.etcd.io/etcd/raft.stepFollower(0xc0000bc000, 0x8, 0x2, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1269 +0x459
http://go.etcd.io/etcd/raft.(*raft).Step(0xc0000bc000, 0x8, 0x2, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
I was able to reproduce it in following way:
- Run docker cluster with 1Z, 3A nodes(docker-compose.yml file generated using compose tool).
- Run live loader with 1M data set.
- Delete
w
directory of any follower. - Stop the follower (whose
w
directory has been deleted) usingdocker stop
command. - Start the same node using
docker start
command.
This follower crashes immediately. Full logs are here:
alpha2 | [Sentry] 2020/04/21 16:10:55 Integration installed: ContextifyFrames
alpha2 | [Sentry] 2020/04/21 16:10:55 Integration installed: Environment
alpha2 | [Sentry] 2020/04/21 16:10:55 Integration installed: Modules
alpha2 | [Sentry] 2020/04/21 16:10:55 Integration installed: IgnoreErrors
alpha2 | [Decoder]: Using assembly version of decoder
alpha2 | [Sentry] 2020/04/21 16:10:55 Integration installed: ContextifyFrames
alpha2 | [Sentry] 2020/04/21 16:10:55 Integration installed: Environment
alpha2 | [Sentry] 2020/04/21 16:10:55 Integration installed: Modules
alpha2 | [Sentry] 2020/04/21 16:10:55 Integration installed: IgnoreErrors
alpha2 | I0421 16:10:55.768913 16 init.go:99]
alpha2 |
alpha2 | Dgraph version : v20.03.0
alpha2 | Dgraph SHA-256 : 2e27c7655f26025ded8b481bb1c25d86b50fc01a34f77006fa8531e8ac8c0951
alpha2 | Commit SHA-1 : 147c8df9d
alpha2 | Commit timestamp : 2020-03-30 17:28:31 -0700
alpha2 | Branch : HEAD
alpha2 | Go version : go1.14.1
alpha2 |
alpha2 | For Dgraph official documentation, visit https://docs.dgraph.io.
alpha2 | For discussions about Dgraph , visit http://discuss.dgraph.io.
alpha2 | To say hi to the community , visit https://dgraph.slack.com.
alpha2 |
alpha2 | Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
alpha2 | Copyright 2015-2020 Dgraph Labs, Inc.
alpha2 |
alpha2 |
alpha2 | I0421 16:10:55.768934 16 run.go:606] x.Config: {PortOffset:102 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
alpha2 | I0421 16:10:55.768951 16 run.go:607] x.WorkerConfig: {ExportPath:export NumPendingProposals:256 Tracing:1 MyAddr:alpha2:7182 ZeroAddr:zero1:5180 RaftId:2 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2020-04-21 16:10:55.438493208 +0000 UTC m=+0.007668250 LudicrousMode:false}
alpha2 | I0421 16:10:55.768978 16 run.go:608] worker.Config: {PostingDir:p BadgerTables:mmap BadgerVlog:mmap BadgerKeyFile: WALDir:w MutationsMode:0 AuthToken: AllottedMemory:1024 HmacSecret:[] AccessJwtTtl:0s RefreshJwtTtl:0s AclRefreshInterval:0s}
alpha2 | I0421 16:10:55.771004 16 server_state.go:74] Setting Badger table load option: mmap
alpha2 | I0421 16:10:55.771019 16 server_state.go:86] Setting Badger value log load option: mmap
alpha2 | I0421 16:10:55.771029 16 server_state.go:131] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x280a270 Compression:0 EventLogging:true InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:10485760 MaxBfCacheSize:0 LoadBloomsOnOpen:true NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:1 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
alpha2 | I0421 16:10:55.776914 16 log.go:34] All 0 tables opened in 0s
alpha2 | I0421 16:10:55.778836 16 server_state.go:74] Setting Badger table load option: mmap
alpha2 | I0421 16:10:55.778844 16 server_state.go:86] Setting Badger value log load option: mmap
alpha2 | I0421 16:10:55.778848 16 server_state.go:154] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x280a270 Compression:0 EventLogging:true InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:1073741824 MaxBfCacheSize:0 LoadBloomsOnOpen:true NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:1 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
alpha2 | I0421 16:10:55.786771 16 log.go:34] All 1 tables opened in 0s
alpha2 | I0421 16:10:55.787615 16 log.go:34] Replaying file id: 0 at offset: 927
alpha2 | I0421 16:10:58.278243 16 log.go:34] Replay took: 2.490608726s
alpha2 | I0421 16:10:58.278440 16 log.go:34] Replaying file id: 1 at offset: 0
alpha2 | I0421 16:11:00.534367 16 log.go:34] Replay took: 2.255911723s
alpha2 | I0421 16:11:00.534463 16 log.go:34] Replaying file id: 2 at offset: 0
alpha2 | I0421 16:11:01.449896 16 log.go:34] Replay took: 915.419148ms
alpha2 | I0421 16:11:01.492524 16 groups.go:104] Current Raft Id: 0x2
alpha2 | I0421 16:11:01.492547 16 groups.go:627] No healthy Zero leader found. Trying to find a Zero leader...
alpha2 | I0421 16:11:01.492849 16 worker.go:96] Worker listening at address: [::]:7182
alpha2 | I0421 16:11:01.493626 16 run.go:477] Bringing up GraphQL HTTP API at 0.0.0.0:8182/graphql
alpha2 | I0421 16:11:01.493641 16 run.go:478] Bringing up GraphQL HTTP admin API at 0.0.0.0:8182/admin
alpha2 | I0421 16:11:01.493663 16 run.go:509] gRPC server started. Listening on port 9182
alpha2 | I0421 16:11:01.493673 16 run.go:510] HTTP server started. Listening on port 8182
alpha2 | I0421 16:11:01.493696 16 server.go:99] Starting telemetry data collection for alpha...
alpha2 | I0421 16:11:01.611482 16 pool.go:160] CONNECTING to zero1:5180
zero1 | I0421 16:11:01.612912 16 zero.go:417] Got connection request: cluster_info_only:true
zero1 | I0421 16:11:01.613145 16 zero.go:435] Connected: cluster_info_only:true
alpha2 | I0421 16:11:01.613761 16 groups.go:664] Found connection to leader: zero1:5180
zero1 | I0421 16:11:01.614001 16 zero.go:417] Got connection request: id:2 addr:"alpha2:7182"
zero1 | I0421 16:11:01.614205 16 zero.go:544] Connected: id:2 addr:"alpha2:7182"
alpha2 | I0421 16:11:01.614478 16 groups.go:131] Connected to group zero. Assigned group: 0
alpha2 | I0421 16:11:01.614487 16 groups.go:133] Raft Id after connection to Zero: 0x2
alpha2 | I0421 16:11:01.614520 16 pool.go:160] CONNECTING to alpha3:7183
alpha2 | I0421 16:11:01.614546 16 pool.go:160] CONNECTING to alpha1:7180
alpha2 | E0421 16:11:01.614759 16 storage.go:264] While seekEntry. Error: Unable to find raft entry
alpha2 | I0421 16:11:01.614780 16 pool.go:160] CONNECTING to alpha2:7182
alpha2 | I0421 16:11:01.615016 16 draft.go:175] Node ID: 0x2 with GroupID: 1
alpha2 | I0421 16:11:01.615068 16 node.go:145] Setting raft.Config to: &{ID:2 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc02ba7f5f0 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x280a270 DisableProposalForwarding:false}
alpha2 | I0421 16:11:01.615561 16 node.go:323] Group 1 found 1 entries
alpha2 | I0421 16:11:01.615576 16 draft.go:1467] Calling IsPeer
alpha2 | I0421 16:11:01.616056 16 draft.go:1472] Done with IsPeer call
alpha2 | I0421 16:11:01.616064 16 draft.go:1506] Restarting node for group: 1
alpha2 | I0421 16:11:01.616087 16 storage.go:261] Setting first index: 1
alpha2 | I0421 16:11:01.616117 16 log.go:34] 2 became follower at term 0
alpha2 | I0421 16:11:01.616161 16 log.go:34] newRaft 2 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
alpha2 | I0421 16:11:01.616188 16 draft.go:1526] Restart node complete
alpha2 | I0421 16:11:01.616200 16 draft.go:136] Operation started with id: opRollup
alpha2 | I0421 16:11:01.616220 16 groups.go:151] Server is ready
alpha2 | I0421 16:11:01.616303 16 draft.go:925] Found Raft progress: 0
alpha2 | I0421 16:11:01.616304 16 groups.go:775] Got address of a Zero leader: zero1:5180
alpha2 | I0421 16:11:01.616347 16 groups.go:788] Starting a new membership stream receive from zero1:5180.
alpha2 | I0421 16:11:01.617050 16 groups.go:805] Received first state update from Zero: counter:1136 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"alpha1:7180" > > members:<key:2 value:<id:2 group_id:1 addr:"alpha2:7182" > > members:<key:3 value:<id:3 group_id:1 addr:"alpha3:7183" leader:true last_update:1587485022 > > tablets:<key:"actor.film" value:<group_id:1 predicate:"actor.film" > > tablets:<key:"country" value:<group_id:1 predicate:"country" > > tablets:<key:"cut.note" value:<group_id:1 predicate:"cut.note" > > tablets:<key:"dgraph.graphql.schema" value:<group_id:1 predicate:"dgraph.graphql.schema" > > tablets:<key:"dgraph.type" value:<group_id:1 predicate:"dgraph.type" > > tablets:<key:"director.film" value:<group_id:1 predicate:"director.film" > > tablets:<key:"email" value:<group_id:1 predicate:"email" > > tablets:<key:"genre" value:<group_id:1 predicate:"genre" > > tablets:<key:"initial_release_date" value:<group_id:1 predicate:"initial_release_date" > > tablets:<key:"loc" value:<group_id:1 predicate:"loc" > > tablets:<key:"name" value:<group_id:1 predicate:"name" > > tablets:<key:"performance.actor" value:<group_id:1 predicate:"performance.actor" > > tablets:<key:"performance.character" value:<group_id:1 predicate:"performance.character" > > tablets:<key:"performance.character_note" value:<group_id:1 predicate:"performance.character_note" > > tablets:<key:"performance.film" value:<group_id:1 predicate:"performance.film" > > tablets:<key:"rated" value:<group_id:1 predicate:"rated" > > tablets:<key:"rating" value:<group_id:1 predicate:"rating" > > tablets:<key:"starring" value:<group_id:1 predicate:"starring" > > tablets:<key:"tagline" value:<group_id:1 predicate:"tagline" > > checksum:3938978339060400535 > > zeros:<key:1 value:<id:1 addr:"zero1:5180" leader:true > > maxLeaseId:18446055125927780484 maxTxnTs:10000 cid:"e238b93b-2cce-4a01-9559-3d97fd0fa929" license:<maxNodes:18446744073709551615 expiryTs:1590077024 enabled:true >
alpha3 | W0421 16:11:01.916891 14 node.go:417] Unable to send message to peer: 0x2. Error: Unhealthy connection
alpha2 | I0421 16:11:02.616526 16 groups.go:183] Done informing Zero about the 19 tablets I have
alpha2 | I0421 16:11:02.616641 16 groups.go:216] Skipping initial schema upsert for predicate dgraph.type
alpha2 | I0421 16:11:02.616691 16 groups.go:216] Skipping initial schema upsert for predicate dgraph.graphql.schema
alpha2 | I0421 16:11:02.918568 16 log.go:34] 2 [term: 0] received a MsgHeartbeat message with higher term from 3 [term: 2]
alpha2 | I0421 16:11:02.918618 16 log.go:34] 2 became follower at term 2
alpha2 | 2020/04/21 16:11:02 tocommit(1626) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
alpha2 | panic: tocommit(1626) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
alpha2 |
alpha2 | goroutine 310 [running]:
alpha2 | log.Panicf(0x197d97c, 0x5d, 0xc000447320, 0x2, 0x2)
alpha2 | /home/ashish/go/src/log/log.go:358 +0xc0
alpha2 | github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x280a270, 0x197d97c, 0x5d, 0xc000447320, 0x2, 0x2)
alpha2 | /home/ashish/projects/src/github.com/dgraph-io/dgraph/x/log.go:40 +0x53
alpha2 | go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc000162150, 0x65a)
alpha2 | /home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/log.go:203 +0x131
alpha2 | go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc032fe6000, 0x8, 0x2, 0x3, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
alpha2 | /home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1324 +0x54
alpha2 | go.etcd.io/etcd/raft.stepFollower(0xc032fe6000, 0x8, 0x2, 0x3, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
alpha2 | /home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1269 +0x459
alpha2 | go.etcd.io/etcd/raft.(*raft).Step(0xc032fe6000, 0x8, 0x2, 0x3, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
alpha2 | /home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:971 +0x138d
alpha2 | go.etcd.io/etcd/raft.(*node).run(0xc056eae780, 0xc032fe6000)
alpha2 | /home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:357 +0x10c8
alpha2 | created by go.etcd.io/etcd/raft.RestartNode
alpha2 | /home/ashish/projects/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:246 +0x31b
alpha2 | [Sentry] 2020/04/21 16:11:02 ModuleIntegration wasn't able to extract modules: module integration failed
alpha2 | [Sentry] 2020/04/21 16:11:02 Sending fatal event [996089576f5d485a86176f0e8c11debe] to sentry.io project: 1805390
alpha2 | [Sentry] 2020/04/21 16:11:04 Buffer flushed successfully.
alpha2 exited with code 1
Reason for crash:
- After restart, follower initialises its
w
directory and other raft configurations with their default values(last index=0, commit index=0 etc). - Its joins the same alpha group as follower after getting
MsgHeartBeat
from another follower. - Leader sends periodic
MsgHeartBeat
to it with commit index(last commit index value before deletingw
directory) . - Follower checks that commit index is larger than its last index(0). This means raft logs have been truncated or lost and follower panics.
@LGalatin, @Shekar one thing @mrjn suggested, was that we should document all of these frequently occurring errors in some document for our users. Something like we have done in Badger(GitHub - dgraph-io/badger: Fast key-value DB in Go.)