Moved from GitHub dgraph/3466
Posted by danielmai:
What version of Dgraph are you using?
v1.0.15-rc9
Have you tried reproducing the issue with latest release?
No, not with v1.0.14 or master.
What is the hardware spec (RAM, OS)?
Amazon Linux 4.14.114-105.126.amzn2.x86_64, 8 cores, 64 GB.
Steps to reproduce the issue (command/config used to run Dgraph)
Raft WAL got into a bad state with a missing entry. I’m not sure how to reproduce this.
Actual result
panic: applied(8517361) is out of range [prevApplied(8516979), committed(8517360)]
goroutine 275 [running]:
log.Panicf(0x1584c78, 0x3c, 0xc1297889c0, 0x3, 0x3)
/usr/local/go/src/log/log.go:340 +0xc0
github.com/dgraph-io/dgraph/conn.(*ToGlog).Panicf(0x1f847f0, 0x1584c78, 0x3c, 0xc1297889c0, 0
x3, 0x3)
/tmp/go/src/github.com/dgraph-io/dgraph/conn/node.go:94 +0x53
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*raftLog).appliedTo(0xc00022e690, 0x
81f6f1)
/tmp/go/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/log.go:214 +0x131
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*node).run(0xc0002404e0, 0xc0000f678
0)
/tmp/go/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:418 +0x5f
a
created by github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.StartNode
/tmp/go/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:233 +0x40
8
When inspecting the w directory using dgraph debug -w=./w
, I see the following:
Snapshot Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:8465196 Term:8150 XXX_unrecognized:[]}
Snapshot Alpha: {Context:id:3 group:1 addr:"127.0.0.1:7080" Index:8465196 ReadTs:15144868 Done:false SinceTs:0 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
Hardstate: {Term:8190 Vote:3 Commit:8517360 XXX_unrecognized:[]}
Checkpoint: 8516149
Last Index: 8517361 . Num Entries: 52164 .
...
8169 . 8517358 . EntryNormal . 70 B . Delta . Max: 15217709 . 15217707 → 15217708 . Pending txns: 0 .
8169 . 8517359 . EntryNormal . 182 kB . Mutation . StartTs: 15217709 . Edges: 3620 . Pending txns: 1 .
8169 . 8517361 . EntryNormal . 69 B . Delta . Max: 15217711 . 15217709 → 15217710 . Pending txns: 0 .
There are a couple of anomalies I see in the dgraph debug
output:
- The entry with index
8517360
is missing, but it’s set as committed in the HardState. - The HardState mentions term 8190 while the latest proposals 8517359 and 8517361 are from term 8169.
Every Alpha restart afterwards consistently leads to this error:
panic: Key not found
goroutine 3846 [running]:
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*raftLog).slice(0xc000192b60, 0x81f6f0, 0x81f6f1, 0x4000000, 0xc0ad04a000, 0x18a, 0x238, 0x0, 0xc00c196948)
/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/log.go:322 +0x6d3
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*raftLog).nextEnts(0xc000192b60, 0x60, 0x60, 0xc354d18600)
/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/log.go:154 +0x79
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.newReady(0xc306ff2280, 0xc337ca6190, 0x1ffe, 0x3, 0x81f6f0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:583 +0x8b
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*node).run(0xc386271a40, 0xc306ff2280)
/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:319 +0x1664
created by github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.RestartNode
/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:246 +0x31b
Workaround
One workaround for this issue is to use removeNode for this Alpha and create a new one after the other Alphas have created a fresh snapshot. The new w entries after the snapshot look correct: the term election makes sense and there aren’t any missing Raft entries in the log.