Missing Raft index assertion failure

Moved from GitHub dgraph/3466

Posted by danielmai:

What version of Dgraph are you using?

v1.0.15-rc9

Have you tried reproducing the issue with latest release?

No, not with v1.0.14 or master.

What is the hardware spec (RAM, OS)?

Amazon Linux 4.14.114-105.126.amzn2.x86_64, 8 cores, 64 GB.

Steps to reproduce the issue (command/config used to run Dgraph)

Raft WAL got into a bad state with a missing entry. I’m not sure how to reproduce this.

Actual result

panic: applied(8517361) is out of range [prevApplied(8516979), committed(8517360)]

goroutine 275 [running]:
log.Panicf(0x1584c78, 0x3c, 0xc1297889c0, 0x3, 0x3)
        /usr/local/go/src/log/log.go:340 +0xc0
github.com/dgraph-io/dgraph/conn.(*ToGlog).Panicf(0x1f847f0, 0x1584c78, 0x3c, 0xc1297889c0, 0
x3, 0x3)
        /tmp/go/src/github.com/dgraph-io/dgraph/conn/node.go:94 +0x53
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*raftLog).appliedTo(0xc00022e690, 0x
81f6f1)
        /tmp/go/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/log.go:214 +0x131
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*node).run(0xc0002404e0, 0xc0000f678
0)
        /tmp/go/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:418 +0x5f
a
created by github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.StartNode
        /tmp/go/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:233 +0x40
8

When inspecting the w directory using dgraph debug -w=./w, I see the following:

Snapshot Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:8465196 Term:8150 XXX_unrecognized:[]}
Snapshot Alpha: {Context:id:3 group:1 addr:"127.0.0.1:7080"  Index:8465196 ReadTs:15144868 Done:false SinceTs:0 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Hardstate: {Term:8190 Vote:3 Commit:8517360 XXX_unrecognized:[]}
Checkpoint: 8516149
Last Index: 8517361 . Num Entries: 52164 .

...
8169 . 8517358 . EntryNormal . 70 B   . Delta . Max: 15217709 . 15217707 → 15217708 . Pending txns: 0 .
8169 . 8517359 . EntryNormal . 182 kB . Mutation . StartTs: 15217709 . Edges: 3620 . Pending txns: 1 .
8169 . 8517361 . EntryNormal . 69 B   . Delta . Max: 15217711 . 15217709 → 15217710 . Pending txns: 0 .

There are a couple of anomalies I see in the dgraph debug output:

  • The entry with index 8517360 is missing, but it’s set as committed in the HardState.
  • The HardState mentions term 8190 while the latest proposals 8517359 and 8517361 are from term 8169.

Every Alpha restart afterwards consistently leads to this error:

panic: Key not found

goroutine 3846 [running]:
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*raftLog).slice(0xc000192b60, 0x81f6f0, 0x81f6f1, 0x4000000, 0xc0ad04a000, 0x18a, 0x238, 0x0, 0xc00c196948)
    /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/log.go:322 +0x6d3
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*raftLog).nextEnts(0xc000192b60, 0x60, 0x60, 0xc354d18600)
    /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/log.go:154 +0x79
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.newReady(0xc306ff2280, 0xc337ca6190, 0x1ffe, 0x3, 0x81f6f0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:583 +0x8b
github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.(*node).run(0xc386271a40, 0xc306ff2280)
    /ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:319 +0x1664
created by github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft.RestartNode
/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/go.etcd.io/etcd/raft/node.go:246 +0x31b

Workaround

One workaround for this issue is to use removeNode for this Alpha and create a new one after the other Alphas have created a fresh snapshot. The new w entries after the snapshot look correct: the term election makes sense and there aren’t any missing Raft entries in the log.

campoy commented :

Hey @danielmai,

This looks like a pretty serious issue, unless it’s something we can’t reproduce on 1.1.0 nor 1.0.16.

sleto-it commented :

Hi all,

I was wondering if we have seen / reproduced this again in more recent versions

Many Thanks,