Where is the raft log located? "tocommit(21) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?"


Report a Dgraph Bug

I manually configured a 3-node Zero cluster with the --replicas=3 flag. Then I stopped dgraph zero and dgraph alpha on all 3 nodes and manually removed the p, w, and zw folders, expecting that to be all that was needed for the machines to be treated as a clean install so I could create a new cluster.

I then manually configured a single Zero node with --replicas=1 instead of 3 on the first node I had previously set dgraph up on. This was after I had stopped dgraph on the machine and removed the folders. After about 10 seconds dgraph zero panics with this error: "tocommit(21) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?"

Where is the raft log located, and can it be removed so that a new cluster can be configured without wiping the machines? There are other services running on the nodes.

I0812 21:17:49.951857   12813 run.go:108] Setting up grpc listener at: 0.0.0.0:5080
I0812 21:17:49.952200   12813 run.go:108] Setting up http listener at: 0.0.0.0:6080
badger 2020/08/12 21:17:49 INFO: All 0 tables opened in 0s
I0812 21:17:49.974360   12813 node.go:148] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00007a600 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2bcf318 DisableProposalForwarding:false}
I0812 21:17:49.974705   12813 node.go:326] Group 0 found 0 entries
I0812 21:17:49.974783   12813 log.go:34] 1 became follower at term 0
I0812 21:17:49.975065   12813 log.go:34] newRaft 1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
I0812 21:17:49.975162   12813 log.go:34] 1 became follower at term 1
I0812 21:17:49.975691   12813 run.go:311] Running Dgraph Zero...
E0812 21:17:49.976176   12813 raft.go:518] While proposing CID: Not Zero leader. Aborting proposal: cid:"e0d6c305-4a4c-41a4-a69e-fd6f455534f7" . Retrying...
I0812 21:17:49.978315   12813 node.go:185] Setting conf state to nodes:1
I0812 21:17:49.978364   12813 raft.go:708] Done applying conf change at 0x1
I0812 21:17:50.975658   12813 log.go:34] 1 no leader at term 1; dropping index reading msg
I0812 21:17:52.002298   12813 pool.go:160] CONNECTING to my-old-second-zero-ip:5080
I0812 21:17:52.002852   12813 log.go:34] 1 [term: 1] received a MsgHeartbeat message with higher term from 3 [term: 3]
I0812 21:17:52.002965   12813 log.go:34] 1 became follower at term 3
2020/08/12 21:17:52 tocommit(21) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(21) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?

goroutine 185 [running]:
log.Panicf(0x1bef64a, 0x5d, 0xc0001f6380, 0x2, 0x2)
        /usr/local/go/src/log/log.go:358 +0xc0
github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x2bcf318, 0x1bef64a, 0x5d, 0xc0001f6380, 0x2, 0x2)
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/log.go:40 +0x53
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0000ee230, 0x15)
        /go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/log.go:203 +0x131
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc0000c6000, 0x8, 0x1, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1324 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc0000c6000, 0x8, 0x1, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1269 +0x459
go.etcd.io/etcd/raft.(*raft).Step(0xc0000c6000, 0x8, 0x1, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:971 +0x138d
go.etcd.io/etcd/raft.(*node).run(0xc00052b080, 0xc0000c6000)
        /go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:357 +0x10c8
created by go.etcd.io/etcd/raft.StartNode
        /go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:233 +0x407
W0812 21:17:52.009653   12763 sentry_integration.go:140] unable to read CID from file /tmp/dgraph-zero-cid-sentry open /tmp/dgraph-zero-cid-sentry: no such file or directory. Skip
[Sentry] 2020/08/12 21:17:52 Sending fatal event [963e856c78da4d41a334a5bb08881fcc] to o318308.ingest.sentry.io project: 1805390
[Sentry] 2020/08/12 21:17:52 Buffer flushed successfully.

The Zero logs show the node trying to connect to an old Zero that is no longer running, despite the folders being removed and a single-Zero configuration being specified.

What version of Dgraph are you using?

v20.07.0

Have you tried reproducing the issue with the latest release?

I have been unable to fix my current situation, so I haven't tried re-creating it yet.

What is the hardware spec (RAM, OS)?

2 GB RAM, Ubuntu

Steps to reproduce the issue (command/config used to run Dgraph).

dgraph zero --replicas=3 --idx=1 --my=zero1-ip:5080 > zero.log 2>&1 &
dgraph zero --replicas=3 --idx=2 --my=zero2-ip:5080 --peer=zero1-ip:5080 > zero.log 2>&1 &
dgraph zero --replicas=3 --idx=3 --my=zero2-ip:5080 --peer=zero1-ip:5080 > zero.log 2>&1 &

dgraph alpha --lru_mb 1024 --my=zero1-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &
dgraph alpha --lru_mb 1024 --my=zero2-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &
dgraph alpha --lru_mb 1024 --my=zero3-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &

Used the kill command to manually kill the dgraph processes on each node.
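
Roughly like this on each node (the PIDs here are made up for illustration):

ps aux | grep dgraph   # find the dgraph PIDs on this node
kill 1234              # SIGTERM for the running zero process
kill 1235              # SIGTERM for the running alpha process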

Removed the p, w, and zw folders:
rm -r p/
rm -r w/
rm -r zw/

dgraph zero --replicas=1 --idx=1 --my=zero1-ip:5080 > zero.log 2>&1 &

Then dgraph zero panics.

Expected behaviour and actual result.

I expected that stopping all dgraph processes in the cluster and removing the p, w, and zw folders would wipe any configuration, like a clean install.

The actual result suggests that some additional configuration is still hanging around somewhere else on the server, possibly in the dgraph install location.
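
For reference, this is how I'd look for leftover dgraph state directories elsewhere on the disk (a sketch; zw is the most distinctive of the three default names, since searching for p or w alone matches far too much):

sudo find / -type d -name zw -not -path '/proc/*' 2>/dev/null   # list every zw directory on the machine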

Hey @JASGames,

Thanks for sharing this.

When you run the following commands

dgraph zero --replicas=3 --idx=1 --my=zero1-ip:5080 > zero.log 2>&1 &
dgraph zero --replicas=3 --idx=2 --my=zero2-ip:5080 --peer=zero1-ip:5080 > zero.log 2>&1 &
dgraph zero --replicas=3 --idx=3 --my=zero2-ip:5080 --peer=zero1-ip:5080 > zero.log 2>&1 &

dgraph alpha --lru_mb 1024 --my=zero1-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &
dgraph alpha --lru_mb 1024 --my=zero2-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &
dgraph alpha --lru_mb 1024 --my=zero3-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &

only 1 alpha and 1 zero will start. To start multiple nodes on the same machine you'll need to provide a port offset, as shown here.
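
Something like this, for example (a sketch, not exact commands; check dgraph zero --help and dgraph alpha --help on your version for -o/--port_offset, -w/--wal, and -p/--postings):

dgraph zero --replicas=3 --idx=1 -o 0 -w zw1 > zero1.log 2>&1 &
dgraph zero --replicas=3 --idx=2 -o 1 -w zw2 --peer=localhost:5080 > zero2.log 2>&1 &
dgraph zero --replicas=3 --idx=3 -o 2 -w zw3 --peer=localhost:5080 > zero3.log 2>&1 &

dgraph alpha --lru_mb 1024 -o 0 -p p1 -w w1 --zero=localhost:5080 > alpha1.log 2>&1 &
dgraph alpha --lru_mb 1024 -o 1 -p p2 -w w2 --zero=localhost:5080 > alpha2.log 2>&1 &
dgraph alpha --lru_mb 1024 -o 2 -p p3 -w w3 --zero=localhost:5080 > alpha3.log 2>&1 &

Each node gets its own WAL/postings directory and a unique port offset so the default ports don't collide.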

Still, starting a zero at the end should have worked. Can you please share the command that you're using to kill the instances of alphas and zeros?

Sorry, I didn't make it specifically clear, but each of the dgraph zero and dgraph alpha commands was run on a separate machine, not on a single machine. I killed the dgraph processes using the Linux kill PID command, so if the running dgraph zero process ID on a machine was 1234, I would have run kill 1234.

You can see in the error message above, I0812 21:17:52.002298 12813 pool.go:160] CONNECTING to my-old-second-zero-ip:5080, which suggests that etcd or some config somewhere still expects my old cluster nodes to be running, even after killing all processes across all three machines and removing the p, w, and zw folders. I got the error trying to run a single-node cluster on the first machine with this command: dgraph zero --replicas=1 --idx=1 --my=zero1-ip:5080 > zero.log 2>&1 &

I ran the above command after I had killed all the processes across all three machines and removed all the folders.
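
For what it's worth, since the default p, w, and zw paths are relative, I assume each dgraph process creates them in whatever working directory it was started from, so on Linux something like this (a sketch, assuming exactly one zero process is running) would show which directory a running zero is actually using:

ZERO_PID=$(pgrep -f 'dgraph zero')   # assumes a single matching process
ls -l /proc/"$ZERO_PID"/cwd          # symlink to the process's working directory
ls /proc/"$ZERO_PID"/cwd/            # should list zw if the state lives there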

Hope that makes sense :sweat_smile: