Report a Dgraph Bug
I manually configured a 3-node Zero cluster with the --replicas=3 flag. Then I stopped dgraph zero and dgraph alpha on all 3 nodes and manually removed the p, w, and zw folders, expecting that to be all that was needed for the machines to be treated as a clean install so I could create a new cluster.
I then manually configured a single Zero node, with --replicas=1 instead of 3, on the first node I had previously set dgraph up on. This was after I had stopped dgraph on the machine and removed the folders. After about 10 seconds, dgraph zero panics with this error: “tocommit(21) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?”
Where is the raft log located, and can it be removed so that a new cluster can be configured without wiping the machines? There are other services running on these nodes.
I0812 21:17:49.951857 12813 run.go:108] Setting up grpc listener at: 0.0.0.0:5080
I0812 21:17:49.952200 12813 run.go:108] Setting up http listener at: 0.0.0.0:6080
badger 2020/08/12 21:17:49 INFO: All 0 tables opened in 0s
I0812 21:17:49.974360 12813 node.go:148] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00007a600 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2bcf318 DisableProposalForwarding:false}
I0812 21:17:49.974705 12813 node.go:326] Group 0 found 0 entries
I0812 21:17:49.974783 12813 log.go:34] 1 became follower at term 0
I0812 21:17:49.975065 12813 log.go:34] newRaft 1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
I0812 21:17:49.975162 12813 log.go:34] 1 became follower at term 1
I0812 21:17:49.975691 12813 run.go:311] Running Dgraph Zero...
E0812 21:17:49.976176 12813 raft.go:518] While proposing CID: Not Zero leader. Aborting proposal: cid:"e0d6c305-4a4c-41a4-a69e-fd6f455534f7" . Retrying...
I0812 21:17:49.978315 12813 node.go:185] Setting conf state to nodes:1
I0812 21:17:49.978364 12813 raft.go:708] Done applying conf change at 0x1
I0812 21:17:50.975658 12813 log.go:34] 1 no leader at term 1; dropping index reading msg
I0812 21:17:52.002298 12813 pool.go:160] CONNECTING to my-old-second-zero-ip:5080
I0812 21:17:52.002852 12813 log.go:34] 1 [term: 1] received a MsgHeartbeat message with higher term from 3 [term: 3]
I0812 21:17:52.002965 12813 log.go:34] 1 became follower at term 3
2020/08/12 21:17:52 tocommit(21) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(21) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?
goroutine 185 [running]:
log.Panicf(0x1bef64a, 0x5d, 0xc0001f6380, 0x2, 0x2)
/usr/local/go/src/log/log.go:358 +0xc0
github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x2bcf318, 0x1bef64a, 0x5d, 0xc0001f6380, 0x2, 0x2)
/ext-go/1/src/github.com/dgraph-io/dgraph/x/log.go:40 +0x53
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc0000ee230, 0x15)
/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/log.go:203 +0x131
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc0000c6000, 0x8, 0x1, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1324 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc0000c6000, 0x8, 0x1, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1269 +0x459
go.etcd.io/etcd/raft.(*raft).Step(0xc0000c6000, 0x8, 0x1, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:971 +0x138d
go.etcd.io/etcd/raft.(*node).run(0xc00052b080, 0xc0000c6000)
/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:357 +0x10c8
created by go.etcd.io/etcd/raft.StartNode
/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:233 +0x407
W0812 21:17:52.009653 12763 sentry_integration.go:140] unable to read CID from file /tmp/dgraph-zero-cid-sentry open /tmp/dgraph-zero-cid-sentry: no such file or directory. Skip
[Sentry] 2020/08/12 21:17:52 Sending fatal event [963e856c78da4d41a334a5bb08881fcc] to o318308.ingest.sentry.io project: 1805390
[Sentry] 2020/08/12 21:17:52 Buffer flushed successfully.
The Zero logs show the node trying to connect to an old Zero peer that is no longer running, despite the folders having been removed and a single-Zero configuration having been specified.
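The stale-peer traffic is easy to isolate from the log output. A quick filter (assuming the zero.log path used in the repro steps below) pulls out the relevant lines:

```shell
# Filter the Zero log for connection attempts and raft heartbeats,
# which reveal any old peers the node still believes are in the cluster.
grep -E 'CONNECTING|MsgHeartbeat' zero.log
```

On the log above, this surfaces both the connection to my-old-second-zero-ip:5080 and the higher-term heartbeat from node 3 that triggers the panic.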
What version of Dgraph are you using?
v20.07.0
Have you tried reproducing the issue with the latest release?
I have been unable to fix my current situation, so I haven't tried reproducing it on the latest release yet.
What is the hardware spec (RAM, OS)?
2 GB RAM, Ubuntu
Steps to reproduce the issue (command/config used to run Dgraph).
dgraph zero --replicas=3 --idx=1 --my=zero1-ip:5080 > zero.log 2>&1 &
dgraph zero --replicas=3 --idx=2 --my=zero2-ip:5080 --peer=zero1-ip:5080 > zero.log 2>&1 &
dgraph zero --replicas=3 --idx=3 --my=zero3-ip:5080 --peer=zero1-ip:5080 > zero.log 2>&1 &
dgraph alpha --lru_mb 1024 --my=zero1-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &
dgraph alpha --lru_mb 1024 --my=zero2-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &
dgraph alpha --lru_mb 1024 --my=zero3-ip:7080 --zero=zero1-ip:5080 > alpha.log 2>&1 &
Used the kill command to manually stop the dgraph processes on each node, then removed the p, w, and zw folders:
rm -r p/
rm -r w/
rm -r zw/
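For what it's worth, a more defensive version of the wipe I was attempting might look like this. This is only a sketch, assuming the default p/w/zw directory names in the working directory and that "dgraph" is the only matching process name on the box:

```shell
# Sketch: confirm no dgraph process survived the kill before wiping state.
# Intended to be run on every node that hosted a Zero or an Alpha.
if ps -e | grep -q '[d]graph'; then
  # The [d]graph bracket pattern is a defensive habit so the grep can
  # never match its own invocation.
  echo 'dgraph still running; stop it before wiping state' >&2
  exit 1
fi
rm -rf p/ w/ zw/    # alpha postings, alpha raft state, zero raft state
echo 'local dgraph state removed'
```

This at least rules out a half-killed process re-writing state between the kill and the rm.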
dgraph zero --replicas=1 --idx=1 --my=zero1-ip:5080 > zero.log 2>&1 &
Then dgraph zero panics.
Expected behaviour and actual result.
I expected that stopping all dgraph processes in the cluster and removing the p, w, and zw folders would wipe any configuration, like a clean install.
The actual result suggests that additional state is still hanging around somewhere else on the server, possibly in the dgraph install location.
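Since the panic is triggered by a heartbeat from an old peer (node 3 at term 3), one workaround I could script is a pre-flight check that the old Zero ports are actually closed before starting the new single-Zero cluster. A sketch, assuming bash with /dev/tcp support and the placeholder hostnames from the repro steps:

```shell
# Sketch: check whether anything is still listening on the old Zero ports.
# Replace the hostnames with the real node addresses before use.
for host in zero1-ip zero2-ip zero3-ip; do
  if timeout 2 bash -c "echo > /dev/tcp/$host/5080" 2>/dev/null; then
    echo "$host still has something listening on 5080"
  else
    echo "$host: port 5080 closed"
  fi
done
```

If any host reports a listener, an old Zero (or another process) is still alive there and will re-introduce the stale raft state.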