Dgraph Zero crashes with Fatal error along with infinite loop in Alpha

Hi. I run Dgraph in my GKE cluster, deployed via Helm. All 3 Zeros crashed, and their logs showed the output below. I don’t know what triggered these errors; all I was doing was some admin operations (updating the schema, enabling/disabling logging, querying the health of the DB, etc.):

I0125 16:06:02.427026      18 run.go:185] Setting Config to: {bindall:true portOffset:0 nodeId:1 numReplicas:5 peer: w:zw rebalanceInterval:480000000000 tlsClientConfig:<nil>}
I0125 16:06:02.427081      18 run.go:98] Setting up grpc listener at: 0.0.0.0:5080
I0125 16:06:02.428113      18 run.go:98] Setting up http listener at: 0.0.0.0:6080
I0125 16:06:02.429155      18 log.go:295] Found file: 1 First Index: 1
I0125 16:06:02.429492      18 storage.go:132] Init Raft Storage with snap: 166, first: 167, last: 173
I0125 16:06:02.469765      18 node.go:152] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc000606320 Applied:166 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2e00cb8 DisableProposalForwarding:false}
[Sentry] 2021/01/25 16:06:02 Sending fatal event [1ac1142b7d8f44b8b8026cb497d8859f] to o318308.ingest.sentry.io project: 5208688
I0125 16:06:02.474900      18 node.go:310] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:166 Term:2 XXX_unrecognized:[]}
I0125 16:06:02.475149      18 node.go:321] Found hardstate: {Term:4 Vote:1 Commit:173 XXX_unrecognized:[]}
I0125 16:06:02.475345      18 node.go:326] Group 0 found 173 entries
I0125 16:06:02.475356      18 raft.go:542] Restarting node for dgraphzero
I0125 16:06:02.475373      18 node.go:189] Setting conf state to nodes:1 nodes:2 nodes:3
2021/01/25 16:06:02 proto: wrong wireType = 0 for field Groups
github.com/dgraph-io/dgraph/x.Check
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*node).initAndStartNode
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/raft.go:550
github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:254
github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:75
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:71
main.main
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:102
runtime.main
/usr/local/go/src/runtime/proc.go:204
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1374

Since all 3 Zeros crashed, all Alpha nodes print these logs in an infinite loop as they try to reconnect again and again:

I0125 08:12:43.888397      17 log.go:34] 1 received MsgPreVoteResp from 1 at term 3
I0125 08:12:43.888487      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 2 at term 3
I0125 08:12:43.888537      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 3 at term 3
I0125 08:12:47.688260      17 log.go:34] 1 is starting a new election at term 3
I0125 08:12:47.688300      17 log.go:34] 1 became pre-candidate at term 3
I0125 08:12:47.688309      17 log.go:34] 1 received MsgPreVoteResp from 1 at term 3
I0125 08:12:47.688342      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 2 at term 3
I0125 08:12:47.688354      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 3 at term 3
W0125 08:12:48.688573      17 node.go:420] Unable to send message to peer: 0x2. Error: Unhealthy connection
I0125 08:12:51.488153      17 log.go:34] 1 is starting a new election at term 3
I0125 08:12:51.488206      17 log.go:34] 1 became pre-candidate at term 3
I0125 08:12:51.488217      17 log.go:34] 1 received MsgPreVoteResp from 1 at term 3
I0125 08:12:51.488235      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 2 at term 3
I0125 08:12:51.488247      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 3 at term 3
W0125 08:12:52.488776      17 node.go:420] Unable to send message to peer: 0x3. Error: Unhealthy connection
I0125 08:12:55.288135      17 log.go:34] 1 is starting a new election at term 3
I0125 08:12:55.288209      17 log.go:34] 1 became pre-candidate at term 3
I0125 08:12:55.288220      17 log.go:34] 1 received MsgPreVoteResp from 1 at term 3
I0125 08:12:55.288236      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 2 at term 3
I0125 08:12:55.288248      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 3 at term 3
I0125 08:12:59.088173      17 log.go:34] 1 is starting a new election at term 3
I0125 08:12:59.088277      17 log.go:34] 1 became pre-candidate at term 3
I0125 08:12:59.088286      17 log.go:34] 1 received MsgPreVoteResp from 1 at term 3
I0125 08:12:59.088302      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 2 at term 3
I0125 08:12:59.088314      17 log.go:34] 1 [logterm: 3, index: 229] sent MsgPreVote request to 3 at term 3
W0125 08:13:00.088562      17 node.go:420] Unable to send message to peer: 0x2. Error: Unhealthy connection

None of the Dgraph operations work after this error. Currently, I am recreating Dgraph instances after destroying all the volumes to temporarily work around this.

I faced the same issue again today. Dgraph crashed with these logs:

[Sentry] 2021/02/13 11:53:31 Sending fatal event [249b8b06c99545fda20fb5f23012bcd5] to o318308.ingest.sentry.io project: 5208688
2021/02/13 11:53:31 proto: wrong wireType = 0 for field Groups
github.com/dgraph-io/dgraph/x.Check
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*node).initAndStartNode
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/raft.go:550
github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:254
github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:75
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:71
main.main
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:102
runtime.main
/usr/local/go/src/runtime/proc.go:204
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1374

@ibrahim, might you have an idea what the cause is?

@tvvignesh, which versions of Dgraph are you using?

Hey @tvvignesh, are you running different builds of v20.11 (RCs, the release, or some other commit) over the same set of directories? This error generally happens when the proto file versions differ across those commits. Can you explain a bit more about your setup?
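
To illustrate why mismatched proto definitions can produce `wrong wireType = 0 for field Groups`, here is a minimal sketch in Go. It is not Dgraph's code: the field number 2 and the helper names `splitTag` and `checkGroupsTag` are assumptions made up for the example. It relies only on the protobuf wire format, where every field is prefixed with a tag equal to `(fieldNumber << 3) | wireType`, and a message or map field must use wire type 2 (length-delimited) rather than 0 (varint).

```go
package main

import "fmt"

// Protobuf wire types, as defined by the protobuf encoding rules.
const (
	wireVarint = 0 // integers, bools, enums
	wireBytes  = 2 // strings, bytes, embedded messages, map entries
)

// splitTag splits a single-byte field tag into its field number and
// wire type. A tag is encoded as (fieldNumber << 3) | wireType.
func splitTag(tag byte) (fieldNum int, wireType int) {
	return int(tag >> 3), int(tag & 0x7)
}

// checkGroupsTag mimics the kind of check a generated Unmarshal method
// performs before decoding a field. The field number is an assumption
// for illustration only, not taken from Dgraph's proto file.
func checkGroupsTag(tag byte) error {
	const groupsField = 2 // hypothetical field number
	fieldNum, wt := splitTag(tag)
	if fieldNum == groupsField && wt != wireBytes {
		return fmt.Errorf("proto: wrong wireType = %d for field Groups", wt)
	}
	return nil
}

func main() {
	// Data written by a build where this field number holds a varint.
	oldTag := byte(2<<3 | wireVarint)
	// Data written by a build where it holds a map or message.
	newTag := byte(2<<3 | wireBytes)

	fmt.Println(checkGroupsTag(oldTag)) // proto: wrong wireType = 0 for field Groups
	fmt.Println(checkGroupsTag(newTag)) // <nil>
}
```

If something like this is what happened, the bytes stored in the Zero directories would have been written under one definition of the message and read back under another, which would explain why Zero fails on every restart until the data is removed.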
