Error on zero startup: "ZeroProposal: illegal tag 0 (wire type 0)"

Report a Dgraph Bug

Not sure if this is a bug or user-error, but I accidentally used the latest image tag on my 3-node kubernetes deployment. It upgraded from v20.07.2 to v20.11.0 yesterday and ran into errors with version number 7 not 8. I understand that’s because of storage compatibility, this isn’t the issue I’m reporting.

I’ve set both alpha and zero replicas down to 0, downgraded the StatefulSets back to v20.07.2 and then scaled my zero’s StatefulSet back to 3 replicas. Now I get an error on the first zero, here are the logs:

 Dgraph version   : v20.07.2
 Dgraph codename  : shuri-2
 Dgraph SHA-256   : a927845127dab735c24727d5a24af411168771b55236aec50f0b987e8c0ac910
 Commit SHA-1     : a7bc16d56
 Commit timestamp : 2020-10-22 10:17:53 -0700
 Branch           : HEAD
 Go version       : go1.14.4
 For Dgraph official documentation, visit https://dgraph.io/docs/.
 For discussions about Dgraph     , visit https://discuss.dgraph.io.
 To say hi to the community       , visit https://dgraph.slack.com.
 Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
 Copyright 2015-2020 Dgraph Labs, Inc.
 I0123 21:51:45.158537      19 run.go:127] Setting up grpc listener at: 0.0.0.0:5080
 I0123 21:51:45.159438      19 run.go:127] Setting up http listener at: 0.0.0.0:6080
 I0123 21:51:45.160751      19 run.go:298] Opening zero BadgerDB with options: {Dir:zw ValueDir:zw SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0xc000
 badger 2021/01/23 21:51:45 INFO: All 1 tables opened in 5ms
 badger 2021/01/23 21:51:45 INFO: Replaying file id: 6 at offset: 42785522
 badger 2021/01/23 21:51:45 INFO: Replay took: 4.921µs
 I0123 21:51:45.208055      19 node.go:149] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc000112050 Applied:1444061 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864
 I0123 21:51:45.208368      19 node.go:307] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:1444061 Term:3 XXX_unrecognized:[]}
 I0123 21:51:45.208468      19 node.go:318] Found hardstate: {Term:7 Vote:0 Commit:1445344 XXX_unrecognized:[]}
 I0123 21:51:45.209336      19 node.go:327] Group 0 found 1283 entries
 I0123 21:51:45.209362      19 raft.go:515] Restarting node for dgraphzero
 I0123 21:51:45.209427      19 node.go:186] Setting conf state to nodes:1 nodes:2 nodes:3
 I0123 21:51:45.209593      19 pool.go:160] CONNECTING to dgraph-alpha-0.dgraph-alpha.default.svc.cluster.local:7080
 I0123 21:51:45.209619      19 pool.go:160] CONNECTING to dgraph-alpha-1.dgraph-alpha.default.svc.cluster.local:7080
 I0123 21:51:45.209642      19 pool.go:160] CONNECTING to dgraph-alpha-2.dgraph-alpha.default.svc.cluster.local:7080
 I0123 21:51:45.209665      19 pool.go:160] CONNECTING to dgraph-zero-1.dgraph-zero.default.svc.cluster.local:5080
 I0123 21:51:45.209724      19 pool.go:160] CONNECTING to dgraph-zero-2.dgraph-zero.default.svc.cluster.local:5080
 I0123 21:51:45.210211      19 log.go:34] 1 became follower at term 7
 I0123 21:51:45.210517      19 log.go:34] newRaft 1 [peers: [1,2,3], term: 7, commit: 1445344, applied: 1444061, lastindex: 1445344, lastterm: 7]
 [Sentry] 2021/01/23 21:51:45 Sending fatal event [1118306852724b18a70f595e9e64d2cc] to o318308.ingest.sentry.io project: 1805390
 2021/01/23 21:51:45 proto: ZeroProposal: illegal tag 0 (wire type 0)
 github.com/dgraph-io/dgraph/x.Check
     /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
 github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
     /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:324
 github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
     /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:79
 github.com/spf13/cobra.(*Command).execute
     /go/pkg/mod/github.com/spf13/[email protected]/command.go:830
 github.com/spf13/cobra.(*Command).ExecuteC
     /go/pkg/mod/github.com/spf13/[email protected]/command.go:914
 github.com/spf13/cobra.(*Command).Execute
     /go/pkg/mod/github.com/spf13/[email protected]/command.go:864
 github.com/dgraph-io/dgraph/dgraph/cmd.Execute
     /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:70
 main.main
     /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:78
 runtime.main
     /usr/local/go/src/runtime/proc.go:203
 runtime.goexit
     /usr/local/go/src/runtime/asm_amd64.s:1373

What is the hardware spec (RAM, OS)?

Each zero/alpha is on a 4CPU 16GB node

Expected behaviour and actual result.

I can downgrade back to v20.07

Also, my mistake of using latest has happened on 5 different clusters, and following the steps of downgrading as described above has worked on 4 of them, just 1 is getting the error.

Hello, I have the same problem. do you fix it ?

Hey, this is happening because the raftwal of 20.07 and 20.11 are not compatible. Upgrading a major version should be done via backup/restore process.

In your case this might have happened. The cluster had zeros on different versions at the same time. So zero at v20.11 would have propagated the ZeroProposal to zero at v20.07. Hence, when reading the proposal we saw this error because of incompatibilty of the raftwal across versions.

Soo. Today we’re also facing this issue in our production environment:

I1103 11:02:51.312368      19 run.go:254] Setting Config to: {raft:0xc000012108 telemetry:0xc0000120d0 limit:0xc000012110 bindall:true portOffset:3 numReplicas:1 peer: w:/data/zw rebalanceInterval:480000000000 tlsClientConfig:<nil> audit:<nil> limiterConfig:0xc0005a91d0}
I1103 11:02:51.312430      19 run.go:144] Setting up grpc listener at: 0.0.0.0:5083
I1103 11:02:51.312700      19 run.go:144] Setting up http listener at: 0.0.0.0:6083
I1103 11:02:51.312868      19 log.go:295] Found file: 50 First Index: 1470001
I1103 11:02:51.312931      19 storage.go:125] Init Raft Storage with snap: 1489167, first: 1489168, last: 1489245
I1103 11:02:51.313031      19 node.go:152] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc000172880 Applied:1489167 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x33e3080 DisableProposalForwarding:false}
I1103 11:02:51.313559      19 node.go:310] Found Snapshot.Metadata: {ConfState:{Nodes:[1] Learners:[] XXX_unrecognized:[]} Index:1489167 Term:13 XXX_unrecognized:[]}
I1103 11:02:51.313592      19 node.go:321] Found hardstate: {Term:13 Vote:1 Commit:1489245 XXX_unrecognized:[]}
I1103 11:02:51.323965      19 node.go:326] Group 0 found 19245 entries
I1103 11:02:51.324006      19 raft.go:605] Restarting node for dgraphzero
I1103 11:02:51.324024      19 node.go:189] Setting conf state to nodes:1 
I1103 11:02:51.324469      19 pool.go:162] CONNECTING to 10.5.1.3:7083
I1103 11:02:51.324512      19 log.go:34] 1 became follower at term 13
I1103 11:02:51.324520      19 log.go:34] newRaft 1 [peers: [1], term: 13, commit: 1489245, applied: 1489167, lastindex: 1489245, lastterm: 13]
2021/11/03 11:02:51 proto: ZeroProposal: illegal tag 0 (wire type 0)
github.com/dgraph-io/dgraph/x.Check
    /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
    /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:338
github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
    /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:83
github.com/spf13/cobra.(*Command).execute
    /go/pkg/mod/github.com/spf13/[email protected]/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
    /go/pkg/mod/github.com/spf13/[email protected]/command.go:914
github.com/spf13/cobra.(*Command).Execute
    /go/pkg/mod/github.com/spf13/[email protected]/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
    /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:78
main.main
    /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:99
runtime.main
    /usr/local/go/src/runtime/proc.go:225
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1371

We’re using dgraph alpha and zero via docker images with tag 21.03.1.

During deployment we’re restarting these docker containers.

We did not change versions during the whole lifetime of our production server.

@Naman help please! How can we export/backup our data and/or make zero functional again?

1 Like

Hey @maaft, not sure why this might have happened as you have already mentioned that Dgraph version was kept the same 21.03.1 throughout the lifetime.

One way to fix it would be to recreate a cluster from the existing p directory. Basically, you would have to copy over the p directory. Bump the UID, timestamps on zero, and namespace ID (if you are using multi-tenancy) https://dgraph.io/docs/deploy/dgraph-zero/#endpoints. Then you can enable the traffic on the cluster.

cc: @dmai

Could you please elaborate how to do this? Step by step instructions would be really helpful.

You can run the following call on zero for bumping the uids by . Similarly for others.

curl "http://localhost:6080/assign?what=uids&num=<max leased uid>"

If you know by any chance, know the output of the latest /state endpoint’s output. If your alpha is still running, then you can query /state and get the maxUID, maxTxnTs, maxNsId from there.
Else, we would have to get those values through a custom script by iterating over the DB.

Pardon if I don’t understand correctly, but alpha doesn’t seem to have a /state endpoint.

Zero is of course down, due to the error reported above.

Can you please give me an example how I can iterate over the DB to get those values?

I still have my zw folder for zero (with *.wal and wal.meta). Can I extract the IDs somehow by introspecting these files? If yes, what tools do I need to use?