Dgraph nodes not starting properly on v20.11-rc3

Report a Dgraph Bug

What version of Dgraph are you using?

v20.11-rc3

Have you tried reproducing the issue with the latest release?

I am attempting to run the latest release (v20.11-rc3), but the nodes are unable to communicate with one another. This is not an issue when the same nodes are run using v20.11-rc1.

What is the hardware spec (RAM, OS)?

Three Dgraph Alpha nodes, each with 512 GB of memory, and one Zero node with 128 GB of memory, all running CentOS 7. The Dgraph nodes run in Docker containers, and all containers are attached to a Docker network. Communication between nodes on the Docker network is not an issue when running v20.11.0-rc1-130-gfab88c093.

Steps to reproduce the issue (command/config used to run Dgraph).

A dataset was created using the bulk loader (v20.11.0-rc1-130-gfab88c093). The p, w, and t directories were distributed to the Alpha nodes, and the zw directory was copied to the Zero node.

Expected behaviour and actual result.

I expected the Dgraph Alpha and Zero nodes running v20.11-rc3 to start properly and begin communicating normally.

Whenever the v20.11-rc3 nodes are started, the Alpha nodes repeatedly output errors such as:

E1209 15:58:37.617404      20 run.go:772] Error while retrieving cors origins: : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection
E1209 15:58:38.592411      20 groups.go:1142] Error during SubscribeForUpdates for prefix "\x00\x00\vdgraph.cors\x00": error from client.subscribe: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.1.29:7081: connect: connection refused". closer err: <nil>
E1209 15:58:38.596195      20 groups.go:1142] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": error from client.subscribe: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.1.29:7081: connect: connection refused". closer err: <nil>
I1209 15:58:38.600729      20 admin.go:683] Error reading GraphQL schema: : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection.
E1209 15:58:38.618941      20 run.go:772] Error while retrieving cors origins: : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection

Hi,

This looks like a connection error. Can you make sure that all the instances of Alpha and Zero are shut down before you try running them again?
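To tell a refused connection apart from a Dgraph-level problem, a plain TCP dial from inside an Alpha container can help. This is only a minimal sketch: the helper name `can_connect` is hypothetical, and the address 10.0.1.29:7081 mentioned in the comment is simply the one that appears in the log above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", unreachable hosts, and timeouts.
        return False


# Example call (address taken from the log above):
#   can_connect("10.0.1.29", 7081)
```

If this returns False while the container is up, nothing is listening on that port, which would suggest the process behind it never finished starting rather than a network problem.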

Hey chewxy,

Thank you for the reply.

I have made sure all Alpha and Zero instances are shut down before bringing them back up. I have also just confirmed on another dataset that, after performing a bulk load with v20.11.0-rc1-130-gfab88c093, I am unable to start Alpha and Zero nodes running v20.11-rc3, and they log the same errors.

After a little more investigation, it seems the problem is actually with Zero. The error logged by Zero is:

I1210 16:36:29.715793       1 node.go:189] Setting conf state to nodes:1
2020/12/10 16:36:29 proto: wrong wireType = 0 for field Groups
github.com/dgraph-io/dgraph/x.Check
	/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*node).initAndStartNode
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/raft.go:525
github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:254
github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:75
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:71
main.main
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:102
runtime.main
	/usr/local/go/src/runtime/proc.go:204
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1374

I deleted the zw directory and ran v20.11-rc3 again. This time all Dgraph Zero and Alpha instances started successfully and the data was available.

I also tested this with v20.11.0-rc1-164-g7f16bf14 and observed the same behavior. The nodes were unable to start when given the zw directory created during the bulk load, but worked properly if I deleted the zw directory and allowed Zero to start fresh.

It seems the zw directory produced by bulk loading with earlier v20.11 release candidates is incompatible with the newer release candidates.

Looks like we might have updated the protobuf structs in between versions.
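For context on the `wrong wireType = 0` message: in the protobuf wire format, every field is prefixed by a varint key equal to `(field_number << 3) | wire_type`. A decoder reports a wire-type error when the stored bytes carry a different wire type than the current message definition expects for that field number, which is what one would see if the definition changed between builds. The sketch below only illustrates the key encoding; the field number 2 used in the examples is hypothetical and not taken from Dgraph's .proto files.

```python
from typing import Tuple

# Wire types defined by the protobuf encoding (3 and 4 are the deprecated group markers).
WIRE_TYPE_NAMES = {0: "varint", 1: "64-bit", 2: "length-delimited", 5: "32-bit"}


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:               # high bit clear means last byte
            return result, pos
        shift += 7


def decode_key(buf: bytes, pos: int = 0) -> Tuple[int, int, int]:
    """Decode a field key; return (field_number, wire_type, next_pos)."""
    key, pos = decode_varint(buf, pos)
    return key >> 3, key & 0x7, pos


# Hypothetical field 2 written as a nested message or map entry (wire type 2):
print(decode_key(bytes([0x12])))  # (2, 2, 1)
# The same field number written as a plain integer (wire type 0):
print(decode_key(bytes([0x10])))  # (2, 0, 1)
```

A reader expecting the first form but finding the second would reject the data, which fits a zw directory written by one build being read by another.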

@ibrahim thoughts?

There was an issue on master which was fixed by

@rmshivers42 can you please try the latest master from a clean state (delete your p, w, zw, etc. directories)?