Dgraph nodes not starting properly on v20.11-rc3

rmshivers42 · December 9, 2020, 4:37pm

Report a Dgraph Bug

What version of Dgraph are you using?

v20.11-rc3

Have you tried reproducing the issue with the latest release?

I am attempting to run the latest release (v20.11-rc3), but the nodes are unable to communicate with one another. This is not an issue when the same nodes are run using v20.11-rc1.

What is the hardware spec (RAM, OS)?

3 dgraph alpha nodes each with 512 GB of memory and 1 zero node with 128 GB of memory running CentOS 7. Dgraph nodes are running in docker containers and all containers are attached to a network. Communication between nodes on the docker network is not an issue when run using v20.11.0-rc1-130-gfab88c093.

Steps to reproduce the issue (command/config used to run Dgraph).

A dataset was created using the bulk-loader (v20.11.0-rc1-130-gfab88c093) and the p, w, and t directories were distributed to the alpha nodes and the zw directory to the zero node.

Expected behaviour and actual result.

Expected dgraph alpha and zero nodes running dgraph v20.11-rc3 to start properly and for alpha and zero nodes to begin communicating normally.

Whenever the v20.11-rc3 nodes are started the alpha nodes repeatedly output errors such as:

E1209 15:58:37.617404      20 run.go:772] Error while retrieving cors origins: : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection
E1209 15:58:38.592411      20 groups.go:1142] Error during SubscribeForUpdates for prefix "\x00\x00\vdgraph.cors\x00": error from client.subscribe: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.1.29:7081: connect: connection refused". closer err: <nil>
E1209 15:58:38.596195      20 groups.go:1142] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": error from client.subscribe: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.1.29:7081: connect: connection refused". closer err: <nil>
I1209 15:58:38.600729      20 admin.go:683] Error reading GraphQL schema: : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection.
E1209 15:58:38.618941      20 run.go:772] Error while retrieving cors origins: : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection

chewxy · December 10, 2020, 2:02am

Hi

this looks like a connection error. Can you make sure that all the instances of Alpha and Zero are shut down before you try running them again?

rmshivers42 · December 10, 2020, 5:05pm

Hey chewxy,

Thank you for the reply.

I have made sure all Alpha and Zero instances are shut down before bringing them back up. I have just confirmed on another dataset that after performing a bulk load with v20.11.0-rc1-130-gfab88c093 I am unable to successfully start Alpha and Zero nodes running v20.11-rc3 and they are logging the same errors.

After a little more investigation it seems the problem is actually with Zero. The error logged in zero is:

I1210 16:36:29.715793       1 node.go:189] Setting conf state to nodes:1
2020/12/10 16:36:29 proto: wrong wireType = 0 for field Groups
github.com/dgraph-io/dgraph/x.Check
	/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*node).initAndStartNode
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/raft.go:525
github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:254
github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:75
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:71
main.main
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:102
runtime.main
	/usr/local/go/src/runtime/proc.go:204
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1374

I deleted the zw directory and tried to run v20.11-rc3 again and it was able to successfully start all Dgraph Zero and Alpha instances and the data was available.

I also tried testing this with v20.11.0-rc1-164-g7f16bf14 and observed the same behavior. The nodes were unable to start when provided the zw directory created during the bulk load but worked properly if I deleted the zw directory and allowed Zero to start fresh.

It seems the zw directory produced by bulk loading with earlier v20.11 release candidates is incompatible with the newer release candidates.

chewxy · December 10, 2020, 11:10pm

Looks like we might have updated the protobuf structs in between versions.

@ibrahim thoughts?

ibrahim · December 14, 2020, 9:25am

There was an issue on master which was fixed by

github.com/dgraph-io/dgraph

fix: unmarshal snapshot onto zerosnapshot instead of membershipState

dgraph-io:master ← dgraph-io:naman/fix-zero-snapshot

opened 09:29AM - 13 Dec 20 UTC

NamanJain8

+9 -9

We were unmarshalling into incorrect type (`MembershipState`) the snapshot which… contained data marshalled from `ZeroSnapshot`. This PR fixes that. This change is reviewable at https://reviewable.io/reviews/dgraph-io/dgraph/7125

@rmshivers42 can you please try the latest master from a clean state (delete your p, w, zw, etc. directories)?
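
To illustrate how decoding bytes into the wrong message type can produce an error like `proto: wrong wireType = 0 for field Groups`, here is a minimal, standard-library-only Go sketch. It is not Dgraph's code: the two schemas, their field numbers, and the helpers `appendVarintField` and `checkWireTypes` are assumptions made for illustration. The sketch assumes field 2 is a varint in the snapshot message and a length-delimited field named Groups in the membership message.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Protobuf wire types used in this sketch.
const (
	wireVarint = 0 // integers such as uint64
	wireBytes  = 2 // length-delimited: strings, nested messages, map entries
)

type field struct {
	name     string
	wireType int
}

// Hypothetical schemas; the field numbers are assumed for illustration only.
var snapshotSchema = map[int]field{1: {"Index", wireVarint}, 2: {"CheckpointTs", wireVarint}}
var membershipSchema = map[int]field{1: {"Counter", wireVarint}, 2: {"Groups", wireBytes}}

// appendVarintField appends a tag (fieldNum<<3 | wire type) and a varint value.
func appendVarintField(buf []byte, fieldNum int, v uint64) []byte {
	tmp := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(tmp, uint64(fieldNum)<<3|wireVarint)
	buf = append(buf, tmp[:n]...)
	n = binary.PutUvarint(tmp, v)
	return append(buf, tmp[:n]...)
}

// checkWireTypes walks the encoded fields and fails when a known field
// arrives with a wire type the receiving schema does not expect.
func checkWireTypes(data []byte, schema map[int]field) error {
	for len(data) > 0 {
		tag, n := binary.Uvarint(data)
		if n <= 0 {
			return fmt.Errorf("proto: bad tag")
		}
		data = data[n:]
		num, wt := int(tag>>3), int(tag&7)
		if f, ok := schema[num]; ok && f.wireType != wt {
			return fmt.Errorf("proto: wrong wireType = %d for field %s", wt, f.name)
		}
		// Skip the value so the next tag can be read.
		v, n := binary.Uvarint(data)
		if n <= 0 {
			return fmt.Errorf("proto: bad value")
		}
		data = data[n:]
		if wt == wireBytes {
			if v > uint64(len(data)) {
				return fmt.Errorf("proto: truncated field")
			}
			data = data[v:]
		} else if wt != wireVarint {
			return fmt.Errorf("proto: unsupported wireType %d", wt)
		}
	}
	return nil
}

func main() {
	// Bytes written as a snapshot-shaped message: two varint fields.
	data := appendVarintField(nil, 1, 10)
	data = appendVarintField(data, 2, 20)

	fmt.Println(checkWireTypes(data, snapshotSchema))   // <nil>
	fmt.Println(checkWireTypes(data, membershipSchema)) // proto: wrong wireType = 0 for field Groups
}
```

The same bytes are valid against the schema that wrote them and fail against the other one, which is consistent with Zero failing only when it reads a zw directory whose snapshot was written as one type and read as another.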
