Unexpected meta: 0 - crash loop

My main dgraph node is in a crash loop with the following log output:

++ hostname -f
+ dgraph server --my=dgraph-0.dgraph.default.svc.cluster.local:7080 --lru_mb 5000 --zero dgraph-0.dgraph.default.svc.cluster.local:5080
2018/05/10 18:41:55 groups.go:88: Current Raft Id: 1
2018/05/10 18:41:55 gRPC server started.  Listening on port 9080
2018/05/10 18:41:55 HTTP server started.  Listening on port 8080
2018/05/10 18:41:55 worker.go:99: Worker listening at address: [::]:7080
2018/05/10 18:41:55 pool.go:108: == CONNECT ==> Setting dgraph-0.dgraph.default.svc.cluster.local:5080
2018/05/10 18:41:55 groups.go:115: Connected to group zero. Assigned group: 0
2018/05/10 18:41:55 pool.go:108: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:7080
2018/05/10 18:41:55 pool.go:108: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:7080
2018/05/10 18:41:55 pool.go:108: == CONNECT ==> Setting dgraph-2.dgraph.default.svc.cluster.local:5080
2018/05/10 18:41:55 pool.go:108: == CONNECT ==> Setting dgraph-1.dgraph.default.svc.cluster.local:5080
2018/05/10 18:41:55 draft.go:180: Node ID: 1 with GroupID: 1
2018/05/10 18:41:55 node.go:213: Found Snapshot, Metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:104 Term:424 XXX_unrecognized:[]}
2018/05/10 18:41:55 node.go:228: Found hardstate: {Term:28861 Vote:2 Commit:1962 XXX_unrecognized:[]}
2018/05/10 18:41:55 node.go:240: Group 1 found 1858 entries
2018/05/10 18:41:55 draft.go:936: Restarting node for group: 1
2018/05/10 18:41:55 raft.go:567: INFO: 1 became follower at term 28861
2018/05/10 18:41:55 raft.go:315: INFO: newRaft 1 [peers: [1,2,3], term: 28861, commit: 1962, applied: 104, lastindex: 1962, lastterm: 28861]
2018/05/10 18:41:56 unexpected meta: 0
github.com/dgraph-io/dgraph/x.Fatalf
	/home/travis/gopath/src/github.com/dgraph-io/dgraph/x/error.go:100
github.com/dgraph-io/dgraph/posting.ReadPostingList
	/home/travis/gopath/src/github.com/dgraph-io/dgraph/posting/mvcc.go:423
github.com/dgraph-io/dgraph/posting.getNew
	/home/travis/gopath/src/github.com/dgraph-io/dgraph/posting/mvcc.go:463
github.com/dgraph-io/dgraph/posting.Get
	/home/travis/gopath/src/github.com/dgraph-io/dgraph/posting/lists.go:243
github.com/dgraph-io/dgraph/worker.runMutation
	/home/travis/gopath/src/github.com/dgraph-io/dgraph/worker/mutation.go:79
github.com/dgraph-io/dgraph/worker.(*node).processMutation
	/home/travis/gopath/src/github.com/dgraph-io/dgraph/worker/draft.go:342
github.com/dgraph-io/dgraph/worker.(*scheduler).processTasks
	/home/travis/gopath/src/github.com/dgraph-io/dgraph/worker/scheduler.go:65
runtime.goexit
	/home/travis/.gimme/versions/go1.9.4.linux.amd64/src/runtime/asm_amd64.s:2337

I’m wondering if this issue has resurfaced. I am running 1.0.5.

I found this GitHub issue, https://github.com/dgraph-io/dgraph/issues/2102, which is closed, and also the thread "Dgraph runs into a error loop and freezes the host", which is closed as well.

I have not tried downgrading the cluster yet. Previous discussions suggested this is related to a schema query, but I'm not sure how I would figure out which part of the schema is causing it.
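One way to narrow that down might be to bisect the schema: apply it one predicate at a time through the /alter endpoint and see which line triggers the crash. A minimal sketch, assuming the default HTTP port 8080 from the logs (the file name and predicate are placeholders):

```shell
#!/bin/sh
# apply_schema: send a schema file to Dgraph one non-empty line (predicate)
# at a time, stopping at the first one the server rejects or crashes on.
# The endpoint address is a placeholder; adjust for your cluster.
apply_schema() {
  file=$1
  while IFS= read -r line; do
    [ -z "$line" ] && continue          # skip blank lines
    echo "applying: $line"
    curl -s localhost:8080/alter -d "$line" > /dev/null || {
      echo "failed on: $line" >&2
      return 1
    }
  done < "$file"
}
# Example (file name is a placeholder):
# apply_schema my-schema.txt
```

If the server crashes partway through, the last "applying:" line printed points at the suspect predicate.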

Update: Restarting zero on the dgraph-0 node brings the cluster back up.
But it seems one of my applications is triggering the issue, so as soon as I run that app dgraph goes down again. I’ll try to find the root issue.

Update 2: It turns out that even simply loading the schema through Ratel makes it crash.

All I did was update the Docker image from 1.0.3 to 1.0.5. I'm thinking I'm going to have to wipe all the data and rebuild Dgraph in order to keep working.

Good thing we’re not in production yet.

Before anything else, are you able to export?

https://docs.dgraph.io/deploy/#export-database
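In 1.0.x that export is just an HTTP call to a server node, as described in the docs linked above. A minimal sketch (host and port are placeholders; the exported RDF lands in the server's export directory, not on the machine running curl):

```shell
#!/bin/sh
# export_dgraph: ask a Dgraph server to export its data by hitting the
# /admin/export endpoint on its HTTP port (8080 in the logs above).
export_dgraph() {
  addr=${1:-localhost:8080}   # placeholder address; adjust for your cluster
  curl -s "http://${addr}/admin/export"
}
# Example:
# export_dgraph dgraph-0.dgraph.default.svc.cluster.local:8080
```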

You can also try an instance from scratch (it may be a config issue) as a test.

If so, export the data, create a new Dgraph "stack", recheck your configs, and reload the data with https://docs.dgraph.io/deploy/#bulk-loader
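A sketch of that reload step, using flag names from the v1.0.x bulk loader (confirm with `dgraph bulk --help`; the file names and Zero address below are placeholders):

```shell
#!/bin/sh
# bulk_load: rebuild a fresh cluster's data from an export with the bulk
# loader. Run against a running Zero, before starting any servers, then
# copy the generated posting directory into the new server's data dir.
bulk_load() {
  dgraph bulk -r "$1" -s "$2" --zero="${3:-localhost:5080}"
}
# Example (file names are illustrative):
# bulk_load export/dgraph-r1.rdf.gz export/dgraph-schema.gz localhost:5080
```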

If the error still persists, it would help if you described step by step what you did and provided more details, e.g. https://docs.dgraph.io/howto/#retrieving-debug-information

No, exporting wasn't possible because the node was in a crash loop.

Since then, as we're getting closer to production use, I've implemented an automatic export and backup routine.
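For anyone else doing this, a minimal sketch of such a routine, assuming the /admin/export endpoint and that the server writes into an `export` directory under its working dir (addresses and paths are placeholders):

```shell
#!/bin/sh
# backup_dgraph: trigger an export, then archive the server's export
# directory into a timestamped tarball. Meant to run on the server node,
# e.g. from cron.
backup_dgraph() {
  addr=${1:-localhost:8080}   # server HTTP address (placeholder)
  dest=${2:-/backups}         # backup directory (placeholder)
  stamp=$(date +%Y%m%d-%H%M%S)
  # Ask the server to export; the RDF is written on the server side.
  curl -s "http://${addr}/admin/export" > /dev/null || return 1
  tar -czf "${dest}/dgraph-export-${stamp}.tar.gz" export
  echo "${dest}/dgraph-export-${stamp}.tar.gz"
}
# Example cron entry (script path is a placeholder):
# 0 3 * * * /usr/local/bin/backup_dgraph.sh
```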

Since I’ve completely wiped and reinstalled the cluster things have been working fine. I have not changed any of the code or the schema.

So I'm going to assume that something went really wrong when switching the Docker image from 1.0.3 straight to 1.0.5. Perhaps I should have gone to 1.0.4 first.


I think for upgrade cases you could plan a better way out. For example, treat the old stack as one to be abandoned: start a new Dgraph stack on the latest version, connect it to the existing cluster, wait for the new stack to finish syncing, and then kill the old one.

This avoids a number of problems. This is just my own suggestion; I haven't checked whether Dgraph officially recommends it. But it seems a plausible approach, since Dgraph can work that way and was designed for it.

What do you think?

Currently I'm using a 3-node Kubernetes cluster. If I update the Docker image version, it first updates node 3; once that node has rebooted, it waits a couple of minutes, then does node 2, and so on. This way there is no downtime, and it doesn't require additional resources to migrate to a new cluster.
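For context, with a StatefulSet whose updateStrategy is RollingUpdate, bumping the image produces exactly that one-pod-at-a-time rollout. A sketch (the StatefulSet and container names are placeholders):

```shell
#!/bin/sh
# upgrade_dgraph: bump the image on a StatefulSet named "dgraph" and wait
# for the rolling update to finish, highest ordinal pod first.
upgrade_dgraph() {
  kubectl set image statefulset/dgraph dgraph="dgraph/dgraph:${1:?version required}"
  kubectl rollout status statefulset/dgraph
}
# Example:
# upgrade_dgraph v1.0.5
```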

But I see what you're saying; perhaps it could be done automatically using a Helm chart. It would require a bit of testing.

Interesting. Could you run a test for me? I created a chart for Rancher, but I haven't had time to test it on plain Kubernetes. Could you tell me whether this chart would work there? https://github.com/MichelDiz/dgraph-rancher-catalog-2.0

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.