I was getting nervous with one peer of the group down, so I put the bulk-loaded p directory for that group back in place of the corrupted p directory, and the leader sent a snapshot to that peer - I think that's ok…
This does not solve the real issue here, but it may have helped my cluster get back to normal. Or I have messed everything up and don't know it.
edit: ok, that was a horrible idea - that server came back and looked healthy, but it was missing data. I think it was missing all txns from the bulk load up until the corruption, meaning the snapshot only covered the period from the corruption to current.
@mrjn thanks for the response -
I have a hunch it has to do with massive insert frequency, possibly the number of transactions completed in a very short window - it has happened to me twice in the past week, both times directly after a complete system rebuild.
During a rebuild, I have to turn off our ingestion, which creates a backlog. When the system is completely rebuilt hours later, the backlog is possibly millions of records and gets inserted very fast - that is my only hint at a cause. I don't know if it will help, but I have not rebuilt my system since the last corruption and I still have the corrupted p directory, so I could offer you the MANIFEST file or others if it would help any. I plan to rebuild tonight to fix this corruption.
Is there any way to recover a corrupted peer? The only idea I had was to remove the peer by ID and re-add it with no state and a different ID - but that is really extreme when running in k8s, where the ID comes from the statefulset pod ordinal, and it would become unsupportable quickly.
Is it possible for you to try this workload out on Dgraph Cloud? That way we can monitor the backend and see if we encounter this issue. We can probably give you a trial instance for a week to help replicate this issue. CC: @dmai
@iluminae We can set you up with a trial instance if you’d like.
You can call /removeNode following these K8s-specific steps. Alphas can get an auto-assigned Raft ID, so it isn't baked into the statefulset ordinal.
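For reference, a minimal sketch of what that call looks like against Zero's HTTP endpoint (the port, node ID, and group below are placeholders for illustration):

```sh
# Ask Zero to remove the dead Alpha from its group.
# Run against Zero's HTTP port (6080 by default); replace id and group
# with the Raft ID and group number of the corrupted peer.
curl "http://localhost:6080/removeNode?id=3&group=1"
```

After that, the pod's data directories can be wiped and the Alpha allowed to rejoin, at which point it should pick up a fresh Raft ID and receive a snapshot from the group leader.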
oh man @dmai that is exactly what I was looking for. I don't know where I got the idea that the ordinal was the Raft ID… maybe that was a long time ago (or a different db, idk). Having the procedure is really what I needed here. I will do this instead of rebuilding my whole cluster tonight.
@mrjn I would totally be game to fork my prod data and insert it into Dgraph Cloud as well, but I have to figure out if that is kosher since it is my customer's data. (maybe I can build a value scrambler into the insertion pipeline and have that be ok)
(I marked the comment above about the procedure for rebuilding a broken node in k8s, as that was my immediate concern)
@dmai I would love to help with reproducing this on your side, but I have to figure out if it's ok to ship out that kind of metadata or if I will have to obscure it. I will PM you when I figure that out.
Yeah, you should instead set up a dedicated instance for @korjavin2, since he has reproduced it more than I have and I am not positive I can send all of the metadata as is.
This just happened to one of my alphas again, this time on a leader, which seems to have been far worse, as it blocked queries until I killed the node and one of the others in the group took over as leader.
Since @korjavin2 and @iluminae have repeatedly seen the issue, it might be useful to add more verbose logging in Badger compactions to figure out why/when this particular table was deleted.
Also, if this happens again, please save the logs of the Alpha that's acting weird and collect the output of badger info --dir /path/to/pdir (note: you might need to set readonly=false in the info command if the Alpha didn't shut down properly).
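Something along these lines (the exact read-only flag name may differ between Badger versions; check `badger info --help` for yours):

```sh
# Dump table/manifest info for the corrupted p directory.
badger info --dir /path/to/pdir

# If the Alpha didn't shut down cleanly, the directory may need to be
# opened in read-write mode (flag name assumed here).
badger info --dir /path/to/pdir --read-only=false
```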
So, the cluster ran for a week in the cloud. We didn't see this particular "file does not exist" issue. In fact, we just ran dgraph increment on the cluster, and it's working fine.
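For anyone wanting to run the same sanity check, the increment tool repeatedly bumps a counter predicate through the cluster. A rough sketch of the invocation (flag names are from memory and may vary by Dgraph version; verify with `dgraph increment --help`):

```sh
# Run 100 sequential increments of a counter predicate against one Alpha.
# --alpha and --num are assumed flag names for the target address and count.
dgraph increment --alpha localhost:9080 --num 100
```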
We did happen to see a posting list rollup issue, which we had seen elsewhere too. We will debug that this week.