Starting dgraph results in fatal badger error (file does not exist for table)

What version of Dgraph are you using?

Dgraph Version
Dgraph version   : v21.03.0
Dgraph codename  : rocket
Dgraph SHA-256   : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1     : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch           : HEAD
Go version       : go1.16.2
jemalloc enabled : true

What is the hardware spec (RAM, OS)?

32cpu 32GiB ram x12 alphas (4 groups, 3 replicas) (384cpu, 384GiB total) running it in k8s on GKE.

One of my nodes was stuck in a loop of not doing anything (applied raft log did not change) and was continually printing this message:

W0611 20:57:32.266066      22 log.go:36] While running doCompact: MANIFEST removes non-existing table 507614

So I restarted it (just sent it one signal, did not force it down by killing it or anything), and now the node will not start, with this panic:

E0611 20:58:31.026625      21 log.go:32] Received err: file does not exist for table 506439. Cleaning up...
[Sentry] 2021/06/11 20:58:31 Sending fatal event [dacea61cdeb44f538025ace4158c311f] to o318308.ingest.sentry.io project: 1805390
2021/06/11 20:58:31 file does not exist for table 506439
Error while creating badger KV posting store
github.com/dgraph-io/dgraph/x.Checkf
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:51
github.com/dgraph-io/dgraph/worker.(*ServerState).initStorage
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:145
github.com/dgraph-io/dgraph/worker.InitServerState
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:80
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:760
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.init.2.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:93
github.com/spf13/cobra.(*Command).execute
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:78
main.main
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:99
runtime.main
        /usr/local/go/src/runtime/proc.go:225
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1371

How can I recover this server?

1 Like

Here is the raft applied index for that group for more flavor:

rebuilt my system completely on friday, eventually a different node exhibited the same issue in the entirely new system.

posted on this thread that this is a duplicate of that.

bumping here for possible visibility.

1 Like

I was getting nervous with one peer of the group down so I put the result of the bulkload p directory for that group back in the place of the corrupted p directory, and the leader sent a snapshot to that peer - I think thats ok…

This does not solve the real issue here but maybe helped my cluster become normal. Or, I have messed everything up and do not know it.

edit: ok that was a horrible idea, that server was then missing data after it came back and looked healthy, I think it was missing all txns from bulkload->when it corrupted. Meaning the snapshot only applied after the corruption to current.

bump @MichelDiz @ibrahim (i am not allowed to post on the other issue again)

Just looking for acknowledgement and how to help diagnose this.

Hey @iluminae ,

Hard to tell what’s gone wrong here, without debugging. We haven’t seen this issue in Dgraph Cloud. @dmai – have you guys seen this one?

It seems to suggest that there’s a table missing when it should be present. I know this might be hard, but is there a way to replicate this?

@mrjn thanks for a response -
I have a hunch it has to do with massive insert frequency, possibly number of transactions completed in a very short window - it has happened to me twice in the past week and does so directly after a complete system rebuild.

During a rebuild, I have to turn off our ingestion as to create a backlog. When the system is completely rebuilt hours later, the backlog is possibly millions and is inserted very fast - that is my only hint to a cause. I dont know if it will help but I have not rebuilt my system after the last corruption, and I have the corrupted p directory, I could offer you the MANIFEST file or others if it would help any. I plan to rebuild tonight to fix this corruption.

Is there any way to recover a corrupted peer? Only idea I had was to remove the peer by ID and readd him with no state and a different id - but that is really extreme in the case of running in k8s, where the id comes from the statesfulset pod ordinal, and would become unsupportable quick.

Hey @iluminae ,

Is it possible for you to try this workload out on Dgraph Cloud? That ways we can monitor the backend and see if we encounter this issue. We can probably give you a trial instance for a week to help replicate this issue. CC: @dmai

@iluminae We can set you up with a trial instance if you’d like.

You can call /removeNode following these K8s-specific steps. Alphas can get an auto-assigned Raft ID, so it’s not baked as part of the statefulset ordinal.

oh man @dmai that is exactly what I was looking for. I dont know where I was thinking the ordinal was the raft id… maybe that was a long time ago (or a different db, idk). Having the proceedure is really what I needed here. I will do this instead of rebuilding my whole cluster tonight.

@mrjn I would totally be game with forking my prod data to insert to dgraph cloud as well, but I have to figure out if that is kosher since it is my customer’s data. (maybe I can build a value scrambler into the insertion pipeline and have that be ok)

The ordinal thing is still part of the Zero StatefulSet in our example K8s manifests/Helm chart, for what it’s worth.

When you’re ready to load your setup in Dgraph Cloud so we can debug this issue, just let me know and we can set it up for you.

(marked that comment above about the procedure to rebuild a broken node in k8s as that was my immediate concern)

@dmai I would love to help with reproducing on your side, I have to figure out if its ok to ship out that kind of metadata or if I will have to obscure it. I will pm you when I figure that out.

We can reproduce it easily.
In fact we are doing this.
Every 2-3 days we lose all our dgraph data.

It was mentioned already, but again:

Yea you should instead set up a dedicated instance for @korjavin2 since he has reproduced it more than I have and I am not positive if I can send all of the metadata as is.

@korjavin2 – can you DM @dmai to get a dedicated instance for a week? So we can try to reproduce this issue.

Yup, I just did.

It will not that straight forward (Perhaps I need to obfuscate my data)
but it’s definitely doable.

Just happend to one of my alphas again, this time on a leader, which seems to have been far worse as it blocked queries until I killed the node and one of the other ones in the group took over as leader.

Since @korjavin2 and @iluminae have repeatedly seen the issue, it might be useful to add more verbose logging in badger compactions to figure out why/when was this particular table was deleted.

Also, if this happens again, please save the logs of Alpha that’s acting weird and collect the output of badger info --dir /path/to/pdir (note: You might need to set the readonly=false in info command if alpha didn’t shut down properly).

Seems prudent, I’ll try to catch it before logs rotate much.