Error: file with ID: X not found

Report a Dgraph Bug

What version of Dgraph are you using?

v20.11.0-rc2-5-g2188e742c

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

Docker container image running on GKE (k8s), 8 cores / 20 GB RAM

Steps to reproduce the issue (command/config used to run Dgraph).

Running the helm chart with no modifications to it. This is possibly a product of running under extreme load - billions of messages being inserted.

Expected behaviour and actual result.

I am getting the following error during tablet movement:

Groups sorted by size: [{gid:5 size:0} {gid:6 size:0} {gid:1 size:15501091218} {gid:4 size:19813233732} {gid:3 size:25540956108} {gid:2 size:93812460468}]

I1202 16:24:55.195162      18 tablet.go:213] size_diff 93812460468
I1202 16:24:55.205029      18 tablet.go:108] Going to move predicate: [bofa-000.name], size: [14 GB] from group 2 to 5
I1202 16:24:55.205174      18 tablet.go:135] Starting move: predicate:"bofa-000.name" source_gid:2 dest_gid:5 txn_ts:6762163
E1202 16:25:03.804186      18 tablet.go:70] while calling MovePredicate: rpc error: code = Unknown desc = file with ID: 12 not found

This move is retried every 8m, and when it fails the balancer never tries to move another tablet, so I have a wildly unbalanced system. I have moved a few tablets manually to try to balance it out. Looking at the logs of one of the group 2 Alphas, I can see sporadic messages like:

E1202 16:32:55.257031      18 log.go:32] Unable to read: Key: [0 0 13 98 111 102 97 45 48 48 48 46 110 97 109 101 2 10 97 115 49], Version : 2395205, meta: 70, userMeta: 8 Error: file with ID: 12 not found
E1202 16:32:55.340868      18 log.go:32] Unable to read: Key: [0 0 13 98 111 102 97 45 48 48 48 46 110 97 109 101 2 10 44 108 118], Version : 2400128, meta: 70, userMeta: 8 Error: file with ID: 12 not found
E1202 16:32:55.394870      18 log.go:32] Unable to read: Key: [0 0 13 98 111 102 97 45 48 48 48 46 110 97 109 101 2 10 57 46 49], Version : 2400059, meta: 70, userMeta: 8 Error: file with ID: 12 not found

which seems to line up with the move failure. Is there anything I can do to recover from this? I am running scale tests now, but I am not sure how I would fix this in production.
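For what it's worth, the raw keys in those errors decode to the same predicate that was being moved. A minimal sketch (Go, purely illustrative; it assumes the key layout is a one-byte prefix followed by a two-byte big-endian length and then the predicate name):

```go
package main

import "fmt"

func main() {
	// One of the raw keys from the "Unable to read" errors above.
	key := []byte{0, 0, 13, 98, 111, 102, 97, 45, 48, 48, 48, 46, 110, 97, 109, 101, 2, 10, 97, 115, 49}

	// Assumption: byte 0 is a key prefix and bytes 1-2 are the big-endian
	// length of the predicate name, so the name starts at offset 3.
	nameLen := int(key[1])<<8 | int(key[2])
	fmt.Printf("predicate: %q\n", key[3:3+nameLen])
	// Prints: predicate: "bofa-000.name"
}
```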

Edit: also note that the following query does work (with the affected predicate):

{
  q(func: has(<bofa-000.name>), first: 10000) {
    <bofa-000.name>
  }
}

I noticed that some of the pods have restarted as a result of an OOMKill. Is that possibly what could cause this? And (more importantly), how would one recover from it? Can I simply remove this node's state and re-add it to the group, or something along those lines?
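For reference, the kind of recovery I have in mind is only a rough sketch (it assumes Zero's standard /removeNode admin endpoint on its HTTP port, and placeholder values for the address, group, and Raft ID; untested):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// Hypothetical values for illustration only: Zero's HTTP address, plus the
// Raft ID and group of the Alpha whose state I would wipe and re-add.
const (
	zeroAddr = "http://dgraph-zero-0.dgraph-zero:6080"
	raftID   = 7
	groupID  = 2
)

func main() {
	// Ask Zero to drop the bad replica from its group; a fresh Alpha with an
	// empty p/w directory would then be started in its place and left to
	// re-replicate the group's data from the healthy members.
	url := fmt.Sprintf("%s/removeNode?group=%d&id=%d", zeroAddr, groupID, raftID)
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```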

Hey @iluminae, do you have a way to reproduce this? We've seen this issue before as well, but we weren't able to reproduce it.

@ajeet is looking into this issue.

I am really not sure - this is the third cluster I have found it on myself (all within the last couple of releases). OOMKills are rather prevalent on our systems, with billions of messages being inserted (which is another problem I would like to solve). The only thing I can think of is that an OOMKill does not allow badger to close cleanly, or something along those lines. As I said above, this is on GKE. I have seen it on systems with one group and on systems with multiple groups. All of the systems had 3 shard replicas per group (in case it has to do with tablet replication).

Thanks @iluminae. We're actively looking into this and will keep this thread updated with our findings. Please do let us know if you manage to reproduce it reliably.


Hey there. We’re seeing this issue sporadically across instances that are running badger v3. Was there ever any fix for this? Essentially, the database becomes corrupted and outputs this error when we try to read from it to debug. We can provide more information if needed, plus a badgerdb directory that reliably produces this error when read.
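If it helps, this is roughly how we read the directory to surface the error: a minimal sketch against badger v3's public API (the path is a placeholder for a copy of the corrupted directory); iterating and touching each value is enough to hit "file with ID: X not found" for us:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Placeholder path: point this at (a copy of) the corrupted badger directory.
	db, err := badger.Open(badger.DefaultOptions("./p"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			item := it.Item()
			// Reading the value forces badger to resolve it from disk; this is
			// where the "file with ID: X not found" error shows up for us.
			if err := item.Value(func(_ []byte) error { return nil }); err != nil {
				log.Printf("key %x: %v", item.Key(), err)
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```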

This can also cause over-logging to the point of pegging a server’s CPU, as I found in Log statement spamming, causing unresponsive system.