Report a Dgraph Bug
What version of Dgraph are you using?
v20.11.0-rc2-5-g2188e742c
Have you tried reproducing the issue with the latest release?
yes
What is the hardware spec (RAM, OS)?
Docker container image running on GKE (Kubernetes), 8 cores, 20 GB RAM
Steps to reproduce the issue (command/config used to run Dgraph).
Running the Helm chart with no modifications. This is possibly a result of running under extreme load (billions of messages).
Expected behaviour and actual result.
I am getting an error when Zero tries to move a tablet:
Groups sorted by size: [{gid:5 size:0} {gid:6 size:0} {gid:1 size:15501091218} {gid:4 size:19813233732} {gid:3 size:25540956108} {gid:2 size:93812460468}]
I1202 16:24:55.195162 18 tablet.go:213] size_diff 93812460468
I1202 16:24:55.205029 18 tablet.go:108] Going to move predicate: [bofa-000.name], size: [14 GB] from group 2 to 5
I1202 16:24:55.205174 18 tablet.go:135] Starting move: predicate:"bofa-000.name" source_gid:2 dest_gid:5 txn_ts:6762163
E1202 16:25:03.804186 18 tablet.go:70] while calling MovePredicate: rpc error: code = Unknown desc = file with ID: 12 not found
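As a sanity check on the numbers above, here is a minimal sketch that parses the "Groups sorted by size" line and recomputes the difference. It assumes size_diff is simply the largest group size minus the smallest, which is an inference from the logged value rather than something confirmed from the tablet.go source. The variable names are illustrative only.

```python
import re

# Log line copied from the Zero output above.
line = ("Groups sorted by size: [{gid:5 size:0} {gid:6 size:0} "
        "{gid:1 size:15501091218} {gid:4 size:19813233732} "
        "{gid:3 size:25540956108} {gid:2 size:93812460468}]")

# Extract (gid, size) pairs from the "{gid:N size:M}" entries.
groups = [(int(gid), int(size))
          for gid, size in re.findall(r"\{gid:(\d+) size:(\d+)\}", line)]

# Assumption: the rebalancer compares the largest group with the smallest.
smallest = min(groups, key=lambda g: g[1])
largest = max(groups, key=lambda g: g[1])
size_diff = largest[1] - smallest[1]

print("smallest group:", smallest)
print("largest group:", largest)
print("size_diff:", size_diff)
```

Under that assumption the result equals the logged size_diff, and the largest and smallest groups are 2 and 5, the same source and destination as the attempted move.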
This move is retried every 8 minutes, and when it fails it never tries to move a different tablet, so I have a wildly unbalanced system. I have moved a few tablets manually to try to balance it out. Looking at the logs of one of the group 2 nodes, I can see sporadic messages like:
E1202 16:32:55.257031 18 log.go:32] Unable to read: Key: [0 0 13 98 111 102 97 45 48 48 48 46 110 97 109 101 2 10 97 115 49], Version : 2395205, meta: 70, userMeta: 8 Error: file with ID: 12 not found
E1202 16:32:55.340868 18 log.go:32] Unable to read: Key: [0 0 13 98 111 102 97 45 48 48 48 46 110 97 109 101 2 10 44 108 118], Version : 2400128, meta: 70, userMeta: 8 Error: file with ID: 12 not found
E1202 16:32:55.394870 18 log.go:32] Unable to read: Key: [0 0 13 98 111 102 97 45 48 48 48 46 110 97 109 101 2 10 57 46 49], Version : 2400059, meta: 70, userMeta: 8 Error: file with ID: 12 not found
which seems to line up with the move failure. Is there anything I can do to recover from this? I am running scale tests now, but I am not sure how I would fix this in production.
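To check that the "Unable to read" keys above really belong to the predicate being moved, here is a minimal sketch that decodes the key bytes. It assumes the key layout is one prefix byte, then a 2-byte big-endian length, then the predicate name. That layout is my assumption from looking at the bytes, not something I have confirmed against the Dgraph source. The function name is hypothetical and the trailing bytes are left undecoded.

```python
import struct

def predicate_from_key(key_text):
    """Split a logged key such as "[0 0 13 98 ...]" into (predicate, remaining bytes).

    Assumed layout (not confirmed): 1 prefix byte, 2-byte big-endian length,
    then the predicate name. Whatever follows is returned as-is.
    """
    raw = bytes(int(tok) for tok in key_text.strip("[] ").split())
    (length,) = struct.unpack(">H", raw[1:3])
    name = raw[3:3 + length].decode("utf-8")
    return name, raw[3 + length:]

# Key copied from the first "Unable to read" line above.
key = "[0 0 13 98 111 102 97 45 48 48 48 46 110 97 109 101 2 10 97 115 49]"
name, rest = predicate_from_key(key)
print(name)
print(list(rest))
```

With that layout, all three logged keys decode to bofa-000.name, the same predicate named in the failed move.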
Edit: also note that the following query does work (with the affected predicate):
{
  q(func: has(<bofa-000.name>), first: 10000) {
    <bofa-000.name>
  }
}