Rollup - Error in readListPart()

Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version
$ dgraph version
 
[Decoder]: Using assembly version of decoder
Page Size: 4096

Dgraph version   : v20.11.0
Dgraph codename  : tchalla
Dgraph SHA-256   : 8acb886b24556691d7d74929817a4ac7d9db76bb8b77de00f44650931a16b6ac
Commit SHA-1     : c4245ad55
Commit timestamp : 2020-12-16 15:55:40 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

Have you tried reproducing the issue with the latest release?

no

What is the hardware spec (RAM, OS)?

Kubernetes (GKE), 3 alphas in 1 group: 10 cores, 10GB RAM each, 512Gi SSDs.

Steps to reproduce the issue (command/config used to run Dgraph).

Normal helm install from the chart.
While troubleshooting slow inserts, I am seeing errors in the logs of all 3 alphas:

W0316 05:37:45.883308      21 mvcc.go:132] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 00000c6a622d68756e742e6e616d65020a633161: could not read list part with key 04000c6a622d68756e742e6e616d65020a6331610000000000000001: Key not found rolling up key [0 0 12 106 98 45 104 117 110 116 46 110 97 109 101 2 10 99 49 97]
W0316 05:37:48.406653      21 mvcc.go:132] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 00000c6a622d68756e742e6e616d65020a633239: could not read list part with key 04000c6a622d68756e742e6e616d65020a6332390000000000000001: Key not found rolling up key [0 0 12 106 98 45 104 117 110 116 46 110 97 109 101 2 10 99 50 57]

(I have included a couple of lines here to show that many different keys are affected.) There are ~1800 of these log entries in 2 hours.
I have traced this through the v20.11.0 code and arrived at this function; it appears to be a BadgerDB error returned from txn.Get(). The cluster is 84 days old.
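
For reference, the hex keys in those warnings are plain byte strings that embed the predicate and the index term, and the trailing "Key not found" is Badger's badger.ErrKeyNotFound surfacing through the rollup path. Here is a small standalone sketch that decodes the two keys from the first warning and reproduces that error value against an empty in-memory Badger store. It is purely illustrative: it is not Dgraph's readListPart code, and my reading of the 8-byte suffix as the start UID of the missing list part is an assumption.

package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Keys copied from the first warning above.
	baseKey, _ := hex.DecodeString("00000c6a622d68756e742e6e616d65020a633161")
	partKey, _ := hex.DecodeString("04000c6a622d68756e742e6e616d65020a6331610000000000000001")

	// Both keys embed the predicate ("jb-hunt.name") and the index term ("c1a").
	// The part key differs in its first byte and an 8-byte big-endian suffix
	// (here 0x...01), which I read as the start UID of the missing list part.
	fmt.Printf("base key: %q\n", baseKey)
	fmt.Printf("part key: %q\n", partKey)

	// The trailing "Key not found" is Badger's error for a Get on an absent key.
	// Demonstrated against an empty in-memory store, purely to show the error
	// value; this is not how Dgraph opens its p directory.
	db, err := badger.Open(badger.DefaultOptions("").WithInMemory(true))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(txn *badger.Txn) error {
		_, getErr := txn.Get(partKey)
		return getErr
	})
	if errors.Is(err, badger.ErrKeyNotFound) {
		fmt.Println("txn.Get(partKey):", err)
	}
}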

I am wondering if this means whatever is at that key is corrupt. I have seen this before on my development system and ended up just wiping it and trying again. This is in our production system for a live application, so any help would be appreciated.

I also currently have a few of these:

E0316 06:28:51.040291      19 log.go:32] Unable to read: Key: [0 0 12 106 98 45 104 117 110 116 46 110 97 109 101 2 10 45 112 111], Version : 90863807, meta: 70, userMeta: 8 Error: file with ID: 1966 not found

which is an error I reported with this issue in December of last year on the RC of v20.11.

I am seeing the Error failed when calling List.rollup on a cluster I spun up 2 weeks ago as well. Given where this error comes from (reading the posting list), it would surprise me if it were benign.

Anyone else see this on their clusters?

@iluminae are you still having this issue? I have no idea what it is, but if you still have it, I can ping someone.

Yep, ~4100 times in 8h on one of my alphas, 51 times on another (over 30m, actually), and 0 on the third (in the non-rotated logs, anyway).

Note this is the same cluster as in this issue, with one alpha completely behind on the Raft applied index. The 4100 occurrences are not on the alpha that is far behind on Raft.

Hi @iluminae,
Not sure if you are still having this issue, but I was also running into it. In my situation, the (corrupt?) keys narrowed down to certain types that had a reverse predicate pointing to them. I dropped the reverse edge index and recreated/reindexed it. Not an ideal solution, as I have no idea what caused it in the first place, but it got everything running as it should again.
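
In case it helps: by dropping and recreating the reverse edge I mean altering the predicate's schema without @reverse and then adding @reverse back, which rebuilds the reverse index. A rough sketch with dgo, assuming a hypothetical predicate linkedTo and an alpha at localhost:9080 (both placeholders, not my actual schema):

package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Placeholder address and predicate name; adjust for your own cluster/schema.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	ctx := context.Background()

	// 1. Redeclare the predicate without @reverse: this drops the reverse index.
	if err := dg.Alter(ctx, &api.Operation{Schema: `linkedTo: [uid] .`}); err != nil {
		log.Fatal(err)
	}

	// 2. Redeclare it with @reverse: this rebuilds the reverse index from scratch.
	if err := dg.Alter(ctx, &api.Operation{Schema: `linkedTo: [uid] @reverse .`}); err != nil {
		log.Fatal(err)
	}
}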

Oh that is interesting - are you on v20.11? I have since upgraded our systems to v21.03 and have not seen this on that version.

I am also on v21.03. That is interesting though, I guess I wasn’t quite as lucky with my upgrade. :man_shrugging:

Edited:

Hmm, looks like the same issue popped back up. I can do the same “trick” again and it seems to temporarily fix the issue.
If helpful:

What version of Dgraph are you using?

Dgraph codename  : rocket
Dgraph SHA-256   : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1     : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch           : HEAD
Go version       : go1.16.2
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud   , visit https://dgraph.io/cloud.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2021 Dgraph Labs, Inc.

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"

For development, I am just running one zero and one alpha on the same computer with 32 cores, 64GB ram, and 500 GB nvme.

Steps to reproduce the issue (command/config used to run Dgraph).

Still figuring this out… It seems to be related to certain indexes getting heavy reads/writes.

@balazshoranyi I actually just started seeing this again on my current cluster, exactly as originally posted.

Really, I do not even know if it is causing a devastating issue or not! It would be very difficult for me to identify if a few of my billions of values were missing or something.

Current cluster specs:

Dgraph version   : v21.03.1
Dgraph codename  : rocket-1
Dgraph SHA-256   : a00b73d583a720aa787171e43b4cb4dbbf75b38e522f66c9943ab2f0263007fe
Commit SHA-1     : ea1cb5f35
Commit timestamp : 2021-06-17 20:38:11 +0530
Branch           : HEAD
Go version       : go1.16.2
jemalloc enabled : true

3 x zero (4c, 10GiB RAM each), shardReplicas=3
12 x alpha (16c, 32GiB RAM, 2TiB disk each)

@iluminae How long has your cluster been running? What is the size of the p directory?

The cluster was created 15d ago. Here are the p directory sizes of each peer:

323G   
298G
252G    
41G     
41G     
42G     
33G     
33G     
33G     
75G     
74G     
75G     

(You can see I have one group that is hit extremely often with updates, but its directories compact down much further than that.)

Did you find a real workaround for this? This is happening to me too. Is there a fix?

I do not even know what the real effect is, or if it is even negatively affecting us at all; I never got an answer on that.

In my case it meant queries were returning errors instead of the data that I actually requested with a regular expression…
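
For context, the failing queries were regexp lookups of roughly this shape. The sketch below uses dgo with a hypothetical predicate name carrying a trigram index (which regexp() requires); it is only an illustration, not my actual schema or query:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure()) // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Hypothetical predicate "name" with @index(trigram), which regexp() needs.
	const q = `{
	  q(func: regexp(name, /^c1.*/)) {
	    uid
	    name
	  }
	}`

	resp, err := dg.NewReadOnlyTxn().Query(context.Background(), q)
	if err != nil {
		// This is where the query came back with an error instead of results
		// while the rollup/readListPart problem was present.
		log.Fatal(err)
	}
	fmt.Println(string(resp.Json))
}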

For me there were no errors, just logs. No idea if it was affecting the result set or not. But I have rebuilt my cluster maybe 15 times since this issue was filed.