Did I lose my data forever?

Hi, if you want to, I could jump on a call with you sometime this week. Let me assuage your fears.

I would like to.

Especially because today it seems like something has gone wrong again.

W0517 09:22:39.802045 21 log.go:36] [Compactor: 3] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595418: {span:0xc017b6e200 compactorId:3 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.738816695401244 adjusted:17.06840375491607 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc035801a40] bot:[0xc0066d2fc0] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 2 3 28 240 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 0 0 1 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 2 3 28 240 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 0 0 1 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:3088874 dropPrefixes:[]}
W0517 09:22:39.802141 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2595418
W0517 09:22:39.802421 21 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595417: {span:0xc007421580 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.69793993009846 adjusted:17.026722126693535 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc008917680] bot:[0xc03478a840] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 109 132 24 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 174 228 121 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 174 228 121 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 245 237 240 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2474777 dropPrefixes:[]}
W0517 09:22:39.802507 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2595417
W0517 09:22:39.806230 21 log.go:36] [Compactor: 0] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595355: {span:0xc01f19ab80 compactorId:0 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.78681155080123 adjusted:17.11734362832724 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc051bb6840] bot:[0xc084ed1080 0xc083146480] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 21 21 204 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 126 106 134 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 21 21 204 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 203 128 65 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:3809907 dropPrefixes:[]}
W0517 09:22:39.806289 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2595355
W0517 09:22:39.852077 21 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 2594140: {span:0xc01f19ac80 compactorId:1 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.738899035094615 adjusted:17.068487715876817 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc01ca6b380] bot:[0xc0134595c0] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 227 58 247 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 1 65 2 74 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 1 21 76 73 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 1 44 182 93 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:3090111 dropPrefixes:[]}
W0517 09:22:39.852198 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2594140
W0517 09:22:39.852500 21 log.go:36] [Compactor: 3] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595418: {span:0xc017b6e300 compactorId:3 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.738816695401244 adjusted:17.06840375491607 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc035801a40] bot:[0xc0066d2fc0] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 2 3 28 240 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 0 0 1 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 2 3 28 240 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 0 0 1 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:3088874 dropPrefixes:[]}
W0517 09:22:39.852604 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2595418
W0517 09:22:39.857928 21 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595417: {span:0xc00c4de100 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.69793993009846 adjusted:17.026722126693535 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc008917680] bot:[0xc03478a840] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 109 132 24 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 174 228 121 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 174 228 121 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 245 237 240 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2474777 dropPrefixes:[]}

@korjavin2 could you double-check your disk? It seems like the SST files are vanishing while Dgraph is running. This could only happen if the data directories were deleted or the attached volume was somehow unmounted.

Also, it seems like you’re able to reproduce the issue. Would you be able to share the steps? I’m happy to help fix this if you can help me reproduce it.
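If it helps, here is a quick standalone sketch (my own, not part of Dgraph; it assumes the alpha's postings directory is p and that Badger's table files end in .sst) that lists the table files with their sizes and modification times. Comparing its output before and after the compaction errors appear would show whether files are really disappearing from the disk underneath Dgraph:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

func main() {
	// Path to the alpha's postings directory; adjust to your volume mount.
	dir := "p"

	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read dir:", err)
		os.Exit(1)
	}

	// Collect the SST (table) file names.
	var names []string
	for _, e := range entries {
		if filepath.Ext(e.Name()) == ".sst" {
			names = append(names, e.Name())
		}
	}
	sort.Strings(names)

	// Print each file with its size and modification time.
	for _, name := range names {
		info, err := os.Stat(filepath.Join(dir, name))
		if err != nil {
			fmt.Printf("%s\tSTAT FAILED: %v\n", name, err)
			continue
		}
		fmt.Printf("%s\t%d bytes\tmodified %s\n",
			name, info.Size(), info.ModTime().Format("2006-01-02 15:04:05"))
	}
	fmt.Printf("total: %d .sst files in %s\n", len(names), dir)
}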

Well, this is a usual Google Cloud block device.
Never in my life have I had a problem with them (yet?).

There is plenty of free space on it.

Could some memory pressure provoke such behavior?
Like Dgraph marked a file as written, but it was never created?

Maybe this could help? I see that it's sending something to Sentry.

I0524 12:10:57.585813 21 log.go:295] Found file: 2903 First Index: 87060001
I0524 12:10:57.585863 21 log.go:295] Found file: 2904 First Index: 87090001
I0524 12:10:57.585888 21 log.go:295] Found file: 2905 First Index: 87120001
I0524 12:10:57.585913 21 log.go:295] Found file: 2906 First Index: 87150001
I0524 12:10:57.585980 21 storage.go:125] Init Raft Storage with snap: 87069157, first: 87069158, last: 87162596
I0524 12:10:57.586006 21 server_state.go:141] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x32c8038 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false NamespaceOffset:1 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0}
E0524 12:10:58.754460 21 log.go:32] Received err: file does not exist for table 2593699. Cleaning up...
[Sentry] 2021/05/24 12:10:58 Sending fatal event [e1c1daec7b854adda36f647738fb340a] to o318308.ingest.sentry.io project: 1805390
2021/05/24 12:10:58 file does not exist for table 2593699
Error while creating badger KV posting store
github.com/dgraph-io/dgraph/x.Checkf
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:51
github.com/dgraph-io/dgraph/worker.(*ServerState).initStorage
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:145
github.com/dgraph-io/dgraph/worker.InitServerState
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:80
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:760
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.init.2.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:93
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:78
main.main
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:99
runtime.main
/usr/local/go/src/runtime/proc.go:225
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1371

I just reproduced it again, and this time I was able to find this in the log (which was tremendous in size):

W0525 17:53:10.405345      20 log.go:36] [Compactor: 2] LOG Compact FAILED with error: sync p/MANIFEST: file already closed: {span:0xc009f4bd80 compactorId:2 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 80759979 807599799 8075997999] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:4 score:1.7191355138911069 adjusted:1.4460537330498309 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 80759979 807599799 8075997999] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc00042d800 nextLevel:0xc00042d860 top:[0xc00a2bf140] bot:[0xc024aaa480] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 8 49 44 50 44 51 44 52 44 53 44 54 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 8 55 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 8 49 44 50 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 8 97 110 103 101 108 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2277394 dropPrefixes:[]}
W0525 17:53:10.405416      20 log.go:36] While running doCompact: sync p/MANIFEST: file already closed

Some additional info about the hardware, and some metrics:
zero pod:

alpha pod:

dgraph prom metrics:

our app business metrics (upserts/mutations):

@korjavin2 hey, it looks like the problem is that the MANIFEST file is being closed. The compaction modifies the SSTs, but we are not able to write the new changes to the MANIFEST file. If the MANIFEST file doesn’t have information about the new SST, we will not consider that file.
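To make that concrete: the "MANIFEST removes non-existing table" error in the logs is what you would expect if a deletion is applied for a table the manifest never learned about. Here is a minimal sketch of that bookkeeping in Go (my own illustration of the idea, not Badger's actual code):

package main

import "fmt"

// changeOp mirrors the two kinds of records a manifest log carries:
// a table was created, or a table was deleted.
type changeOp int

const (
	opCreate changeOp = iota
	opDelete
)

type change struct {
	op      changeOp
	tableID uint64
}

// apply folds a stream of manifest changes into the set of live tables.
// Deleting a table the manifest never saw means the manifest and the
// SST files produced by compaction have diverged.
func apply(changes []change) (map[uint64]bool, error) {
	live := map[uint64]bool{}
	for _, c := range changes {
		switch c.op {
		case opCreate:
			live[c.tableID] = true
		case opDelete:
			if !live[c.tableID] {
				return nil, fmt.Errorf("MANIFEST removes non-existing table %d", c.tableID)
			}
			delete(live, c.tableID)
		}
	}
	return live, nil
}

func main() {
	// The creation record for table 2595418 never made it into the manifest
	// (say, because the manifest file was closed), so a later compaction
	// that tries to delete that table fails with the error seen in the logs.
	_, err := apply([]change{
		{op: opCreate, tableID: 2595417},
		{op: opDelete, tableID: 2595418},
	})
	fmt.Println(err)
}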

The open question right now is why we are closing/losing the MANIFEST file.
@korjavin2 can you share the complete log for zero and alpha?
Can you also share the complete config for your deployment? Are you using the same directory for two alphas by any chance?

This looks like a strange bug because we have not seen this issue before.


We have exactly one alpha and exactly one zero pod, with separate disks.

I installed them from the Helm chart.
I've attached the values.yaml that I use.
dgraph.yaml (17.3 KB)

@korjavin2 Can you share the complete alpha and zeros logs, please?

I've attached the logs.

alpha.log (186.4 KB)

UPD:
It’s not really complete (the logs are huge).
I made an excerpt from before the time of the accident, with the accident included.

There was nothing in zero’s log at that time.

I reproduced this again.

Again, a fresh Helm installation, and here is the alpha log:

00000000000011656261792e6c697374696e672e6e616d65020a20636f: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a20636f0000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 32 99 111]
W0528 10:04:05.271085      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a323031: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a3230310000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 50 48 49]
W0528 10:04:05.393619      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000016656261792e6c697374696e672e6861735f696d6167650400000000000c49f0: could not read list part with key 0400000000000000000016656261792e6c697374696e672e6861735f696d6167650400000000000c49f00000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 22 101 98 97 121 46 108 105 115 116 105 110 103 46 104 97 115 95 105 109 97 103 101 4 0 0 0 0 0 12 73 240]
W0528 10:04:06.166370      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a204e45: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a204e450000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 32 78 69]
W0528 10:04:07.019532      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000015656261792e6c697374696e672e6c6f636174696f6e020a204361: could not read list part with key 0400000000000000000015656261792e6c697374696e672e6c6f636174696f6e020a2043610000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 67 97]
I0528 10:04:09.623414      21 log.go:34] Block cache metrics: hit: 479279888 miss: 4433230323 keys-added: 210222525 keys-updated: 2219115 keys-evicted: 210107693 cost-added: 3880628428961 cost-evicted: 3879930501081 sets-dropped: 462953 sets-rejected: 7334269 gets-dropped: 11802368 gets-kept: 4720027264 gets-total: 4912510211 hit-ratio: 0.10
W0528 10:04:10.846704      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000015656261792e6c697374696e672e6c6f636174696f6e02016765726d616e79: could not read list part with key 0400000000000000000015656261792e6c697374696e672e6c6f636174696f6e02016765726d616e790000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 1 103 101 114 109 97 110 121]
W0528 10:04:11.445106      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000016656261792e6c697374696e672e6861735f696d61676504000000000028f8ef: could not read list part with key 0400000000000000000016656261792e6c697374696e672e6861735f696d61676504000000000028f8ef0000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 22 101 98 97 121 46 108 105 115 116 105 110 103 46 104 97 115 95 105 109 97 103 101 4 0 0 0 0 0 40 248 239]
W0528 10:04:11.982254      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a33343b: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a33343b0000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 51 52 59]
W0528 10:04:12.122931      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a6f7665: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a6f76650000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 111 118 101]
W0528 10:04:12.841819      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a303334: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a3033340000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 48 51 52]
W0528 10:04:14.334646      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a207669: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a2076690000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 32 118 105]
W0528 10:04:17.034648      21 log.go:36] [Compactor: 2] LOG Compact FAILED with error: sync p/MANIFEST: file already closed: {span:0xc0057fcc00 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:4 score:1.4644237509096527 adjusted:2.312061263167869 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00042c3c0 nextLevel:0xc00042c420 top:[0xc03e179d40] bot:[0xc0115c1d40 0xc013279200 0xc0446a9800] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 72 83 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 110 32 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 83 107 121 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 118 97 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2113661 dropPrefixes:[]}
W0528 10:04:17.034724      21 log.go:36] While running doCompact: sync p/MANIFEST: file already closed
W0528 10:04:17.566470      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 000000000000000000001b656261792e6c697374696e672e736f6c645f62795f73656c6c6572040000000000001374: could not read list part with key 040000000000000000001b656261792e6c697374696e672e736f6c645f62795f73656c6c65720400000000000013740000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 27 101 98 97 121 46 108 105 115 116 105 110 103 46 115 111 108 100 95 98 121 95 115 101 108 108 101 114 4 0 0 0 0 0 0 19 116]
W0528 10:04:19.196584      21 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 3752678: {span:0xc00d154c00 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:4 score:1.370733526008833 adjusted:2.142197352896229 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00042c3c0 nextLevel:0xc00042c420 top:[0xc022974780] bot:[0xc0446a9800] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 110 38 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 118 97 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 99 105 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 118 97 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2267643 dropPrefixes:[]}
W0528 10:04:19.196665      21 log.go:36] While running doCompact: MANIFEST removes non-existing table 3752678
W0528 10:04:19.200899      21 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 3752679: {span:0xc00a749300 compactorId:1 t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:4 score:1.3729726516323733 adjusted:2.14569668292102 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00042c3c0 nextLevel:0xc00042c420 top:[0xc034eff8c0] bot:[0xc013279d40 0xc0115c1d40] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 80 73 80 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 72 82 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 77 105 100 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 32 86 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2230336 dropPrefixes:[]}

UPD:
I disabled the Google Cloud persistent disk and ran it just on the node disks.

Nothing changed.
Within 3 days I got the same issues again:

W0614 13:59:18.326329 20 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 113984: {span:0xc0097be280 compactorId:2 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.36615171432495 adjusted:21.054207908706115 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc016b760c0] bot:[0xc01302fd40 0xc00d15d5c0] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 110 102 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 114 116 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 103 117 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 101 108 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2124381 dropPrefixes:[]}
W0614 13:59:18.326423 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113984
W0614 13:59:18.326469 20 log.go:36] [Compactor: 0] LOG Compact FAILED with error: MANIFEST removes non-existing table 113988: {span:0xc00ca4c580 compactorId:0 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.558347129821776 adjusted:26.485386104764444 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc0155da240] bot:[0xc00d4a3ec0] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 116 108 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 101 100 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 101 109 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 111 119 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2015315 dropPrefixes:[]}
W0614 13:59:18.326555 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113988
W0614 13:59:18.326541 20 log.go:36] [Compactor: 3] LOG Compact FAILED with error: MANIFEST removes non-existing table 113979: {span:0xc00847e180 compactorId:3 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.556907653808594 adjusted:26.483767779090424 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc00e532000] bot:[0xc00928fd40] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 83 105 114 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 97 103 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 82 101 115 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 104 101 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2000221 dropPrefixes:[]}
W0614 13:59:18.326577 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113979
W0614 13:59:18.326925 20 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 112879: {span:0xc0064b4080 compactorId:1 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:4 score:1.110077861853078 adjusted:1.116691118230626 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d31a0 nextLevel:0xc0000d3200 top:[0xc00f389b00] bot:[0xc0155dba40] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 111 120 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 32 103 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 32 66 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 32 68 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:4005514 dropPrefixes:[]}
W0614 13:59:18.327005 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 112879
W0614 13:59:18.366050 20 log.go:36] [Compactor: 3] LOG Compact FAILED with error: MANIFEST removes non-existing table 113988: {span:0xc00847e380 compactorId:3 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.558347129821776 adjusted:26.485386104764444 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc0155da240] bot:[0xc00d4a3ec0] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 116 108 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 101 100 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 101 109 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 111 119 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2015315 dropPrefixes:[]}
W0614 13:59:18.366129 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113988
W0614 13:59:18.366108 20 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 113991: {span:0xc0097be300 compactorId:1 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.36615171432495 adjusted:21.04911062303261 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc0147c23c0] bot:[0xc00f389b00] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 32 85 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 100 32 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 111 120 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 32 103 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2368028 dropPrefixes:[]}
W0614 13:59:18.366203 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113991
W0614 13:59:18.366261 20 log.go:36] [Compactor: 0] LOG Compact FAILED with error: MANIFEST removes non-existing table 113979: {span:0xc00847e300 compactorId:0 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.558347129821776 adjusted:26.485386104764444 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc00e532000] bot:[0xc00928fd40] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 83 105 114 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 97 103 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 82 101 115 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 104 101 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2000221 dropPrefixes:[]}
W0614 13:59:18.366554 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113979

Looks pretty critical to me.


I filed this issue on Friday, to no answer; it was (unknowingly) a duplicate of this one.

I completely rebuilt my production system on Friday night to get around this (I have 3 shards, so only one in that group is borked). This just happened again on the new system. No restarts, a brand new system as of Saturday night, built from the bulk loader, if that is interesting.

One of the peers just gives up with this "MANIFEST removes non-existing table X" error, and a restart makes it crash-loop with the panic in the linked issue.

Here is the Raft applied index of that group; you can see when the affected one encountered this error:

Note: unlike the OP, I have 12 alphas, but it seems this could happen at any time. Each alpha is the only pod on a GKE VM. I have been running Dgraph for over a year and had never encountered this until the upgrade to v21.03.

Also see here: the memory usage of the node with this error increases, while the others stay low.

@ibrahim, please - it seems there is some critical issue in Badger. If another peer in the same group corrupts, that's probably full data loss without any further guidance.


Uh, thank you a lot.

I was deeply worried about how this could be considered non-critical, and why only we were facing it (constantly).

@ibrahim @MichelDiz @chewxy Can we get an acknowledgement on this?

The discussion has moved to Starting dgraph results in fatal badger error (file does not exist for table). Closing this thread.