Did I lose my data forever?

Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version
$ dgraph version
 [Decoder]: Using assembly version of decoder
Page Size: 4096

Dgraph version   : v20.11.3
Dgraph codename  : tchalla-3
Dgraph SHA-256   : c3c1474369415e74b1a59ec7053cd1e585c9d55fe71243c72c48e313728d995a
Commit SHA-1     : 8d3eb766c
Commit timestamp : 2021-03-31 17:28:12 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs/.
For discussions about Dgraph     , visit https://discuss.dgraph.io.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.

Have you tried reproducing the issue with the latest release?

What is the hardware spec (RAM, OS)?

Alpha:

resources:
  requests:
    memory: 38800Mi
    cpu: 23
  limits:
    memory: 38800Mi
    cpu: 24

Zero:

requests:
  memory: 6400Mi
  cpu: 7500m
limits:
  memory: 10400Mi
  cpu: 8500m

Steps to reproduce the issue (command/config used to run Dgraph).

After some time of using Dgraph, I found this in the alpha pod log:

W0506 08:51:36.496078 22 log.go:36] While running doCompact: MANIFEST removes non-existing table 4684521
W0506 08:51:39.037852 22 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 4684521: {span:0xc00c0ca100 compactorId:1 t:{baseLevel:2 targetSz:[0 10485760 10485760 23078611 230786118 2307861185 23078611853] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:0 score:3 adjusted:3.2947788531481423 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 23078611 230786118 2307861185 23078611853] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00060f020 nextLevel:0xc00060f0e0 top:[0xc04b411140 0xc014ef1440 0xc052bb0a80 0xc078c59200 0xc06a1b2540 0xc0314392c0 0xc0061d75c0 0xc020b62000 0xc03b7a2540 0xc064470f00 0xc037887e00 0xc024ca8000 0xc04d9a9680 0xc032b758c0 0xc022d70240] bot:[0xc0333a6b40 0xc053706540 0xc02de96000 0xc019e18240] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 3 120 105 100 2 2 82 49 78 73 68 88 53 52 76 79 65 69 76 85 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 32 101 98 97 121 46 108 105 115 116 105 110 103 46 98 101 108 111 110 103 115 95 116 111 95 99 97 116 101 103 111 114 121 4 0 0 0 0 11 16 85 75 0 0 0 0 13 57 202 2 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 116 99 104 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 11 100 103 114 97 112 104 46 116 121 112 101 2 2 101 98 97 121 46 108 105 115 116 105 110 103 0 0 0 0 11 104 132 188 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:0 dropPrefixes:[]}
W0506 08:51:39.037996 22 log.go:36] While running doCompact: MANIFEST removes non-existing table 4684521
W0506 08:51:41.607176 22 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 4684521: {span:0xc007376a00 compactorId:1 t:{baseLevel:2 targetSz:[0 10485760 10485760 23078611 230786118 2307861185 23078611853] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:0 score:3 adjusted:3.2947788531481423 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 23078611 230786118 2307861185 23078611853] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00060f020 nextLevel:0xc00060f0e0 top:[0xc04b411140 0xc014ef1440 0xc052bb0a80 0xc078c59200 0xc06a1b2540 0xc0314392c0 0xc0061d75c0 0xc020b62000 0xc03b7a2540 0xc064470f00 0xc037887e00 0xc024ca8000 0xc04d9a9680 0xc032b758c0 0xc022d70240] bot:[0xc0333a6b40 0xc053706540 0xc02de96000 0xc019e18240] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 3 120 105 100 2 2 82 49 78 73 68 88 53 52 76 79 65 69 76 85 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 32 101 98 97 121 46 108 105 115 116 105 110 103 46 98 101 108 111 110 103 115 95 116 111 95 99 97 116 101 103 111 114 121 4 0 0 0 0 11 16 85 75 0 0 0 0 13 57 202 2 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 116 99 104 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 11 100 103 114 97 112 104 46 116 121 112 101 2 2 101 98 97 121 46 108 105 115 116 105 110 103 0 0 0 0 11 104 132 188 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:0 dropPrefixes:[]}
W0506 08:51:41.607325 22 log.go:36] While running doCompact: MANIFEST removes non-existing table 4684521
W0506 08:51:44.201397 22 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 4684521: {span:0xc029dfb700 compactorId:1 t:{baseLevel:2 targetSz:[0 10485760 10485760 23078611 230786118 2307861185 23078611853] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:0 score:3 adjusted:3.2947788531481423 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 23078611 230786118 2307861185 23078611853] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00060f020 nextLevel:0xc00060f0e0 top:[0xc04b411140 0xc014ef1440 0xc052bb0a80 0xc078c59200 0xc06a1b2540 0xc0314392c0 0xc0061d75c0 0xc020b62000 0xc03b7a2540 0xc064470f00 0xc037887e00 0xc024ca8000 0xc04d9a9680 0xc032b758c0 0xc022d70240] bot:[0xc0333a6b40 0xc053706540 0xc02de96000 0xc019e18240] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 3 120 105 100 2 2 82 49 78 73 68 88 53 52 76 79 65 69 76 85 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 32 101 98 97 121 46 108 105 115 116 105 110 103 46 98 101 108 111 110 103 115 95 116 111 95 99 97 116 101 103 111 114 121 4 0 0 0 0 11 16 85 75 0 0 0 0 13 57 202 2 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 116 99 104 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 11 100 103 114 97 112 104 46 116 121 112 101 2 2 101 98 97 121 46 108 105 115 116 105 110 103 0 0 0 0 11 104 132 188 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:0 dropPrefixes:[]}
W0506 08:51:44.201520 22 log.go:36] While running doCompact: MANIFEST removes non-existing table 4684521

Meanwhile, it seemed to be working (liveness probes kept passing), but I couldn’t use it: no requests were able to complete.

I tried restarting the alpha pod, but it didn’t help. Same behavior.

I added more hardware (x2) and restarted again.

After this, the behavior changed. I got:

I0506 09:08:07.849888 21 storage.go:125] Init Raft Storage with snap: 1314625848, first: 1314625849, last: 1314774484
I0506 09:08:07.849905 21 server_state.go:141] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x32c8038 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false NamespaceOffset:1 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0}
E0506 09:08:09.847222 21 log.go:32] Received err: file does not exist for table 4679668. Cleaning up...
[Sentry] 2021/05/06 09:08:09 Sending fatal event [2f384924cf264f9fbd53d297c5c9fb10] to o318308.ingest.sentry.io project: 1805390
2021/05/06 09:08:09 file does not exist for table 4679668
Error while creating badger KV posting store
github.com/dgraph-io/dgraph/x.Checkf
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:51
github.com/dgraph-io/dgraph/worker.(*ServerState).initStorage
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:145
github.com/dgraph-io/dgraph/worker.InitServerState
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:80
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:760
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.init.2.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:93
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:78
main.main
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:99
runtime.main
/usr/local/go/src/runtime/proc.go:225
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1371

So now I can’t start Dgraph anymore.

Why? What happened to this table?

And more importantly, how can I recover my data, if that’s possible?

I use Google Cloud persistent disks, and everything seems to be fine with the disk.

@MichelDiz sorry for mentioning you directly.

But I am confused by this situation and don’t know how to proceed.

@ibrahim or @dmai Any idea?

I made a backup of my disks (alpha and zero).

I created a new installation and started over from scratch (losing 94 billion docs).

Now it seems I will be using it with some fear.

I don’t understand what it was or how to fix it.

Hi, if you want to, I could jump on a call with you sometime this week. Let me assuage your fears.

I would like that.

Especially because today it seems like something has gone wrong again.

W0517 09:22:39.802045 21 log.go:36] [Compactor: 3] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595418: {span:0xc017b6e200 compactorId:3 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.738816695401244 adjusted:17.06840375491607 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc035801a40] bot:[0xc0066d2fc0] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 2 3 28 240 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 0 0 1 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 2 3 28 240 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 0 0 1 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:3088874 dropPrefixes:[]}
W0517 09:22:39.802141 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2595418
W0517 09:22:39.802421 21 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595417: {span:0xc007421580 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.69793993009846 adjusted:17.026722126693535 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc008917680] bot:[0xc03478a840] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 109 132 24 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 174 228 121 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 174 228 121 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 245 237 240 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2474777 dropPrefixes:[]}
W0517 09:22:39.802507 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2595417
W0517 09:22:39.806230 21 log.go:36] [Compactor: 0] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595355: {span:0xc01f19ab80 compactorId:0 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.78681155080123 adjusted:17.11734362832724 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc051bb6840] bot:[0xc084ed1080 0xc083146480] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 21 21 204 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 126 106 134 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 21 21 204 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 203 128 65 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:3809907 dropPrefixes:[]}
W0517 09:22:39.806289 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2595355
W0517 09:22:39.852077 21 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 2594140: {span:0xc01f19ac80 compactorId:1 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.738899035094615 adjusted:17.068487715876817 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc01ca6b380] bot:[0xc0134595c0] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 227 58 247 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 1 65 2 74 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 1 21 76 73 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 1 44 182 93 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:3090111 dropPrefixes:[]}
W0517 09:22:39.852198 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2594140
W0517 09:22:39.852500 21 log.go:36] [Compactor: 3] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595418: {span:0xc017b6e300 compactorId:3 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.738816695401244 adjusted:17.06840375491607 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc035801a40] bot:[0xc0066d2fc0] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 2 3 28 240 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 0 0 1 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 2 3 28 240 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 44 32 85 0 0 0 0 0 0 0 1 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:3088874 dropPrefixes:[]}
W0517 09:22:39.852604 21 log.go:36] While running doCompact: MANIFEST removes non-existing table 2595418
W0517 09:22:39.857928 21 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 2595417: {span:0xc00c4de100 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:3 score:16.69793993009846 adjusted:17.026722126693535 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 15023131 150231310 1502313109 15023131091] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc000680900 nextLevel:0xc000680960 top:[0xc008917680] bot:[0xc03478a840] thisRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 109 132 24 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 174 228 121 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 174 228 121 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 85 110 0 0 0 0 1 245 237 240 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2474777 dropPrefixes:[]}

@korjavin2 could you double-check your disk? It seems like the sst files are vanishing while dgraph was running. This could happen only if the data directories were deleted or somehow the attached volume was unmounted.

Also, it seems like you’re able to reproduce the issue. Would you be able to share the steps? I’m happy to help fix this if you can help me reproduce it.
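If it helps with the double-checking: vanished tables can be spotted directly on disk by looking for the file behind a table id from the log. A minimal sketch, assuming Badger's usual `<id>.sst` naming (zero-padded to six digits) and a postings directory called `p` — both assumptions, so substitute the paths from your own deployment:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// tableExists reports whether the SST file for a given table id is present.
// The "p" directory and the file-name patterns are assumptions about the
// deployment, not something confirmed by the logs above.
func tableExists(dir string, id uint64) bool {
	for _, name := range []string{
		fmt.Sprintf("%06d.sst", id), // zero-padded naming
		fmt.Sprintf("%d.sst", id),   // unpadded fallback
	} {
		if _, err := os.Stat(filepath.Join(dir, name)); err == nil {
			return true
		}
	}
	return false
}

func main() {
	// Table id taken from the "file does not exist for table 4679668" error.
	fmt.Println("table 4679668 present:", tableExists("p", 4679668))
}
```

Running this inside the alpha pod (or against a mounted snapshot of the disk) right after the first "MANIFEST removes non-existing table" warning would show whether the file is really gone from the filesystem or only missing from the manifest's view.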

Well, this is a regular Google Cloud block device.
Never in my life have I had a problem with one (yet?).

There is plenty of free space on it.

Could memory pressure provoke such behavior?
For example, Dgraph marking a file as written when it was never actually created?

Maybe this could help? I see that it is sending something to Sentry:

I0524 12:10:57.585813 21 log.go:295] Found file: 2903 First Index: 87060001
I0524 12:10:57.585863 21 log.go:295] Found file: 2904 First Index: 87090001
I0524 12:10:57.585888 21 log.go:295] Found file: 2905 First Index: 87120001
I0524 12:10:57.585913 21 log.go:295] Found file: 2906 First Index: 87150001
I0524 12:10:57.585980 21 storage.go:125] Init Raft Storage with snap: 87069157, first: 87069158, last: 87162596
I0524 12:10:57.586006 21 server_state.go:141] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x32c8038 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false NamespaceOffset:1 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0}
E0524 12:10:58.754460 21 log.go:32] Received err: file does not exist for table 2593699. Cleaning up...
[Sentry] 2021/05/24 12:10:58 Sending fatal event [e1c1daec7b854adda36f647738fb340a] to o318308.ingest.sentry.io project: 1805390
2021/05/24 12:10:58 file does not exist for table 2593699
Error while creating badger KV posting store
github.com/dgraph-io/dgraph/x.Checkf
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:51
github.com/dgraph-io/dgraph/worker.(*ServerState).initStorage
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:145
github.com/dgraph-io/dgraph/worker.InitServerState
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:80
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:760
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.init.2.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:93
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:78
main.main
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:99
runtime.main
/usr/local/go/src/runtime/proc.go:225
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1371

I just reproduced that again. And this time I was able to find this in the log (which was tremendous in size):

W0525 17:53:10.405345      20 log.go:36] [Compactor: 2] LOG Compact FAILED with error: sync p/MANIFEST: file already closed: {span:0xc009f4bd80 compactorId:2 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 80759979 807599799 8075997999] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:4 score:1.7191355138911069 adjusted:1.4460537330498309 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 80759979 807599799 8075997999] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc00042d800 nextLevel:0xc00042d860 top:[0xc00a2bf140] bot:[0xc024aaa480] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 8 49 44 50 44 51 44 52 44 53 44 54 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 8 55 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 8 49 44 50 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 8 97 110 103 101 108 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2277394 dropPrefixes:[]}
W0525 17:53:10.405416      20 log.go:36] While running doCompact: sync p/MANIFEST: file already closed

Some additional info about the hardware, and some metrics:
zero pod:

alpha pod:

dgraph prom metrics:

our app business metrics (upserts/mutations):

@korjavin2 hey, it looks like the problem is that the MANIFEST file is being closed. The compaction modifies the SST but we are not able to write the new changes to the Manifest file. If the Manifest file doesn’t have information about the new SST, we will not consider that file.

The open question right now is why we are closing/losing the Manifest file.
@korjavin2 can you share the complete log for zero and alpha?
Can you also share the complete config for your deployment? Are you using the same directory for two alphas by any chance?

This looks like a strange bug because we have not seen this issue before.
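To make the failure mode concrete: the MANIFEST is essentially a replay log of table additions and deletions, and a deletion record for a table id that was never added cannot be applied — which is exactly the "MANIFEST removes non-existing table" error in the logs. A minimal sketch of that replay logic (not Badger's actual code, just an illustration of the invariant):

```go
package main

import "fmt"

type op int

const (
	addTable op = iota
	delTable
)

type record struct {
	op op
	id uint64
}

// replay applies ADD/DELETE records in order, like a manifest replay at
// startup. A DELETE for an id not in the live set means the log and the
// reconstructed state disagree, so replay must fail.
func replay(recs []record) (map[uint64]bool, error) {
	tables := map[uint64]bool{}
	for _, r := range recs {
		switch r.op {
		case addTable:
			tables[r.id] = true
		case delTable:
			if !tables[r.id] {
				return nil, fmt.Errorf("MANIFEST removes non-existing table %d", r.id)
			}
			delete(tables, r.id)
		}
	}
	return tables, nil
}

func main() {
	// A consistent manifest replays cleanly.
	if _, err := replay([]record{{addTable, 1}, {addTable, 2}, {delTable, 1}}); err != nil {
		fmt.Println("unexpected:", err)
	}
	// A delete for an unknown id reproduces the error from the logs.
	_, err := replay([]record{{addTable, 1}, {delTable, 4684521}})
	fmt.Println(err)
}
```

This is why a compaction that succeeds on disk but fails to be recorded (e.g. "sync p/MANIFEST: file already closed") can leave the manifest and the SST files permanently out of sync.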

We have exactly one alpha and exactly one zero pod, with separate disks.

I installed them from the Helm chart.
I’ve attached the values.yaml that I use.
dgraph.yaml (17.3 KB)

@korjavin2 Can you share the complete alpha and zero logs, please?

I’ve attached the logs.

alpha.log (186.4 KB)

UPD:
It’s not really complete (the logs are huge).
I made an excerpt covering the time before the accident, with the accident included.

There was nothing in zero’s log at that time.

I reproduced this again.

Again, a fresh Helm installation. Here is the alpha log:

00000000000011656261792e6c697374696e672e6e616d65020a20636f: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a20636f0000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 32 99 111]
W0528 10:04:05.271085      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a323031: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a3230310000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 50 48 49]
W0528 10:04:05.393619      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000016656261792e6c697374696e672e6861735f696d6167650400000000000c49f0: could not read list part with key 0400000000000000000016656261792e6c697374696e672e6861735f696d6167650400000000000c49f00000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 22 101 98 97 121 46 108 105 115 116 105 110 103 46 104 97 115 95 105 109 97 103 101 4 0 0 0 0 0 12 73 240]
W0528 10:04:06.166370      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a204e45: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a204e450000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 32 78 69]
W0528 10:04:07.019532      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000015656261792e6c697374696e672e6c6f636174696f6e020a204361: could not read list part with key 0400000000000000000015656261792e6c697374696e672e6c6f636174696f6e020a2043610000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 10 32 67 97]
I0528 10:04:09.623414      21 log.go:34] Block cache metrics: hit: 479279888 miss: 4433230323 keys-added: 210222525 keys-updated: 2219115 keys-evicted: 210107693 cost-added: 3880628428961 cost-evicted: 3879930501081 sets-dropped: 462953 sets-rejected: 7334269 gets-dropped: 11802368 gets-kept: 4720027264 gets-total: 4912510211 hit-ratio: 0.10
W0528 10:04:10.846704      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000015656261792e6c697374696e672e6c6f636174696f6e02016765726d616e79: could not read list part with key 0400000000000000000015656261792e6c697374696e672e6c6f636174696f6e02016765726d616e790000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 21 101 98 97 121 46 108 105 115 116 105 110 103 46 108 111 99 97 116 105 111 110 2 1 103 101 114 109 97 110 121]
W0528 10:04:11.445106      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000016656261792e6c697374696e672e6861735f696d61676504000000000028f8ef: could not read list part with key 0400000000000000000016656261792e6c697374696e672e6861735f696d61676504000000000028f8ef0000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 22 101 98 97 121 46 108 105 115 116 105 110 103 46 104 97 115 95 105 109 97 103 101 4 0 0 0 0 0 40 248 239]
W0528 10:04:11.982254      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a33343b: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a33343b0000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 51 52 59]
W0528 10:04:12.122931      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a6f7665: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a6f76650000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 111 118 101]
W0528 10:04:12.841819      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a303334: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a3033340000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 48 51 52]
W0528 10:04:14.334646      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 0000000000000000000011656261792e6c697374696e672e6e616d65020a207669: could not read list part with key 0400000000000000000011656261792e6c697374696e672e6e616d65020a2076690000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 32 118 105]
W0528 10:04:17.034648      21 log.go:36] [Compactor: 2] LOG Compact FAILED with error: sync p/MANIFEST: file already closed: {span:0xc0057fcc00 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:4 score:1.4644237509096527 adjusted:2.312061263167869 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00042c3c0 nextLevel:0xc00042c420 top:[0xc03e179d40] bot:[0xc0115c1d40 0xc013279200 0xc0446a9800] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 72 83 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 110 32 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 83 107 121 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 118 97 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2113661 dropPrefixes:[]}
W0528 10:04:17.034724      21 log.go:36] While running doCompact: sync p/MANIFEST: file already closed
W0528 10:04:17.566470      21 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 000000000000000000001b656261792e6c697374696e672e736f6c645f62795f73656c6c6572040000000000001374: could not read list part with key 040000000000000000001b656261792e6c697374696e672e736f6c645f62795f73656c6c65720400000000000013740000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 27 101 98 97 121 46 108 105 115 116 105 110 103 46 115 111 108 100 95 98 121 95 115 101 108 108 101 114 4 0 0 0 0 0 0 19 116]
W0528 10:04:19.196584      21 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 3752678: {span:0xc00d154c00 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:4 score:1.370733526008833 adjusted:2.142197352896229 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00042c3c0 nextLevel:0xc00042c420 top:[0xc022974780] bot:[0xc0446a9800] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 110 38 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 118 97 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 99 105 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 118 97 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2267643 dropPrefixes:[]}
W0528 10:04:19.196665      21 log.go:36] While running doCompact: MANIFEST removes non-existing table 3752678
W0528 10:04:19.200899      21 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 3752679: {span:0xc00a749300 compactorId:1 t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:4 score:1.3729726516323733 adjusted:2.14569668292102 dropPrefixes:[] t:{baseLevel:2 targetSz:[0 10485760 10485760 16910574 169105742 1691057427 16910574274] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00042c3c0 nextLevel:0xc00042c420 top:[0xc034eff8c0] bot:[0xc013279d40 0xc0115c1d40] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 80 73 80 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 72 82 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 77 105 100 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 32 86 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2230336 dropPrefixes:[]}

UPD:
I disabled the Google Cloud persistent disk and ran Dgraph on node-local disks instead.

Nothing changed.
Within 3 days I hit the same issues again:

W0614 13:59:18.326329 20 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 113984: {span:0xc0097be280 compactorId:2 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.36615171432495 adjusted:21.054207908706115 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc016b760c0] bot:[0xc01302fd40 0xc00d15d5c0] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 110 102 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 114 116 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 97 103 117 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 101 108 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2124381 dropPrefixes:[]}
W0614 13:59:18.326423 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113984
W0614 13:59:18.326469 20 log.go:36] [Compactor: 0] LOG Compact FAILED with error: MANIFEST removes non-existing table 113988: {span:0xc00ca4c580 compactorId:0 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.558347129821776 adjusted:26.485386104764444 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc0155da240] bot:[0xc00d4a3ec0] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 116 108 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 101 100 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 101 109 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 111 119 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2015315 dropPrefixes:[]}
W0614 13:59:18.326555 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113988
W0614 13:59:18.326541 20 log.go:36] [Compactor: 3] LOG Compact FAILED with error: MANIFEST removes non-existing table 113979: {span:0xc00847e180 compactorId:3 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.556907653808594 adjusted:26.483767779090424 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc00e532000] bot:[0xc00928fd40] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 83 105 114 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 97 103 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 82 101 115 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 104 101 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2000221 dropPrefixes:[]}
W0614 13:59:18.326577 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113979
W0614 13:59:18.326925 20 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 112879: {span:0xc0064b4080 compactorId:1 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:4 score:1.110077861853078 adjusted:1.116691118230626 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d31a0 nextLevel:0xc0000d3200 top:[0xc00f389b00] bot:[0xc0155dba40] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 111 120 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 32 103 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 32 66 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 32 68 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:4005514 dropPrefixes:[]}
W0614 13:59:18.327005 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 112879
W0614 13:59:18.366050 20 log.go:36] [Compactor: 3] LOG Compact FAILED with error: MANIFEST removes non-existing table 113988: {span:0xc00847e380 compactorId:3 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.558347129821776 adjusted:26.485386104764444 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc0155da240] bot:[0xc00d4a3ec0] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 116 108 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 101 100 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 99 101 109 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 111 119 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2015315 dropPrefixes:[]}
W0614 13:59:18.366129 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113988
W0614 13:59:18.366108 20 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 113991: {span:0xc0097be300 compactorId:1 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.36615171432495 adjusted:21.04911062303261 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc0147c23c0] bot:[0xc00f389b00] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 32 85 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 100 32 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 100 111 120 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 101 32 103 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2368028 dropPrefixes:[]}
W0614 13:59:18.366203 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113991
W0614 13:59:18.366261 20 log.go:36] [Compactor: 0] LOG Compact FAILED with error: MANIFEST removes non-existing table 113979: {span:0xc00847e300 compactorId:0 t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]} p:{level:3 score:23.558347129821776 adjusted:26.485386104764444 dropPrefixes:[] t:{baseLevel:3 targetSz:[0 10485760 10485760 10485760 18157929 181579297 1815792975] fileSz:[67108864 2097152 2097152 2097152 4194304 8388608 16777216]}} thisLevel:0xc0000d3140 nextLevel:0xc0000d31a0 top:[0xc00e532000] bot:[0xc00928fd40] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 83 105 114 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 97 103 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 82 101 115 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 17 101 98 97 121 46 108 105 115 116 105 110 103 46 110 97 109 101 2 10 84 104 101 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:2000221 dropPrefixes:[]}
W0614 13:59:18.366554 20 log.go:36] While running doCompact: MANIFEST removes non-existing table 113979
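For anyone watching for a recurrence, a quick way to check whether an Alpha has entered this state is to grep its log for the compaction failure. This is just a sketch; the file name is a placeholder for however you capture the pod log (e.g. `kubectl logs dgraph-alpha-0 -n dgraph > alpha.log`):

```shell
# Count compaction failures in a saved Alpha log; a non-zero count means
# the node has hit the corrupted-MANIFEST state described above.
grep -c 'MANIFEST removes non-existing table' alpha.log
```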

Looks pretty critical to me.

I filed this issue on Friday, got no answer, and it was (unknowingly) a duplicate of this one.

I completely rebuilt my production system on Friday night to get around this (I have 3 shards, so only one group is borked). This just happened again on the new system. No restarts, a brand-new system as of Saturday night, built from the bulk loader, if that is interesting.

One of the peers just gives up with this `MANIFEST removes non-existing table X` error, and a restart makes it crash-loop with the panic in the linked issue.
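For context on why the restart crash-loops: Badger replays the MANIFEST at open, and a DELETE entry referring to a table id that no earlier ADD introduced makes the whole replay fail, so the store refuses to open. The sketch below is a hypothetical, heavily simplified model of that replay shape, not Badger's actual code:

```go
package main

import "fmt"

// changeOp models the two kinds of manifest change entries.
type changeOp int

const (
	opAdd changeOp = iota
	opDelete
)

// manifestChange is a simplified stand-in for a Badger manifest entry.
type manifestChange struct {
	op changeOp
	id uint64 // SSTable id
}

// replay applies a sequence of manifest changes to an in-memory table set.
// A DELETE for an id that is not present fails the entire replay, which is
// why an affected node cannot reopen its store and crash-loops on restart.
func replay(changes []manifestChange) (map[uint64]bool, error) {
	tables := map[uint64]bool{}
	for _, c := range changes {
		switch c.op {
		case opAdd:
			tables[c.id] = true
		case opDelete:
			if !tables[c.id] {
				return nil, fmt.Errorf("MANIFEST removes non-existing table %d", c.id)
			}
			delete(tables, c.id)
		}
	}
	return tables, nil
}

func main() {
	// Deleting table 113984 without a prior ADD reproduces the error shape.
	_, err := replay([]manifestChange{
		{opAdd, 113979},
		{opDelete, 113984},
	})
	fmt.Println(err) // MANIFEST removes non-existing table 113984
}
```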

Here is the Raft applied index of that group; you can see when one of the nodes encountered this error:

Note: unlike the OP, I have 12 Alphas, but it seems this could happen at any time. Each Alpha is the only pod on its GKE VM. I have been running Dgraph for over a year and never encountered this until upgrading to v21.03.

Also see here: the memory usage of the node with this error increases, while the others stay low.

@ibrahim, please take a look; it seems there is some critical issue in Badger. If another peer in the same group gets corrupted, that's probably full data loss without any further guidance.
