LOG Compact FAILED with error: MANIFEST removes non-existing table 15777621,

There are 3 groups of 3 replicas, and one node is hitting the following error. Queries no longer respond, which blocks the cluster. How should I deal with it?

W0716 14:37:32.990118 16346 log.go:36] [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 15777621: {span:0xc00ceb4700 compactorId:2 t:{baseLevel:2 targetSz:[0 10485760 10485760 30006262 300062624 3000626242 30006262429] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:2 score:3.043694591522217 adjusted:3.254670825337362 dropPrefixes: t:{baseLevel:2 targetSz:[0 10485760 10485760 30006262 300062624 3000626242 30006262429] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc00078e420 nextLevel:0xc00078e480 top:[0xc042cd72c0] bot:[0xc05441b440] thisRange:{left:[4 0 9 114 111 111 116 85 103 99 73 100 2 6 1 0 0 0 0 0 0 0 1 8 83 160 210 61 139 110 246 0 0 0 0 0 0 0 0] right:[4 0 9 114 111 111 116 85 103 99 73 100 2 6 1 0 0 0 0 0 0 0 1 8 83 160 210 61 176 39 145 255 255 255 255 255 255 255 255] inf:false} nextRange:{left:[4 0 9 114 111 111 116 85 103 99 73 100 2 6 1 0 0 0 0 0 0 0 1 8 83 160 210 61 139 110 246 0 0 0 0 0 0 0 0] right:[4 0 9 114 111 111 116 85 103 99 73 100 2 6 1 0 0 0 0 0 0 0 1 8 83 160 210 62 74 38 126 255 255 255 255 255 255 255 255] inf:false} splits: thisSize:2563487 dropPrefixes:}
W0716 14:37:32.990209 16346 log.go:36] While running doCompact: MANIFEST removes non-existing table 15777621

The cluster is deployed on physical machines (not Kubernetes), running version v21.03, and data inserts are relatively frequent.

nohup ./dgraphv20113 zero  --my 10.x.19.91:5080 --replicas 3 >> zero.log &
nohup ./dgraphv20113 alpha  --whitelist 10.x.0.0:10.5.0.0 -o 0  --ludicrous_mode  --my 10.x.19.91:7080 --zero 10.x.19.91:5080  --pending_proposals 32 -p p0 -w w0 >> nohup0.out &
nohup ./dgraphv20113 alpha  --whitelist 10.x.0.0:10.5.0.0 -o 1  --ludicrous_mode  --my 10.x.19.91:7081 --zero 10.x.19.91:5080  --pending_proposals 32 -p p1 -w w1 >> nohup1.out &
nohup ./dgraphv20113 alpha  --whitelist 10.x.0.0:10.5.0.0 -o 2  --ludicrous_mode  --my 10.x.19.91:7082 --zero 10.x.19.91:5080  --pending_proposals 32 -p p2 -w w2 >> nohup2.out &

This looks like the same issue as reported in two other threads.

The dgraph team tried to reproduce it on dgraph cloud and stated they could not.

That peer is now corrupt. To replace it: run /removeNode on the zero, specifying the messed-up peer, wipe out its state, and bring it back with fresh state. Instructions are here (they are k8s-specific, but basically do as above).
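For a bare-metal deployment like yours, the procedure is roughly the following (a sketch only, not verified commands: it assumes zero's HTTP endpoint is on the default port 6080, and that the broken alpha shows up in /state as id 3 in group 1 and is the -o 2 instance; substitute your real values):

# 1. Find the broken alpha's Raft ID and group in zero's membership state.
curl -s "http://10.x.19.91:6080/state"

# 2. Remove that member (the id and group values come from the /state output).
curl "http://10.x.19.91:6080/removeNode?id=3&group=1"

# 3. On the broken host: stop that alpha process, then wipe its state directories.
rm -rf p2 w2

# 4. Restart the alpha with the same flags as before; because it comes up with empty
#    p/w directories, zero assigns it a brand-new Raft ID.
nohup ./dgraphv20113 alpha --whitelist 10.x.0.0:10.5.0.0 -o 2 --ludicrous_mode --my 10.x.19.91:7082 --zero 10.x.19.91:5080 --pending_proposals 32 -p p2 -w w2 >> nohup2.out &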

This corruption issue has hit us a lot and is part of our bi-weekly problems with running dgraph ourselves. In our case, removing nodes from a group also left them in the Raft membership and messed up our cluster, which required forking dgraph to fix; the dgraph team has never even responded to that.

Thanks!

@iluminae
I0719 14:06:37.313561 16173 run.go:553] HTTP server started. Listening on port 8080
I0719 14:06:37.411412 16173 pool.go:162] CONNECTING to 10.4.19.91:5080
[Sentry] 2021/07/19 14:06:37 Sending fatal event [43643ba1d1d446039d16ab8d21b973ff] to o318308.ingest.sentry.io project: 1805390
2021/07/19 14:06:37 rpc error: code = Unknown desc = REUSE_RAFTID: Duplicate Raft ID 1 to removed member: id:1 group_id:1 addr:"10.4.19.91:7080" last_update:1617349861

Do you have a specific question? If so, you will have to share what you have done.

Also, I am a little confused by your output above, given that you said you are using v21.03, but the commands are obviously using the v20.11.x flags (and the binary is called dgraphv20113).

Making a lot of assumptions here: I would guess that when you removed a peer and ran /removeNode, you may not have removed the p and w directories for that peer before restarting it. The reason I say this is that the Raft ID is given by the zero servers to new, clean alpha servers that do not specify a Raft ID of their own on the command line or have one written down in their p/w directories.
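As a quick sanity check (again just a sketch: it assumes zero's HTTP endpoint is on the default port 6080 and that jq is available; the exact shape of the /state payload may differ, so verify against your cluster):

# The member that was removed should show up under "removed" in zero's state;
# an alpha restarted with its old p/w directories will try to rejoin with that dead Raft ID.
curl -s "http://10.4.19.91:6080/state" | jq '.removed'

# So stop the restarted alpha, delete whichever p/w pair belonged to the removed peer
# (here assuming it was the -o 0 instance), and start it again so zero assigns a new ID.
rm -rf p0 w0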

About the dgraphv20113 name: that is just a leftover from our operations staff; the binary is actually v21.03.
We can now start the new alpha. Thank you very much for your reply!

We encountered a very similar situation: the 3-member cluster became non-responsive while the leader was trying to compact, mostly working on (bogged down by) the same two files. Here are the warning messages, showing up every second (or every two seconds):

While running doCompact: MANIFEST removes non-existing table 8603827
...
While running doCompact: MANIFEST removes non-existing table 8603684

I can see those two files under /var/lib/dgraph/p/, and they have been sitting there for 5 hours so far…

/var/lib/dgraph/p/8603827.sst (9:11 - Sep 13 09:11)
/var/lib/dgraph/p/8603684.sst (9:11 - Sep 13 09:11)
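One way to at least inspect what the MANIFEST references would be something like the following (a sketch: it assumes the badger CLI built against the matching Badger version is installed, the alpha is stopped, and p/ is backed up first):

# With the alpha stopped, summarize what the MANIFEST and LSM levels reference.
badger info --dir /var/lib/dgraph/p

# The two tables the compactor keeps complaining about:
ls -lh /var/lib/dgraph/p/8603827.sst /var/lib/dgraph/p/8603684.sst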

The only relevant errors before 9:11 are:

I0913 08:55:29.157530    7078 draft.go:1530] Found 1 old transactions. Acting to abort them.
I0913 08:55:29.159236    7078 draft.go:1491] TryAbort 1 txns with start ts. Error: <nil>
I0913 08:55:29.159249    7078 draft.go:1514] TryAbort selectively proposing only aborted txns: txns:<start_ts:24886942 >
I0913 08:55:29.160094    7078 draft.go:1533] Done abortOldTransactions for 1 txns. Error: <nil>

And at 9:11, kaboom!

 W0913 09:11:09.175527    7078 log.go:36] [Compactor: 2] LOG Compact FAILED with error: sync /var/lib/dgraph/p/MANIFEST: file already closed:
W0913 09:11:09.181562    7078 log.go:36] While running doCompact: sync /var/lib/dgraph/p/MANIFEST: file already closed
 W0913 09:11:09.228910    7078 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 8603684
W0913 09:11:09.229006    7078 log.go:36] While running doCompact: MANIFEST removes non-existing table 8603684

I would love to remove those two files and see if the series of complaints continues! :heart:
Or should we mark the leader as sick, remove it from the cluster, allow the other two members to elect a new leader, clear out the data on the faulty member, and add it back into the cluster? We are still only weighing options. How did you fix this problem in the past? Thanks, Steve

The only way I know of to deal with this is to treat the node as corrupt and follow the node removal/re-add procedure.

Many people see this issue, and I am just hoping beyond hope that v21.09 will fix it somehow. Note: it probably won't, as this was not specifically addressed, but I am still hopeful :slight_smile:

Thank you so much for your reply, Kenan! I was hoping that I could somehow use the badger CLI to tell the app: please disregard these two files, please continue compacting, and so on (to make our 60,000 users happy!)

I wish. Unfortunately, dgraph labs proper have reportedly not been able to reproduce this issue at all, so no official direction on mitigation or repair has been given. The only saving grace is that dgraph's design tolerates a corrupt node without losing anything (assuming you are running a proper 3-replicas-per-group setup, i.e. --replicas 3 on zero), and node removal/re-adding takes only a minute without impact to the application.

But I run into this weekly, sometimes bi-weekly.

When LOG Compact FAILED with error: MANIFEST removes non-existing table happened on the leader, the whole cluster did not serve data, and our favorite app stopped! Sadly, I am now waiting for the next incident and watching for the Nagios alert! :crazy_face: :crossed_fingers:
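For reference, a Nagios-style check could be as simple as the line below (a sketch: alpha-host is a placeholder, and it assumes the alphas expose the default HTTP port 8080 and that /health reports "healthy" when the node is fine):

# Replace alpha-host with the node to check; exits non-zero if the alpha is down or unhealthy.
curl -sf "http://alpha-host:8080/health" | grep -q '"status":"healthy"'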

Yeah, unfortunately I have noticed that requests still get routed to the affected peer and will stall the app until that peer is turned off. Or something like that, but yeah… it is bad.

Yeah, it happened again after two days, at 2:04 AM. At least we caught it with nagios immediately. This time it was not on the leader, so the other two nodes continued to serve data, and the app is still up and running. I am wondering if we should downgrade from dgraph 21.03.0 to some stable version. Anything, just stable. :pray: PS. I noticed that our dgraph nodes run constantly at ~600% out of 1600% (16-CPU VMs). Maybe it happens under high load: the little badger can't compact well and loses track of the tables? Little badger :badger: :badger: :badger: It's hard to be a badger in dgraph! :slight_smile:
W0917 02:36:56.361629 3018868 log.go:36] While running doCompact: MANIFEST removes non-existing table
W0917 02:36:56.416721 3018868 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table...

@Steve_Baroti it has been happening to us across several major versions; I do not think downgrading will let you escape from this, especially since the only other supported major version is v20.11.x, and its support expires in 3 months.

Our cluster is an extremely heavy, constant-write scenario, which I could imagine triggering some case untested by dgraph labs.

I have seen no mention from dgraph proper that any further effort has been put into rectifying this, other than this attempt:

But so many people open issues on here saying this is a consistent problem, I just cannot imagine it can be ignored indefinitely:

After meditating on this issue, we started storing less data, we sharded the data, we doubled the cluster size, and for now the system seems more stable. Tip 58 from The Pragmatic Programmer comes to mind: "Random Failures Are Often Concurrency Issues". I will update this post if significant events occur - fingers crossed.
:crystal_ball: Spoke too soon, it happened again, and one node is out of whack :crying_cat_face: dgraph[2197758]: W0924 12:57:02.928687 2197825 log.go:36] While running doCompact: MANIFEST removes non-existing table 18308720

:frowning: Not good for a DB to have unanswered corruption problems; not corrupting data is kind of its only job.

I totally agree it's probably a race during compaction, but I have tried looking at the badger code and it keeps all the tables in a mutex-locked map; I am not quite sure how it could get confused about which ones are there :confused:

From a three-node cluster, our head of devops expanded the cluster to 6 nodes (2x3) and we still recorded fatal compaction errors (on the leader too, so the whole cluster was unresponsive). Good news: five days ago the cluster was expanded to 10 nodes (2x5) and we finally got some stability: no compaction errors in these 5 days, and CPU usage went down during off-peak hours. It made me think that, for stability reasons, the whole data center may one day run one huge dgraph cluster. Super horizontal scaling may be the answer! :face_with_raised_eyebrow:

I have 15 nodes, each with 16 cores and 32GiB, in our production cluster, and we see this bi-weekly. :frowning:

Our cluster nodes have 16 cores and 24GiB. Before expanding the cluster, we had a solid 600% CPU usage (day & night) on all nodes, with a reported 400 IOPS during the day and 2-3ms iowait. Only after the expansion to 10 nodes did CPU usage eventually come down during off-hours, still with spikes to 600% during business hours, but not a solid 600% like before. Wondering how we can tune up the badger and his friends, CPU- and memory-wise, so compaction finishes faster :thinking: