There are 3 groups of 3 replicas, and one node has the following error; queries are no longer responding, causing a block. How should I deal with it?
The dgraph team tried to reproduce it on dgraph cloud and stated they could not.
That peer is corrupt now. To remove it and add a new peer: run /removeNode on the Zero, specifying the broken peer, wipe out its state, and bring it back with new state. Instructions are here (they are k8s-specific, but the steps are basically the ones above).
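For anyone reading along, here is a minimal sketch of the removal call, assuming Zero's HTTP admin port is reachable at localhost:6080 and that the broken Alpha is Raft ID 2 in group 1 (both IDs are placeholders; check Zero's /state endpoint for the real values). It is just the documented HTTP request, wrapped in Go:

```go
// Minimal sketch: ask Zero to remove a broken Alpha from its group.
// The address, Raft ID, and group number below are placeholders.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// /removeNode takes the Raft ID of the peer and the group it belongs to.
	url := "http://localhost:6080/removeNode?id=2&group=1"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```

Once the call succeeds, stop that Alpha, wipe its p and w directories, and restart it so Zero hands it a fresh Raft ID.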
This corruption issue has hit us a lot and is part of our bi-weekly problems with running dgraph ourselves. Removing nodes from a group also, in our case, left them in the Raft membership and messed up our cluster, which required forking dgraph to fix, but the dgraph team has never even responded to that.
Do you have a specific question? If so, you will have to share what you have done.
I am also a little confused by your output above, based on you saying:
but the commands are clearly using the v20.11.x flags (and the binary is called dgraphv20113).
Making a lot of assumptions here, I would guess that when you removed a peer and ran /removeNode, you may not have removed the p and w directories for that peer before restarting it. The reason I say this is that the Raft ID is assigned by the Zero servers to new, clean Alpha servers that do not specify a Raft ID of their own on the command line and do not have one written down in their p/w directories.
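As an illustration of that wipe step, here is a minimal sketch, assuming the Alpha's data lives under /var/lib/dgraph (the path mentioned later in this thread); adjust the base path to your layout, and run it only on a peer that has already been removed via /removeNode and whose Alpha process is stopped:

```go
// Minimal sketch: wipe the p and w directories of a removed Alpha so that,
// on restart, Zero assigns it a fresh Raft ID. The base path is an assumption.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	base := "/var/lib/dgraph" // assumption: the Alpha's data directories live here
	for _, dir := range []string{"p", "w"} {
		path := filepath.Join(base, dir)
		if err := os.RemoveAll(path); err != nil {
			log.Fatalf("removing %s: %v", path, err)
		}
		log.Printf("removed %s", path)
	}
	log.Println("restart the Alpha without a fixed Raft ID so Zero assigns a new one")
}
```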
We encountered a very similar situation: the 3-member cluster became unresponsive while the leader was trying to compact, mostly working on (bogged down by) the same two files. Here are the warning messages, showing up every second (or every two seconds):
While running doCompact: MANIFEST removes non-existing table 8603827
...
While running doCompact: MANIFEST removes non-existing table 8603684
I can see those two files under /var/lib/dgraph/p/, and they have been sitting there for 5 hours so far…
I would love to remove those two files and see if the series of complaints continues!
Or should we mark the leader as sick, remove it from the cluster, let the other two members elect a new leader, clear out the data on the faulty member, and add it back into the cluster? We are still at the consultation stage. How did you guys fix this problem in the past? Thanks, Steve
The only way I know of to deal with this is to treat the node as corrupt and follow the node removal/re-add procedure.
Many people see this issue and I am just hoping beyond hope that v21.09 will fix it somehow. Note: it probably won't, as this was not specifically addressed, but I am still hopeful.
Thank you so much for your reply, Kenan! I was hoping that I could somehow use the badger CLI to tell the app: please disregard these two files and continue with compaction, and so on (to make our 60,000 users happy!)
I wish. Unfortunately, dgraph labs proper have reportedly not been able to reproduce this issue at all, so no official direction on mitigation or repair has been given. The only saving grace is that the design of dgraph tolerates a corrupt node without losing anything (assuming you are running the proper shardreplicas=3 setup), and node removal/re-adding takes only a minute without impact to the application.
When "LOG Compact FAILED with error: MANIFEST removes non-existing table" happened on the leader, the whole cluster did not serve data, and our favorite app stopped! Sadly, I am now awaiting the next incident and watching for the Nagios alert!
Yeah, unfortunately I have noticed that requests still get routed to the affected peer and will stall the app until that peer is turned off. Or something like that, but yeah… it is bad.
Yeah, it happened again after two days, at 2:04 AM. At least we caught it with Nagios immediately. This time it was not on the leader, so the other two nodes continued to serve data, and the app is still up and running. I am wondering if we should downgrade from dgraph 21.03.0 to some stable version. Anything, just stable. PS: I noticed that our dgraph nodes constantly run at ~600% out of 1600% (16-CPU VMs). Maybe it happens during high load, when the little badger can't compact well and loses the tables? Poor little badger, it's hard to be a badger in dgraph!
W0917 02:36:56.361629 3018868 log.go:36] While running doCompact: MANIFEST removes non-existing table
W0917 02:36:56.416721 3018868 log.go:36] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table...
@Steve_Baroti it has been happening to us for several major versions; I do not think downgrading will let you escape from this, especially since the only other supported major version is v20.11.x, and its support expires in 3 months.
Our cluster handles an extremely heavy, constant write load, which I could imagine triggering some case untested by dgraph labs.
I have seen no mention from dgraph proper that any further effort has been put into rectifying this, other than this attempt:
But so many people open issues here saying this is a consistent problem that I just cannot imagine it can be ignored indefinitely:
Having meditated on this issue, we started storing less data, sharded the data, and doubled the cluster size, and for now the system seems more stable. Tip 58 from The Pragmatic Programmer comes to mind: “Random Failures Are Often Concurrency Issues”. I will update this post if significant events occur - fingers crossed. Spoke too soon: it happened again, and one node is out of whack:
dgraph[2197758]: W0924 12:57:02.928687 2197825 log.go:36] While running doCompact: MANIFEST removes non-existing table 18308720
It is not good for a DB to have unanswered corruption problems; keeping data intact is kind of its only job.
I totally agree it's probably a race during compaction, but I have tried looking at the badger code and it keeps all the tables in a mutex-locked map, so I am not quite sure how it could get confused about which ones are there.
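For readers following along, here is a minimal sketch of the kind of mutex-guarded table registry being described. This is not badger's actual code, just an illustration of the pattern: if every add and remove goes through the lock, the map itself should not lose entries, which is why a simple lost update on the map seems unlikely and the confusion presumably happens somewhere else.

```go
// Illustrative sketch only, not badger's code: a mutex-guarded registry of
// table IDs, where removing an unknown ID is the rough analogue of the
// "MANIFEST removes non-existing table" warning.
package main

import (
	"fmt"
	"sync"
)

type tableRegistry struct {
	mu     sync.Mutex
	tables map[uint64]struct{} // table ID -> present
}

func (r *tableRegistry) add(id uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.tables[id] = struct{}{}
}

// remove reports false when the ID is unknown.
func (r *tableRegistry) remove(id uint64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.tables[id]; !ok {
		return false
	}
	delete(r.tables, id)
	return true
}

func main() {
	r := &tableRegistry{tables: map[uint64]struct{}{}}
	r.add(8603827)
	fmt.Println(r.remove(8603827)) // true: table was known
	fmt.Println(r.remove(8603684)) // false: "non-existing table"
}
```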
From a three-node cluster, our head of devops expanded the cluster to 6 nodes (2 x 3) and we still recorded fatal compaction errors (on the leader too, so the whole cluster was unresponsive). Good news: five days ago the cluster was expanded to 10 nodes (2 x 5) and we finally got some stability: no compaction errors in these 5 days, and CPU usage went down during off-peak hours. It makes me think that, for stability reasons, the whole data center may one day run one huge dgraph cluster. Super horizontal scaling may be the answer!
Our cluster nodes have 16 cores and 24 GiB of RAM. Before expanding the cluster, we had a solid 600% CPU usage (day and night) on all nodes, with a reported 400 IOPS during the day and 2-3 ms I/O wait. Only after the expansion to 10 nodes did CPU usage eventually go down during off-hours, still with spikes to 600% during business hours, but not a solid 600% like before. I am wondering how we can tune badger and its friends, CPU- and memory-wise, so that compaction finishes faster.
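For what it's worth, here is a minimal sketch of the compaction-related knobs badger itself exposes, assuming the badger v3 Go API. This is illustrative only: dgraph embeds badger, so in a dgraph deployment these would have to be set through dgraph's own configuration (if it exposes them at all) rather than by opening the store yourself. The path and the specific values below are placeholders.

```go
// Sketch of badger v3's compaction-related options, using a throwaway store.
package main

import (
	"log"
	"os"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	dir := "/tmp/badger-example" // placeholder path for a scratch store
	if err := os.MkdirAll(dir, 0o755); err != nil {
		log.Fatal(err)
	}

	opts := badger.DefaultOptions(dir).
		WithNumCompactors(4).            // more goroutines doing compaction
		WithNumLevelZeroTables(8).       // L0 table count that triggers compaction
		WithNumLevelZeroTablesStall(16)  // L0 table count at which writes stall

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Flatten forces the LSM tree down into fewer levels; it can be useful
	// offline to reduce later compaction pressure.
	if err := db.Flatten(4); err != nil {
		log.Fatal(err)
	}
}
```

Raising the compactor count trades CPU for faster compaction, and the level-zero thresholds control when writes start to stall, which is the usual symptom under a heavy, constant write load.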