I was getting nervous with one peer of the group down, so I put the bulk-loaded p directory for that group back in place of the corrupted p directory, and the leader sent a snapshot to that peer - I think that's ok…
This does not solve the real issue here, but it may have helped my cluster get back to normal. Or I have messed everything up and don't know it.
edit: ok, that was a horrible idea - that server came back and looked healthy, but it was missing data. I think it was missing all txns from the bulk load up until the corruption, meaning the snapshot only covered the period from the corruption to current.
@mrjn thanks for the response -
I have a hunch it has to do with massive insert frequency, possibly the number of transactions completed in a very short window - it has happened to me twice in the past week, both times directly after a complete system rebuild.
During a rebuild, I have to turn off our ingestion, which creates a backlog. When the system is completely rebuilt hours later, the backlog is possibly millions of records and gets inserted very fast - that is my only hint at a cause. I don't know if it will help, but I have not rebuilt my system since the last corruption and I still have the corrupted p directory, so I could offer you the MANIFEST file or others if it would help any. I plan to rebuild tonight to fix this corruption.
Is there any way to recover a corrupted peer? The only idea I had was to remove the peer by ID and re-add it with no state and a different ID - but that is really extreme when running in k8s, where the ID comes from the statefulset pod ordinal, and it would become unsupportable quickly.
Is it possible for you to try this workload out on Dgraph Cloud? That way we can monitor the backend and see if we encounter this issue. We can probably give you a trial instance for a week to help replicate this issue. CC: @dmai
@iluminae We can set you up with a trial instance if you’d like.
You can call /removeNode following these K8s-specific steps. Alphas can get an auto-assigned Raft ID, so it isn't baked into the statefulset ordinal.
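For reference, a minimal sketch of what that call looks like against Zero's HTTP endpoint (the port, node ID, and group below are placeholders for illustration):

```sh
# Ask Zero to remove the dead Alpha from its group.
# Run against Zero's HTTP port (6080 by default); replace id and group
# with the Raft ID and group number of the corrupted peer.
curl "http://localhost:6080/removeNode?id=3&group=1"
```

After that, the pod's data directories can be wiped and the Alpha allowed to rejoin, at which point it should pick up a fresh Raft ID and receive a snapshot from the group leader.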
oh man @dmai that is exactly what I was looking for. I don't know where I got the idea that the ordinal was the Raft ID… maybe that was a long time ago (or a different db, idk). Having the procedure is really what I needed here. I will do this instead of rebuilding my whole cluster tonight.
@mrjn I would totally be game to fork my prod data and insert it into Dgraph Cloud as well, but I have to figure out if that is kosher since it is my customer's data. (maybe I can build a value scrambler into the insertion pipeline and have that be ok)
(I marked the comment above about the procedure for rebuilding a broken node in k8s, as that was my immediate concern)
@dmai I would love to help with reproducing this on your side, but I have to figure out if it's ok to ship out that kind of metadata or if I will have to obscure it. I will PM you when I figure that out.
Yeah, you should instead set up a dedicated instance for @korjavin2, since he has reproduced it more than I have and I am not positive I can send all of the metadata as is.
This just happened to one of my alphas again, this time on a leader, which seems to have been far worse, as it blocked queries until I killed the node and one of the others in the group took over as leader.
Since @korjavin2 and @iluminae have repeatedly seen the issue, it might be useful to add more verbose logging in Badger compactions to figure out why/when this particular table was deleted.
Also, if this happens again, please save the logs of the Alpha that's acting weird and collect the output of badger info --dir /path/to/pdir (note: you might need to set readonly=false in the info command if the Alpha didn't shut down properly).
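Something along these lines (the exact read-only flag name may differ between Badger versions; check `badger info --help` for yours):

```sh
# Dump table/manifest info for the corrupted p directory.
badger info --dir /path/to/pdir

# If the Alpha didn't shut down cleanly, the directory may need to be
# opened in read-write mode (flag name assumed here).
badger info --dir /path/to/pdir --read-only=false
```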
So, the cluster ran for a week in the cloud. We didn't see this particular "file does not exist" issue. In fact, we just ran dgraph increment on the cluster, and it's working fine.
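For anyone wanting to run the same sanity check, the increment tool repeatedly bumps a counter predicate through the cluster. A rough sketch of the invocation (flag names are from memory and may vary by Dgraph version; verify with `dgraph increment --help`):

```sh
# Run 100 sequential increments of a counter predicate against one Alpha.
# --alpha and --num are assumed flag names for the target address and count.
dgraph increment --alpha localhost:9080 --num 100
```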
We did happen to see a posting list rollup issue, which we had seen elsewhere too. We will debug that this week.