Critical bug in v21.12 permanently crashloops whole groups

Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version
$ dgraph version

Dgraph version   : v21.12.0
Dgraph codename  : zion
Dgraph SHA-256   : 078c75df9fa1057447c8c8afc10ea57cb0a29dfb22f9e61d8c334882b4b4eb37
Commit SHA-1     : d62ed5f15
Commit timestamp : 2021-12-02 21:20:09 +0530
Branch           : HEAD
Go version       : go1.17.3
jemalloc enabled : true

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

k8s 15x 16c 64GiB ram on GKE

Steps to reproduce the issue (command/config used to run Dgraph).

filed Panic on tablet move v21.12.0 but seems like it is more broad than just during a tablet move. Seems like any transaction may be removed from a internal map when it should not be and will cause a crashloop that will bring down entire groups since they share transaction atomicity via raft.

2021/12/22 06:56:16 Unable to find txn with start ts: 2178082
github.com/dgraph-io/dgraph/x.AssertTruef
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:107
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:707
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:744
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:931
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func2
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:970
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1025
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581

This is just after a node restart in that group - I think a snapshot is being sent from one node to the other and it just explodes. When this happens to the whole group at once thats it - game over, have to restore from backup.

1 Like

Is it possible the txns are being prematurely deleted in this case: dgraph/draft.go at 5429202ea764975fc62bfc2683afedf312c0cf0e · dgraph-io/dgraph · GitHub

I am wondering if this txn.Cache().Deltas() is length 0 for every txn in the batch then deleteTxns() would run and if applyMutations() needs to look for that txn then it explodes. This would explain why it repeatably explodes after coming back up and reading the wal.

        c := txn.Cache()
		...
		num := len(c.Deltas())
		...
		if num == 0 {
			continue
		}
1 Like

ran into this again, this time on one group member only instead of the whole group (whew)

here is more logs and context before the issue:

I1224 16:41:31.849022       1 log.go:34] L0 was stalled for 31.524s
I1224 16:41:32.699105       1 draft.go:915] Proposal with key: 18288822768274597043 already applied. Skipping index: 1673742. Delta: txns:<start_ts:2287950 commit_ts:2288020 > tx
I1224 16:41:32.700345       1 draft.go:915] Proposal with key: 18288822768274597046 already applied. Skipping index: 1673746. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.701011       1 draft.go:915] Proposal with key: 18288822768274597043 already applied. Skipping index: 1673748. Delta: txns:<start_ts:2287950 commit_ts:2288020 > tx
I1224 16:41:32.701871       1 draft.go:915] Proposal with key: 18288822768274597047 already applied. Skipping index: 1673751. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.703620       1 draft.go:915] Proposal with key: 18288822768274597046 already applied. Skipping index: 1673756. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.704166       1 draft.go:915] Proposal with key: 18288822768274597048 already applied. Skipping index: 1673757. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.705603       1 draft.go:915] Proposal with key: 18288822768274597049 already applied. Skipping index: 1673759. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.705908       1 draft.go:915] Proposal with key: 18288822768274597050 already applied. Skipping index: 1673760. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.706153       1 draft.go:915] Proposal with key: 18288822768274597051 already applied. Skipping index: 1673761. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.708291       1 draft.go:915] Proposal with key: 18288822768274597044 already applied. Skipping index: 1673743. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.708315       1 draft.go:915] Proposal with key: 18288822768274597045 already applied. Skipping index: 1673745. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.708328       1 draft.go:915] Proposal with key: 18288822768274597044 already applied. Skipping index: 1673749. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.708336       1 draft.go:915] Proposal with key: 18288822768274597045 already applied. Skipping index: 1673750. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.712847       1 draft.go:915] Proposal with key: 18288822768274597053 already applied. Skipping index: 1673783. Delta: txns:<start_ts:2287769 > txns:<start_ts:22843
I1224 16:41:32.713736       1 draft.go:915] Proposal with key: 18288822768274597054 already applied. Skipping index: 1673785. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.715434       1 draft.go:915] Proposal with key: 18288822768274597055 already applied. Skipping index: 1673788. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.738477       1 draft.go:915] Proposal with key: 14419583477275073265 already applied. Skipping index: 1673792. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.742548       1 draft.go:1321] Blocked pushing to raft node applyCh for 15.121s
I1224 16:41:32.760963       1 draft.go:915] Proposal with key: 18288822768274597077 already applied. Skipping index: 1673829. Delta: txns:<start_ts:2288010 > txns:<start_ts:22880
I1224 16:41:32.761854       1 draft.go:915] Proposal with key: 18288822768274597078 already applied. Skipping index: 1673831. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.771756       1 draft.go:915] Proposal with key: 18288822768274597080 already applied. Skipping index: 1673843. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.772883       1 draft.go:915] Proposal with key: 18288822768274597082 already applied. Skipping index: 1673845. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.776450       1 draft.go:915] Proposal with key: 18288822768274597084 already applied. Skipping index: 1673852. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.777146       1 draft.go:915] Proposal with key: 18288822768274597086 already applied. Skipping index: 1673854. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.777766       1 draft.go:915] Proposal with key: 18288822768274597087 already applied. Skipping index: 1673855. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.778627       1 draft.go:915] Proposal with key: 18288822768274597088 already applied. Skipping index: 1673856. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.780834       1 draft.go:915] Proposal with key: 18288822768274597089 already applied. Skipping index: 1673859. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.782344       1 draft.go:915] Proposal with key: 18288822768274597090 already applied. Skipping index: 1673861. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.783819       1 draft.go:915] Proposal with key: 18288822768274597091 already applied. Skipping index: 1673863. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.786879       1 draft.go:915] Proposal with key: 18288822768274597092 already applied. Skipping index: 1673867. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.788827       1 draft.go:915] Proposal with key: 18288822768274597093 already applied. Skipping index: 1673870. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.789957       1 draft.go:915] Proposal with key: 18288822768274597094 already applied. Skipping index: 1673872. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.791901       1 draft.go:915] Proposal with key: 18288822768274597095 already applied. Skipping index: 1673875. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.792583       1 draft.go:915] Proposal with key: 18288822768274597096 already applied. Skipping index: 1673876. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.793938       1 draft.go:915] Proposal with key: 18288822768274597097 already applied. Skipping index: 1673878. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.795227       1 draft.go:915] Proposal with key: 18288822768274597098 already applied. Skipping index: 1673880. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.797911       1 draft.go:915] Proposal with key: 18288822768274597100 already applied. Skipping index: 1673884. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.798537       1 draft.go:915] Proposal with key: 18288822768274597099 already applied. Skipping index: 1673885. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.802227       1 draft.go:915] Proposal with key: 18288822768274597101 already applied. Skipping index: 1673891. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.802873       1 draft.go:915] Proposal with key: 18288822768274597102 already applied. Skipping index: 1673892. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.807132       1 draft.go:915] Proposal with key: 18288822768274597103 already applied. Skipping index: 1673899. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.808630       1 draft.go:915] Proposal with key: 18288822768274597104 already applied. Skipping index: 1673900. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.814762       1 draft.go:915] Proposal with key: 18288822768274597105 already applied. Skipping index: 1673903. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.819607       1 draft.go:915] Proposal with key: 18288822768274597106 already applied. Skipping index: 1673909. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.839271       1 draft.go:915] Proposal with key: 18288822768274597107 already applied. Skipping index: 1673918. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.840098       1 draft.go:915] Proposal with key: 18288822768274597108 already applied. Skipping index: 1673919. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.840986       1 draft.go:915] Proposal with key: 18288822768274597109 already applied. Skipping index: 1673920. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.844175       1 draft.go:915] Proposal with key: 14419583477275073266 already applied. Skipping index: 1673924. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.844578       1 draft.go:915] Proposal with key: 14419583477275073267 already applied. Skipping index: 1673925. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.846978       1 draft.go:915] Proposal with key: 14419583477275073268 already applied. Skipping index: 1673928. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.847740       1 draft.go:915] Proposal with key: 14419583477275073269 already applied. Skipping index: 1673929. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.850774       1 draft.go:915] Proposal with key: 14419583477275073270 already applied. Skipping index: 1673932. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.878861       1 draft.go:915] Proposal with key: 18288822768274597079 already applied. Skipping index: 1673832. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.878896       1 draft.go:915] Proposal with key: 18288822768274597081 already applied. Skipping index: 1673842. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.878905       1 draft.go:915] Proposal with key: 18288822768274597083 already applied. Skipping index: 1673849. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.878917       1 draft.go:915] Proposal with key: 18288822768274597085 already applied. Skipping index: 1673853. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.884219       1 draft.go:915] Proposal with key: 14419583477275073271 already applied. Skipping index: 1673944. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.556245       1 draft.go:915] Proposal with key: 18251741697359166876 already applied. Skipping index: 1674045. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.556499       1 draft.go:915] Proposal with key: 18251741697359166877 already applied. Skipping index: 1674046. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.611549       1 draft.go:915] Proposal with key: 18251741697359166878 already applied. Skipping index: 1674099. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.611812       1 draft.go:915] Proposal with key: 18251741697359166876 already applied. Skipping index: 1674100. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.612103       1 draft.go:915] Proposal with key: 18251741697359166877 already applied. Skipping index: 1674101. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.612156       1 draft.go:915] Proposal with key: 18251741697359166878 already applied. Skipping index: 1674102. Delta: <nil> Snapshot: <nil>.
2021/12/24 16:41:33 Unable to find txn with start ts: 2287584
github.com/dgraph-io/dgraph/x.AssertTruef
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:107
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:707
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:744
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:931
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1020
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581

I removed the node that was damaged by the last panic (alpha-10) and re-added him with completely fresh storage. Here is what happened:

  • crashloops
  • removeNode
  • new node comes up, gets assigned to free spot in that group
  • gets snapshot from leader
  • after snapshot, gets txns sent to him from snapshotTS->now
  • panics at same timestamp as before

see cutely annotated screenshot of this in grafana:


(update: and a second removeNode and re-add done after the next snapshot worked as expected, but this highlights that whatever this issue is, is committed to the raft wal.)

happened again, two members of a 3 member group panic over the same txn, but not the third member - I was able to removeNode on 2 members and currently re-adding them one at a time.

edit: third one encountered this during snapshot send so lost the whole cluster.

I have reduced our ingestion to 2 threads and have not encountered the issue since (after having to restore from backup 3 times Christmas day) So it seems to be related to disk contention/io throttling.

This is why we were so effected by the badgerdb manifest corruption too - our disks are extremely throttled (gcp pd-ssd @2TB, second highest tier of IOPS on pd-ssd on GCP) and even 2 threads inserting is enough to trigger throttling on our disks.

What configuration does dgraph cloud use on GCP to avoid throttling? I mean the real issue needs to be solved in Dgraph but I am tired of rebuilding my cluster. At least backup/restore is OSS now - if I had to bulk load each of these times it would be even more unpalatable.

just an update here: The problems with v21.12.0 were just too much to handle and I had to roll back to a v21.03.2+manifest corruption patch. We found divergent data on peers in a group and had to call it quits after that.

  • This new corruption/panic happens somewhere around the oracle keeping track of pending mutations and I feel like I am really close with my research in this comment but I need someone with more knowledge of this part of the codebase to validate.
  • The query performance of v21.12.0 is fantastic - unless you are inserting data - then it is incredibly slow. Rolling back to v21.03 has fixed this, query times during ingestion are back to ~150ms 50th percentile, as opposed to 2s-4s. See V21.12 slow queries
  • Please release a v21.03.4 with badger fixes if you can.
  • I am so glad backup+restore was open sourced. I really want to use it on v21.03 until v21.12 is fixed… maybe switch it over to OSS in a v21.03.4 release?