iluminae
(Kenan Kessler)
December 22, 2021, 7:02am
1
Report a Dgraph Bug
What version of Dgraph are you using?
Dgraph Version
$ dgraph version
Dgraph version : v21.12.0
Dgraph codename : zion
Dgraph SHA-256 : 078c75df9fa1057447c8c8afc10ea57cb0a29dfb22f9e61d8c334882b4b4eb37
Commit SHA-1 : d62ed5f15
Commit timestamp : 2021-12-02 21:20:09 +0530
Branch : HEAD
Go version : go1.17.3
jemalloc enabled : true
Have you tried reproducing the issue with the latest release?
yes
What is the hardware spec (RAM, OS)?
k8s on GKE: 15 nodes × 16 cores / 64GiB RAM each
Steps to reproduce the issue (command/config used to run Dgraph).
I filed Panic on tablet move v21.12.0, but it seems the issue is broader than just tablet moves. It looks like any transaction may be removed from an internal map when it should not be, causing a crashloop that can bring down entire groups, since they share transaction atomicity via Raft.
2021/12/22 06:56:16 Unable to find txn with start ts: 2178082
github.com/dgraph-io/dgraph/x.AssertTruef
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:107
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:707
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:744
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:931
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func2
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:970
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1025
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1581
This happened just after a node restart in that group - I think a snapshot is being sent from one node to the other and it just explodes. When this happens to the whole group at once, that's it - game over; we have to restore from backup.
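For context, the stack trace shows the panic comes from an x.AssertTruef in applyMutations: the mutation's start timestamp is looked up in the oracle's in-memory txn map and is expected to be there. A dependency-free sketch of the failing pattern (all names below are illustrative stand-ins, not the actual Dgraph source):
```
package main

import "fmt"

// Stand-in for the oracle's startTs -> txn map; illustrative only.
var txns = map[uint64]struct{}{}

func applyMutations(startTs uint64) {
	if _, ok := txns[startTs]; !ok {
		// This mirrors the x.AssertTruef in the trace: the txn was
		// expected to be registered, but something already removed it.
		// The panic crashes the alpha; WAL replay on restart applies
		// the same committed entry and panics again - the crashloop.
		panic(fmt.Sprintf("Unable to find txn with start ts: %d", startTs))
	}
	// ... apply the mutation's deltas under the found txn ...
}

func main() {
	applyMutations(2178082) // panics: the txn is missing from the map
}
```
Because the lookup happens while applying committed Raft entries, the failure is deterministic: every restart replays the same entry and dies at the same place.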
1 Like
iluminae
(Kenan Kessler)
December 22, 2021, 2:53pm
2
Is it possible the txns are being prematurely deleted in this case: dgraph/draft.go at 5429202ea764975fc62bfc2683afedf312c0cf0e · dgraph-io/dgraph · GitHub
I am wondering: if txn.Cache().Deltas() has length 0 for every txn in the batch, then deleteTxns() would run - and if applyMutations() later needs to look up one of those txns, it explodes. That would also explain why it repeatably explodes after coming back up and reading the WAL.
```
c := txn.Cache()
...
num := len(c.Deltas())
...
if num == 0 {
	continue
}
```
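To make that hypothesis concrete, here is a hypothetical reconstruction of the suspected sequence - the types and control flow below are mine, not the linked draft.go, and exist only to illustrate the suspected ordering bug:
```
package main

import "fmt"

type txnState struct {
	startTs uint64
	deltas  [][]byte // stand-in for txn.Cache().Deltas()
}

// Stand-in for the oracle's startTs -> txn map.
var txns = map[uint64]*txnState{}

func deleteTxns(batch []*txnState) {
	for _, t := range batch {
		delete(txns, t.startTs)
	}
}

func commitBatch(batch []*txnState) {
	written := 0
	for _, t := range batch {
		if len(t.deltas) == 0 {
			continue // nothing to write for this txn
		}
		written++
		// ... write deltas to disk ...
	}
	// Suspected flaw: if every txn in the batch had zero deltas, the
	// whole batch is removed from the map right away - including txns
	// that a later applyMutations must still find by startTs.
	if written == 0 {
		deleteTxns(batch)
	}
}

func main() {
	t := &txnState{startTs: 2178082}
	txns[t.startTs] = t
	commitBatch([]*txnState{t})
	if _, ok := txns[t.startTs]; !ok {
		// This is the state in which applyMutations would panic with
		// "Unable to find txn with start ts: 2178082".
		fmt.Println("txn deleted prematurely")
	}
}
```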
1 Like
iluminae
(Kenan Kessler)
December 24, 2021, 4:48pm
4
Ran into this again, this time on only one group member instead of the whole group (whew).
Here are more logs and context from before the issue:
I1224 16:41:31.849022 1 log.go:34] L0 was stalled for 31.524s
I1224 16:41:32.699105 1 draft.go:915] Proposal with key: 18288822768274597043 already applied. Skipping index: 1673742. Delta: txns:<start_ts:2287950 commit_ts:2288020 > tx
I1224 16:41:32.700345 1 draft.go:915] Proposal with key: 18288822768274597046 already applied. Skipping index: 1673746. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.701011 1 draft.go:915] Proposal with key: 18288822768274597043 already applied. Skipping index: 1673748. Delta: txns:<start_ts:2287950 commit_ts:2288020 > tx
I1224 16:41:32.701871 1 draft.go:915] Proposal with key: 18288822768274597047 already applied. Skipping index: 1673751. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.703620 1 draft.go:915] Proposal with key: 18288822768274597046 already applied. Skipping index: 1673756. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.704166 1 draft.go:915] Proposal with key: 18288822768274597048 already applied. Skipping index: 1673757. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.705603 1 draft.go:915] Proposal with key: 18288822768274597049 already applied. Skipping index: 1673759. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.705908 1 draft.go:915] Proposal with key: 18288822768274597050 already applied. Skipping index: 1673760. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.706153 1 draft.go:915] Proposal with key: 18288822768274597051 already applied. Skipping index: 1673761. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.708291 1 draft.go:915] Proposal with key: 18288822768274597044 already applied. Skipping index: 1673743. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.708315 1 draft.go:915] Proposal with key: 18288822768274597045 already applied. Skipping index: 1673745. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.708328 1 draft.go:915] Proposal with key: 18288822768274597044 already applied. Skipping index: 1673749. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.708336 1 draft.go:915] Proposal with key: 18288822768274597045 already applied. Skipping index: 1673750. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.712847 1 draft.go:915] Proposal with key: 18288822768274597053 already applied. Skipping index: 1673783. Delta: txns:<start_ts:2287769 > txns:<start_ts:22843
I1224 16:41:32.713736 1 draft.go:915] Proposal with key: 18288822768274597054 already applied. Skipping index: 1673785. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.715434 1 draft.go:915] Proposal with key: 18288822768274597055 already applied. Skipping index: 1673788. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.738477 1 draft.go:915] Proposal with key: 14419583477275073265 already applied. Skipping index: 1673792. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.742548 1 draft.go:1321] Blocked pushing to raft node applyCh for 15.121s
I1224 16:41:32.760963 1 draft.go:915] Proposal with key: 18288822768274597077 already applied. Skipping index: 1673829. Delta: txns:<start_ts:2288010 > txns:<start_ts:22880
I1224 16:41:32.761854 1 draft.go:915] Proposal with key: 18288822768274597078 already applied. Skipping index: 1673831. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.771756 1 draft.go:915] Proposal with key: 18288822768274597080 already applied. Skipping index: 1673843. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.772883 1 draft.go:915] Proposal with key: 18288822768274597082 already applied. Skipping index: 1673845. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.776450 1 draft.go:915] Proposal with key: 18288822768274597084 already applied. Skipping index: 1673852. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.777146 1 draft.go:915] Proposal with key: 18288822768274597086 already applied. Skipping index: 1673854. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.777766 1 draft.go:915] Proposal with key: 18288822768274597087 already applied. Skipping index: 1673855. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.778627 1 draft.go:915] Proposal with key: 18288822768274597088 already applied. Skipping index: 1673856. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.780834 1 draft.go:915] Proposal with key: 18288822768274597089 already applied. Skipping index: 1673859. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.782344 1 draft.go:915] Proposal with key: 18288822768274597090 already applied. Skipping index: 1673861. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.783819 1 draft.go:915] Proposal with key: 18288822768274597091 already applied. Skipping index: 1673863. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.786879 1 draft.go:915] Proposal with key: 18288822768274597092 already applied. Skipping index: 1673867. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.788827 1 draft.go:915] Proposal with key: 18288822768274597093 already applied. Skipping index: 1673870. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.789957 1 draft.go:915] Proposal with key: 18288822768274597094 already applied. Skipping index: 1673872. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.791901 1 draft.go:915] Proposal with key: 18288822768274597095 already applied. Skipping index: 1673875. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.792583 1 draft.go:915] Proposal with key: 18288822768274597096 already applied. Skipping index: 1673876. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.793938 1 draft.go:915] Proposal with key: 18288822768274597097 already applied. Skipping index: 1673878. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.795227 1 draft.go:915] Proposal with key: 18288822768274597098 already applied. Skipping index: 1673880. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.797911 1 draft.go:915] Proposal with key: 18288822768274597100 already applied. Skipping index: 1673884. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.798537 1 draft.go:915] Proposal with key: 18288822768274597099 already applied. Skipping index: 1673885. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.802227 1 draft.go:915] Proposal with key: 18288822768274597101 already applied. Skipping index: 1673891. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.802873 1 draft.go:915] Proposal with key: 18288822768274597102 already applied. Skipping index: 1673892. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.807132 1 draft.go:915] Proposal with key: 18288822768274597103 already applied. Skipping index: 1673899. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.808630 1 draft.go:915] Proposal with key: 18288822768274597104 already applied. Skipping index: 1673900. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.814762 1 draft.go:915] Proposal with key: 18288822768274597105 already applied. Skipping index: 1673903. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.819607 1 draft.go:915] Proposal with key: 18288822768274597106 already applied. Skipping index: 1673909. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.839271 1 draft.go:915] Proposal with key: 18288822768274597107 already applied. Skipping index: 1673918. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.840098 1 draft.go:915] Proposal with key: 18288822768274597108 already applied. Skipping index: 1673919. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.840986 1 draft.go:915] Proposal with key: 18288822768274597109 already applied. Skipping index: 1673920. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.844175 1 draft.go:915] Proposal with key: 14419583477275073266 already applied. Skipping index: 1673924. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.844578 1 draft.go:915] Proposal with key: 14419583477275073267 already applied. Skipping index: 1673925. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.846978 1 draft.go:915] Proposal with key: 14419583477275073268 already applied. Skipping index: 1673928. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.847740 1 draft.go:915] Proposal with key: 14419583477275073269 already applied. Skipping index: 1673929. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.850774 1 draft.go:915] Proposal with key: 14419583477275073270 already applied. Skipping index: 1673932. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.878861 1 draft.go:915] Proposal with key: 18288822768274597079 already applied. Skipping index: 1673832. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.878896 1 draft.go:915] Proposal with key: 18288822768274597081 already applied. Skipping index: 1673842. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.878905 1 draft.go:915] Proposal with key: 18288822768274597083 already applied. Skipping index: 1673849. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.878917 1 draft.go:915] Proposal with key: 18288822768274597085 already applied. Skipping index: 1673853. Delta: <nil> Snapshot: <nil>.
I1224 16:41:32.884219 1 draft.go:915] Proposal with key: 14419583477275073271 already applied. Skipping index: 1673944. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.556245 1 draft.go:915] Proposal with key: 18251741697359166876 already applied. Skipping index: 1674045. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.556499 1 draft.go:915] Proposal with key: 18251741697359166877 already applied. Skipping index: 1674046. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.611549 1 draft.go:915] Proposal with key: 18251741697359166878 already applied. Skipping index: 1674099. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.611812 1 draft.go:915] Proposal with key: 18251741697359166876 already applied. Skipping index: 1674100. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.612103 1 draft.go:915] Proposal with key: 18251741697359166877 already applied. Skipping index: 1674101. Delta: <nil> Snapshot: <nil>.
I1224 16:41:33.612156 1 draft.go:915] Proposal with key: 18251741697359166878 already applied. Skipping index: 1674102. Delta: <nil> Snapshot: <nil>.
2021/12/24 16:41:33 Unable to find txn with start ts: 2287584
github.com/dgraph-io/dgraph/x.AssertTruef
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:107
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:707
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:744
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:931
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1020
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1581
iluminae
(Kenan Kessler)
December 24, 2021, 5:44pm
5
I removed the node that was damaged by the last panic (alpha-10) and re-added it with completely fresh storage. Here is what happened:
1. crashloops
2. removeNode
3. new node comes up, gets assigned to the free spot in that group
4. gets a snapshot from the leader
5. after the snapshot, gets txns sent to it from snapshotTS->now
6. panics at the same timestamp as before
See the cutely annotated screenshot of this in Grafana:
(update: a second removeNode and re-add, done after the next snapshot, worked as expected - but this highlights that whatever this issue is, it is committed to the Raft WAL.)
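For anyone following along, removeNode is Zero's HTTP admin endpoint; a minimal sketch of that step (the Zero address, raft ID, and group number below are placeholders for your cluster):
```
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder Zero address; /removeNode with id and group params is
	// Zero's documented admin endpoint.
	const zero = "http://dgraph-zero-0:6080"
	// id = raft ID of the damaged alpha, group = its raft group
	// (both placeholders here).
	resp, err := http.Get(fmt.Sprintf("%s/removeNode?id=%d&group=%d", zero, 10, 3))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```
After the removal, a fresh alpha with empty storage joins the group and receives a snapshot from the leader - which, per the update above, only helps if the snapshot has advanced past the poisoned WAL entries.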
1 Like
iluminae
(Kenan Kessler)
December 25, 2021, 12:38pm
6
Happened again: two members of a 3-member group panicked over the same txn, but not the third member. I was able to removeNode the two members and am currently re-adding them one at a time.
edit: the third one encountered this during a snapshot send, so we lost the whole cluster.
1 Like
iluminae
(Kenan Kessler)
December 27, 2021, 2:32pm
7
I have reduced our ingestion to 2 threads and have not encountered the issue since (after having to restore from backup 3 times on Christmas day), so it seems to be related to disk contention/IO throttling.
This is why we were so affected by the badgerdb manifest corruption too - our disks are extremely throttled (GCP pd-ssd @ 2TB, the second-highest IOPS tier for pd-ssd on GCP), and even 2 threads inserting is enough to trigger throttling on our disks.
What configuration does Dgraph Cloud use on GCP to avoid throttling? The real issue needs to be solved in Dgraph, but I am tired of rebuilding my cluster. At least backup/restore is OSS now - if I had to bulk load each of these times it would be even more unpalatable.
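For reference, a minimal sketch of what "ingestion on 2 threads" means here, using dgo with a 2-slot semaphore - not our actual pipeline, just an illustration with placeholder connection target and data:
```
package main

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Placeholder alpha address.
	conn, err := grpc.Dial("dgraph-alpha:9080", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	sem := make(chan struct{}, 2) // at most 2 in-flight mutations
	for i := 0; i < 100; i++ {
		sem <- struct{}{} // acquire a slot before starting a mutation
		go func(i int) {
			defer func() { <-sem }() // release the slot when done
			mu := &api.Mutation{
				SetNquads: []byte(fmt.Sprintf(`_:n <name> "node-%d" .`, i)),
				CommitNow: true,
			}
			if _, err := dg.NewTxn().Mutate(context.Background(), mu); err != nil {
				fmt.Println("mutation failed:", err)
			}
		}(i)
	}
	// Drain: filling the semaphore waits for all in-flight goroutines.
	for i := 0; i < cap(sem); i++ {
		sem <- struct{}{}
	}
}
```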
1 Like
iluminae
(Kenan Kessler)
December 30, 2021, 3:58pm
8
Just an update here: the problems with v21.12.0 were just too much to handle, and I had to roll back to v21.03.2 plus the manifest corruption patch. We found divergent data on peers in a group and had to call it quits after that.
This new corruption/panic happens somewhere around the oracle's tracking of pending mutations, and I feel like I am really close with my research in this comment, but I need someone with more knowledge of this part of the codebase to validate.
The query performance of v21.12.0 is fantastic - unless you are inserting data, in which case it is incredibly slow. Rolling back to v21.03 has fixed this: query times during ingestion are back to ~150ms at the 50th percentile, as opposed to 2s-4s. See V21.12 slow queries
Please release a v21.03.4 with the badger fixes if you can.
I am so glad backup+restore was open sourced. I really want to use it on v21.03 until v21.12 is fixed… maybe switch it over to OSS in a v21.03.4 release?
2 Likes
gkocur
January 28, 2022, 9:02am
9
@iluminae any progress with your setup?
I have the same problem: all 3 alphas in the group crash with the same error:
I0128 08:56:19.765476 18 groups.go:867] Got address of a Zero leader: intouch-dgraph-zero-1.intouch-dgraph-zero-headless.staging.svc.cluster.local:5080
I0128 08:56:19.766039 18 groups.go:881] Starting a new membership stream receive from intouch-dgraph-zero-1.intouch-dgraph-zero-headless.staging.svc.cluster.local:5080.
I0128 08:56:19.768307 18 groups.go:898] Received first state update from Zero: counter:73698 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"intouch-dgraph-alpha-0.intouch-dgraph-alpha-headless.staging.svc.cluster.local:7080" last_update:1643105936 > > members:<key:2 value:<id:2 group_id:1 addr:"intouch-dgraph-alpha-1.intouch-dgraph-alpha-headless.staging.svc.cluster.local:7080" last_update:1643337197 > > members:<key:3 value:<id:3 group_id:1 addr:"intouch-dgraph-alpha-2.intouch-dgraph-alpha-headless.staging.svc.cluster.local:7080" last_update:1643337206 > > tablets:<key:"0-assetState" value:<group_id:1 predicate:"0-assetState" on_disk_bytes:6201115 uncompressed_bytes:12128311 > > tablets:<key:"0-base64EncodedDefinition" value:<group_id:1 predicate:"0-base64EncodedDefinition" on_disk_bytes:138314480 uncompressed_bytes:582461588 > > tablets:<key:"0-base64EncodedWorkflowConfiguration" value:<group_id:1 predicate:"0-base64EncodedWorkflowConfiguration" on_disk_bytes:9819 uncompressed_bytes:32627 > > tablets:<key:"0-bitmap$init$0" value:<group_id:1 predicate:"0-bitmap$init$0" > > tablets:<key:"0-checkedOutBy" value:<group_id:1 predicate:"0-checkedOutBy" on_disk_bytes:207 uncompressed_bytes:65 > > tablets:<key:"0-childOf" value:<group_id:1 predicate:"0-childOf" on_disk_bytes:2774 uncompressed_bytes:24312 > > tablets:<key:"0-color" value:<group_id:1 predicate:"0-color" on_disk_bytes:190 uncompressed_bytes:58 > > tablets:<key:"0-contentType" value:<group_id:1 predicate:"0-contentType" on_disk_bytes:2059 uncompressed_bytes:5021 > > tablets:<key:"0-created" value:<group_id:1 predicate:"0-created" on_disk_bytes:288 uncompressed_bytes:457 > > tablets:<key:"0-createdBy" value:<group_id:1 predicate:"0-createdBy" on_disk_bytes:4924838 uncompressed_bytes:11617351 > > tablets:<key:"0-deliveryMethods" value:<group_id:1 predicate:"0-deliveryMethods" on_disk_bytes:471 uncompressed_bytes:1204 > > tablets:<key:"0-description" value:<group_id:1 predicate:"0-description" on_disk_bytes:10845 uncompressed_bytes:61598 > > tablets:<key:"0-dgraph.acl.rule" value:<group_id:1 predicate:"0-dgraph.acl.rule" > > tablets:<key:"0-dgraph.cors" value:<group_id:1 predicate:"0-dgraph.cors" > > tablets:<key:"0-dgraph.drop.op" value:<group_id:1 predicate:"0-dgraph.drop.op" > > tablets:<key:"0-dgraph.graphql.p_query" value:<group_id:1 predicate:"0-dgraph.graphql.p_query" > > tablets:<key:"0-dgraph.graphql.p_sha256hash" value:<group_id:1 predicate:"0-dgraph.graphql.p_sha256hash" > > tablets:<key:"0-dgraph.graphql.schema" value:<group_id:1 predicate:"0-dgraph.graphql.schema" > > tablets:<key:"0-dgraph.graphql.schema_created_at" value:<group_id:1 predicate:"0-dgraph.graphql.schema_created_at" > > tablets:<key:"0-dgraph.graphql.schema_history" value:<group_id:1 predicate:"0-dgraph.graphql.schema_history" > > tablets:<key:"0-dgraph.graphql.xid" value:<group_id:1 predicate:"0-dgraph.graphql.xid" > > tablets:<key:"0-dgraph.group.acl" value:<group_id:1 predicate:"0-dgraph.group.acl" > > tablets:<key:"0-dgraph.password" value:<group_id:1 predicate:"0-dgraph.password" > > tablets:<key:"0-dgraph.rule.permission" value:<group_id:1 predicate:"0-dgraph.rule.permission" > > tablets:<key:"0-dgraph.rule.predicate" value:<group_id:1 predicate:"0-dgraph.rule.predicate" > > tablets:<key:"0-dgraph.type" value:<group_id:1 predicate:"0-dgraph.type" on_disk_bytes:5681552 uncompressed_bytes:13055820 > > tablets:<key:"0-dgraph.user.group" value:<group_id:1 predicate:"0-dgraph.user.group" > > tablets:<key:"0-dgraph.xid" value:<group_id:1 
predicate:"0-dgraph.xid" > > tablets:<key:"0-inWorkflow" value:<group_id:1 predicate:"0-inWorkflow" on_disk_bytes:283 uncompressed_bytes:446 > > tablets:<key:"0-locks" value:<group_id:1 predicate:"0-locks" on_disk_bytes:129336 uncompressed_bytes:656025 > > tablets:<key:"0-logger" value:<group_id:1 predicate:"0-logger" > > tablets:<key:"0-name" value:<group_id:1 predicate:"0-name" on_disk_bytes:3984283 uncompressed_bytes:5017783 > > tablets:<key:"0-nextVersion" value:<group_id:1 predicate:"0-nextVersion" on_disk_bytes:351421 uncompressed_bytes:2037143 > > tablets:<key:"0-nodeType" value:<group_id:1 predicate:"0-nodeType" on_disk_bytes:804367 uncompressed_bytes:4415210 > > tablets:<key:"0-ownerId" value:<group_id:1 predicate:"0-ownerId" on_disk_bytes:277 uncompressed_bytes:449 > > tablets:<key:"0-parentOf" value:<group_id:1 predicate:"0-parentOf" on_disk_bytes:296810 uncompressed_bytes:1429209 > > tablets:<key:"0-previousVersion" value:<group_id:1 predicate:"0-previousVersion" on_disk_bytes:264 uncompressed_bytes:423 > > tablets:<key:"0-processId" value:<group_id:1 predicate:"0-processId" > > tablets:<key:"0-publishedTo" value:<group_id:1 predicate:"0-publishedTo" on_disk_bytes:1867862 uncompressed_bytes:14688251 > > tablets:<key:"0-qaReviewer" value:<group_id:1 predicate:"0-qaReviewer" > > tablets:<key:"0-sourceOf" value:<group_id:1 predicate:"0-sourceOf" on_disk_bytes:195 uncompressed_bytes:61 > > tablets:<key:"0-staticId" value:<group_id:1 predicate:"0-staticId" on_disk_bytes:308 uncompressed_bytes:479 > > tablets:<key:"0-subtype" value:<group_id:1 predicate:"0-subtype" on_disk_bytes:196 uncompressed_bytes:60 > > tablets:<key:"0-tenant" value:<group_id:1 predicate:"0-tenant" on_disk_bytes:75020 uncompressed_bytes:638136 > > tablets:<key:"0-textRepresentation" value:<group_id:1 predicate:"0-textRepresentation" on_disk_bytes:5958 uncompressed_bytes:18225 > > tablets:<key:"0-uid" value:<group_id:1 predicate:"0-uid" > > tablets:<key:"0-updated" value:<group_id:1 predicate:"0-updated" on_disk_bytes:878016 uncompressed_bytes:6049651 > > tablets:<key:"0-usedBy" value:<group_id:1 predicate:"0-usedBy" on_disk_bytes:193 uncompressed_bytes:59 > > tablets:<key:"0-uses" value:<group_id:1 predicate:"0-uses" on_disk_bytes:245 uncompressed_bytes:412 > > tablets:<key:"0-validFrom" value:<group_id:1 predicate:"0-validFrom" on_disk_bytes:203 uncompressed_bytes:62 > > tablets:<key:"0-validTo" value:<group_id:1 predicate:"0-validTo" on_disk_bytes:192 uncompressed_bytes:60 > > tablets:<key:"0-versionId" value:<group_id:1 predicate:"0-versionId" on_disk_bytes:123306 uncompressed_bytes:412075 > > tablets:<key:"0-workflowState" value:<group_id:1 predicate:"0-workflowState" > > snapshot_ts:682184 checksum:10996207276505420656 > > zeros:<key:1 value:<id:1 addr:"intouch-dgraph-zero-0.intouch-dgraph-zero-headless.staging.svc.cluster.local:5080" > > zeros:<key:2 value:<id:2 addr:"intouch-dgraph-zero-1.intouch-dgraph-zero-headless.staging.svc.cluster.local:5080" leader:true > > zeros:<key:3 value:<id:3 addr:"intouch-dgraph-zero-2.intouch-dgraph-zero-headless.staging.svc.cluster.local:5080" > > maxUID:27278073 maxTxnTs:690000 maxRaftId:3 cid:"70c16084-935e-4cd6-a783-155a039fff7a" license:<maxNodes:18446744073709551615 expiryTs:1645064788 enabled:true >
2022/01/28 08:56:19 Unable to find txn with start ts: 682333
github.com/dgraph-io/dgraph/x.AssertTruef
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:107
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:707
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:744
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:931
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1020
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1581
It's the second time I've hit this; the first time it was only 1 node, and restoring it with fresh storage helped.
1 Like
iluminae
(Kenan Kessler)
January 28, 2022, 1:44pm
10
No, I had to give up on v21.12 - between this bug and the extremely slow queries during ingestion, it was too much to try to figure out alone.
2 Likes
gkocur
February 8, 2022, 2:04pm
11
@iluminae thank you for your answer, I have one more question for you:
do you remember if you used a client library version aligned to the server version?
I'm asking because since we upgraded the client lib we haven't had this issue any more.
But it could be a coincidence.
1 Like
iluminae
(Kenan Kessler)
February 9, 2022, 5:30pm
12
We use dgo; the last library version that was tagged is from April 2021, and we are using that one, both on v21.03 and v21.12.
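For anyone cross-checking versions, the go.mod pin would look like this - v210.03.0 is my assumption for the April 2021 tag mentioned above; verify against the dgo releases:
```
// go.mod sketch; the tag below is assumed, check the dgo repo for the
// exact last-tagged release.
require github.com/dgraph-io/dgo/v210 v210.03.0
```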
1 Like
What patch are you using for the manifest corruption issue in v21.03.2?
iluminae
(Kenan Kessler)
February 18, 2022, 6:43pm
15
Dgraph at tag v21.03.2, but with a change to go.mod pinning badger at commit d918b9904b2a, with 6ed45ae41e5a97b54b1ffb37a3ad86229dba7bf3 cherry-picked on top.
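Concretely, that kind of pin is a go.mod replace directive - a sketch only, with a hypothetical fork holding the badger commit plus the cherry-pick (the fork path and pseudo-version are placeholders; `go get <module>@<sha>` resolves the real pseudo-version):
```
// go.mod sketch (values illustrative). The fork/branch at your-org holds
// badger commit d918b9904b2a with the fix cherry-picked on top; the
// pseudo-version below is a placeholder.
replace github.com/dgraph-io/badger/v3 => github.com/your-org/badger/v3 v3.0.0-00010101000000-abcdef123456
```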
1 Like
This critical issue still seems to be open.
Is it because it's a problem specific to the discontinued v21.12.0?
opened 01:08AM - 05 Jan 22 UTC · labels: kind/bug, investigate, area/crash, community
Github issues are back! Putting in this one now:
Following on topic here:
https://discuss.dgraph.io/t/critical-bug-in-v21-12-permanently-crashloops-whole-groups/16383/7
A panic is continually observed here:
```
2021/12/22 06:56:16 Unable to find txn with start ts: 2178082
github.com/dgraph-io/dgraph/x.AssertTruef
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:107
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:707
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:744
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:931
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func2
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:970
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1025
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1581
```
If this panic occurs, the node will restart and replay its WAL - and will encounter the same issue again, permanently disabling the node. This tends to happen to entire groups at once.
v21.12:
```
$ dgraph version
Dgraph version : v21.12.0
Dgraph codename : zion
Dgraph SHA-256 : 078c75df9fa1057447c8c8afc10ea57cb0a29dfb22f9e61d8c334882b4b4eb37
Commit SHA-1 : d62ed5f15
Commit timestamp : 2021-12-02 21:20:09 +0530
Branch : HEAD
Go version : go1.17.3
jemalloc enabled : true
```
Can I avoid this problem if I upgrade to v22.0.x or higher?