Report a Dgraph Bug
What version of Dgraph are you using?
Dgraph Version
$ dgraph version
Dgraph version : v21.12.0
Dgraph codename : zion
Dgraph SHA-256 : 078c75df9fa1057447c8c8afc10ea57cb0a29dfb22f9e61d8c334882b4b4eb37
Commit SHA-1 : d62ed5f15
Commit timestamp : 2021-12-02 21:20:09 +0530
Branch : HEAD
Go version : go1.17.3
jemalloc enabled : true
Have you tried reproducing the issue with the latest release?
yes
What is the hardware spec (RAM, OS)?
Kubernetes on GKE, nodes with 16 CPU cores and 64 GiB RAM.
Steps to reproduce the issue (command/config used to run Dgraph).
Run a tablet move on a large tablet (~40 GiB) while ingestion is in progress.
I had first done a test move of a smaller tablet, which took about 9 minutes, and then started the move of the larger tablet. Mutations were running against the cluster the whole time.
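Roughly, the reproduction looks like the sketch below (the predicate name, group number, and addresses are placeholders, not the real values from my cluster): trigger the move through Zero's /moveTablet HTTP endpoint while a dgo client keeps committing mutations on the predicate being moved.

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"net/url"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to an Alpha for mutations (placeholder address).
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Keep mutations flowing against the predicate that is being moved.
	go func() {
		for i := 0; ; i++ {
			txn := dg.NewTxn()
			_, err := txn.Mutate(context.Background(), &api.Mutation{
				SetNquads: []byte(fmt.Sprintf(`_:n <myIntPredicate> "%d" .`, i)),
				CommitNow: true,
			})
			if err != nil {
				log.Printf("mutation error: %v", err)
			}
		}
	}()

	// Ask Zero to move the large tablet to another group (placeholder
	// predicate and group; Zero's admin HTTP port defaults to 6080).
	moveURL := "http://localhost:6080/moveTablet?tablet=" +
		url.QueryEscape("myIntPredicate") + "&group=2"
	resp, err := http.Get(moveURL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("moveTablet response status: %s", resp.Status)

	select {} // keep mutating while the move runs server-side
}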
Expected behaviour and actual result.
Expected the move to complete (or fail cleanly) without disrupting the target group. Instead, the target group of the move is now in a broken state and is crash-looping with the following panic:
I1220 17:13:42.463401 1 schema.go:496] Setting schema for attr 0-XXXXXXXXX: int, tokenizer: [], directive: NONE, count: false
2021/12/20 17:13:42 Unable to find txn with start ts: 1419795
github.com/dgraph-io/dgraph/x.AssertTruef
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:107
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:707
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:744
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:931
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1020
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1581
The panic happens while the schema for the new tablet is being set: the apply path is looking for a transaction start timestamp that does not exist in the pendingTransactions
map. So perhaps the entry was removed from the map for some reason? The tablet move had been running for about 20 minutes but was nowhere near complete. Here is the last "Sending predicate"
log message from the source group:
Sending predicate: [0-XXXXXXXXX] [19m44s] Scan (8): ~10.0 GiB/39 GiB at 11 MiB/sec. Sent: 10.0 GiB at 12 MiB/sec
My guess is that the process that periodically aborts old transactions aborted the transaction backing the predicate move/schema update, and that took down the whole target group as a result.
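To illustrate the race I am guessing at (this is only a simplified sketch, not Dgraph's actual code): if a background job deletes a start ts from the pending-transactions map while the Raft apply path still has a committed mutation queued under that start ts, the lookup-and-assert in the apply path finds nothing and panics, which would match the "Unable to find txn with start ts" message above.

// Hypothetical sketch of the suspected race; names and structure are made up.
package main

import (
	"fmt"
	"sync"
)

type oracle struct {
	mu      sync.Mutex
	pending map[uint64]struct{} // start ts -> open txn
}

func (o *oracle) register(startTs uint64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.pending[startTs] = struct{}{}
}

// abortOld models the periodic job that aborts long-running transactions.
func (o *oracle) abortOld(startTs uint64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	delete(o.pending, startTs)
}

// applyMutation models the apply path: it asserts the txn still exists.
func (o *oracle) applyMutation(startTs uint64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if _, ok := o.pending[startTs]; !ok {
		panic(fmt.Sprintf("Unable to find txn with start ts: %d", startTs))
	}
	// ... apply the mutation under this txn ...
}

func main() {
	o := &oracle{pending: map[uint64]struct{}{}}
	const moveTs = 1419795 // the start ts from the panic above

	o.register(moveTs)
	o.abortOld(moveTs)      // background abort wins the race
	o.applyMutation(moveTs) // panics: the start ts is gone from the map
}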