Panic on tablet move v21.12.0

Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version
$ dgraph version
 
Dgraph version   : v21.12.0
Dgraph codename  : zion
Dgraph SHA-256   : 078c75df9fa1057447c8c8afc10ea57cb0a29dfb22f9e61d8c334882b4b4eb37
Commit SHA-1     : d62ed5f15
Commit timestamp : 2021-12-02 21:20:09 +0530
Branch           : HEAD
Go version       : go1.17.3
jemalloc enabled : true

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

Kubernetes on GKE, 16 cores, 64 GiB RAM

Steps to reproduce the issue (command/config used to run Dgraph).

Run a tablet move on a large tablet (~40 GiB) while ingestion is in progress.

I first did a test move of a smaller tablet, which took ~9m, and then wanted to do a bigger tablet move as well. Mutations were happening at the time.
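For reference, here is roughly how I kicked off the move. This is only a minimal Go sketch, assuming Zero's /moveTablet HTTP endpoint on port 6080; the predicate name and target group are placeholders for my actual values:

// Minimal sketch: trigger a tablet move via Zero's HTTP endpoint
// (endpoint name and parameters are my assumption; adjust for your setup).
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// "large_pred" stands in for the real ~40 GiB predicate;
	// group 2 is the hypothetical target group.
	q := url.Values{}
	q.Set("tablet", "large_pred")
	q.Set("group", "2")

	resp, err := http.Get("http://zero:6080/moveTablet?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}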

Expected behaviour and actual result.

The target group of the move is now in a broken state and is crash-looping with the following panic:

I1220 17:13:42.463401       1 schema.go:496] Setting schema for attr 0-XXXXXXXXX: int, tokenizer: [], directive: NONE, count: false

2021/12/20 17:13:42 Unable to find txn with start ts: 1419795
github.com/dgraph-io/dgraph/x.AssertTruef
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:107
github.com/dgraph-io/dgraph/worker.(*node).applyMutations
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:707
github.com/dgraph-io/dgraph/worker.(*node).applyCommitted
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:744
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:931
github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
        /ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:1020
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581

This happens while setting the schema for the new tablet: the apply path looks for a transaction start ts that does not exist in the pendingTransactions map, so perhaps it was removed from the map for some reason. The tablet move had been running for ~20m but was nowhere near complete. Here is the last "Sending predicate" log message from the source group:

Sending predicate: [0-XXXXXXXXX] [19m44s] Scan (8): ~10.0 GiB/39 GiB at 11 MiB/sec. Sent: 10.0 GiB at 12 MiB/sec
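For context, my reading of the stack trace is that applyMutations looks up the mutation's start ts in the pending-transactions map and asserts it is present, so a missing entry crashes the node on every replay of that raft entry. A rough sketch of that pattern (the names are mine, not the actual Dgraph internals):

// Toy illustration of the assert-on-missing-txn pattern I think is failing.
package main

import "fmt"

type txn struct{ startTs uint64 }

type pendingTxns struct {
	txns map[uint64]*txn
}

func (p *pendingTxns) get(startTs uint64) *txn {
	return p.txns[startTs]
}

func assertTruef(cond bool, format string, args ...interface{}) {
	if !cond {
		// In Dgraph this is a fatal assertion; a panic is close enough here.
		panic(fmt.Sprintf(format, args...))
	}
}

func applyMutation(p *pendingTxns, startTs uint64) {
	t := p.get(startTs)
	// If the txn was already aborted/removed, this fires and the node
	// crash-loops on replay of the same entry.
	assertTruef(t != nil, "Unable to find txn with start ts: %d", startTs)
	_ = t // ... apply the mutation within t ...
}

func main() {
	p := &pendingTxns{txns: map[uint64]*txn{}}
	applyMutation(p, 1419795) // panics: txn not found
}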

Maybe the process that periodically aborts old transactions aborted the transaction backing the predicate move / schema update, and that took the whole group down as a result.
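To illustrate the theory, a toy sketch (all names hypothetical): a background sweeper drops transactions older than some threshold, with no exception for the in-flight move, so the move's start ts is gone from the pending map by the time the schema mutation is applied:

// Hypothetical periodic abort colliding with a long-running move txn.
package main

import (
	"fmt"
	"time"
)

type pendingTxns map[uint64]time.Time // startTs -> creation time

// abortOldTxns mimics a periodic cleaner that drops transactions older
// than maxAge, with no special case for an in-flight tablet move.
func abortOldTxns(p pendingTxns, maxAge time.Duration, now time.Time) {
	for ts, started := range p {
		if now.Sub(started) > maxAge {
			fmt.Printf("aborting txn %d (age %s)\n", ts, now.Sub(started))
			delete(p, ts)
		}
	}
}

func main() {
	p := pendingTxns{
		1419795: time.Now().Add(-20 * time.Minute), // the move txn, ~20m old
	}
	abortOldTxns(p, 5*time.Minute, time.Now())

	if _, ok := p[1419795]; !ok {
		fmt.Println("move txn is gone; a later schema apply would hit the assertion")
	}
}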


Were you able to restore the node after this error?