Node is not active after dropAll/panic

Moved from GitHub dgraph/3108

Posted by makitka2007:

Version: latest, 1.0.12

I see in the logs that one node performs a DropAll (possibly because the node fell too far behind the leader? That would be strange anyway, since the Amazon network is ~5 Gbit/s as I measured it), then it fails with a panic and restarts, and after a few more log messages it simply shows no activity; all queries to it hang. I restarted it, but nothing changed: a few records in the log and then again no activity. The other two nodes work fine (though loading speed seems slow for some reason).

full log of that node is here: https://drive.google.com/file/d/14rpMmGVoIsk7D-RwtnkOatSH7qPmP2e2/view?usp=sharing

DropAll called. Blocking writes...
panic: send on closed channel

goroutine 65776580 [running]:
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*DB).sendToWriteCh(0xc00047e300, 0xc01c5086b0, 0x2, 0x2, 0x6, 0x8, 0x13)
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/db.go:639 +0x166
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Txn).commitAndSend(0xc031a14c00, 0x0, 0x0, 0x0)
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/txn.go:539 +0x473
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Txn).CommitWith(0xc031a14c00, 0xc01c5086a0)
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/txn.go:643 +0x132
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Txn).CommitAt(0xc031a14c00, 0x455a1, 0xc01c5086a0, 0x20, 0xc044538b00)
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/managed_db.go:56 +0x59
github.com/dgraph-io/dgraph/posting.(*TxnWriter).SetAt(0xc0133bd530, 0xc078972840, 0x13, 0x20, 0xc044538b00, 0xc, 0xc, 0x8, 0x455a1, 0x0, ...)
	/ext-go/1/src/github.com/dgraph-io/dgraph/posting/writer.go:121 +0x1a7
github.com/dgraph-io/dgraph/posting.(*TxnWriter).Send(0xc0133bd530, 0xc04c86bce0, 0x8d14e8, 0x20e89784e29)
	/ext-go/1/src/github.com/dgraph-io/dgraph/posting/writer.go:64 +0xc9
github.com/dgraph-io/dgraph/worker.(*node).rollupLists.func4(0xc037d63340, 0x1ee84a5c32f, 0x1fa2aa0)
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:823 +0x82
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Stream).streamKVs.func1(0xc037d63340, 0xc023471e6c, 0x3)
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/stream.go:229 +0x2a8
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Stream).streamKVs(0xc07c6be380, 0x15ac8e0, 0xc0000b8010, 0x0, 0x0)
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/stream.go:260 +0x567
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Stream).Orchestrate.func2(0xc032420480, 0xc07c6be380, 0x15ac8e0, 0xc0000b8010)
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/stream.go:311 +0x3f
created by github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Stream).Orchestrate
	/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/stream.go:309 +0x206
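For context on the panic itself: Go always panics when a value is sent on a channel that has already been closed. A minimal standalone illustration (not Dgraph code) of the failure mode in the trace above, where DropAll closes Badger's internal write channel while an in-flight rollup commit is still trying to send to it:

package main

import "time"

func main() {
	writeCh := make(chan int)

	// Consumer that stops accepting work, analogous to DropAll
	// closing Badger's write channel.
	go func() {
		<-writeCh
		close(writeCh)
	}()

	writeCh <- 1                      // accepted, then the channel is closed
	time.Sleep(10 * time.Millisecond) // give the consumer time to close it

	// A second, still-in-flight writer (the rollup commit in the
	// trace above) now sends on the closed channel and the process
	// panics with "send on closed channel".
	writeCh <- 2
}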

manishrjain commented:

This would happen if a rollup is going on while snapshot retrieval does a DropAll. It's relatively rare, but we can add some operation tracking to avoid these simultaneous executions.
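A rough sketch of what such operation tracking could look like (hypothetical names and API, not the actual Dgraph code): each long-running operation registers with a tracker and gets a cancellable context, and an incoming DropAll cancels any in-flight rollup before it proceeds.

package worker

import (
	"context"
	"sync"
)

type opKind int

const (
	opRollup opKind = iota
	opDropAll
)

// opTracker keeps track of long-running operations on an Alpha so that
// conflicting ones (e.g. a rollup and a DropAll) never run concurrently.
type opTracker struct {
	mu      sync.Mutex
	running map[opKind]context.CancelFunc
}

func newOpTracker() *opTracker {
	return &opTracker{running: make(map[opKind]context.CancelFunc)}
}

// start registers an operation and returns a context that the operation
// must watch for cancellation, plus a done func to call when it finishes.
// If the new operation conflicts with a running one, the running one is
// cancelled first.
func (t *opTracker) start(parent context.Context, kind opKind) (context.Context, func()) {
	t.mu.Lock()
	defer t.mu.Unlock()

	// A DropAll cancels any in-flight rollup instead of racing with it.
	if kind == opDropAll {
		if cancelRollup, ok := t.running[opRollup]; ok {
			cancelRollup()
			delete(t.running, opRollup)
		}
	}

	ctx, cancel := context.WithCancel(parent)
	t.running[kind] = cancel

	done := func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		cancel()
		delete(t.running, kind)
	}
	return ctx, done
}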

manishrjain commented:

Created a draft PR, dgraph-io/dgraph#3181 ("Track operations, so we can cancel rollup if needed"), which keeps track of operations going on in Alphas and cancels them accordingly. Needs testing before I can submit it.
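Building on the hypothetical tracker sketch above (again, not the PR's actual API), usage on the rollup and DropAll paths might look roughly like this:

// Rollup path: bail out if a concurrent DropAll cancels us mid-stream.
func runRollup(tracker *opTracker) error {
	ctx, done := tracker.start(context.Background(), opRollup)
	defer done()

	// ... stream and write rolled-up posting lists, checking ctx regularly ...
	select {
	case <-ctx.Done():
		return ctx.Err() // cancelled by a concurrent DropAll
	default:
	}
	return nil
}

// DropAll path: cancelling an in-flight rollup happens inside
// tracker.start, before any data is dropped.
func runDropAll(tracker *opTracker) error {
	_, done := tracker.start(context.Background(), opDropAll)
	defer done()
	// ... block writes and drop all data ...
	return nil
}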