Node is not active after dropAll/panic

Moved from GitHub dgraph/3108

Posted by makitka2007:

version: latest 1.0.12

I see in the logs that one node performs a dropAll (possibly because the node was too far behind the leader? Either way it's strange, because the Amazon network is ~5 Gbit when I tested it). The node then panics and restarts, and after a few log messages it shows no activity at all; all queries to it hang. I restarted it, but nothing changed: a few records in the log and then no activity again. The other 2 nodes work fine (though loading speed seems slow for some reason).

full log of that node is here:

DropAll called. Blocking writes...
panic: send on closed channel

goroutine 65776580 [running]:
*DB).sendToWriteCh(0xc00047e300, 0xc01c5086b0, 0x2, 0x2, 0x6, 0x8, 0x13)
	/ext-go/1/src/ +0x166
*Txn).commitAndSend(0xc031a14c00, 0x0, 0x0, 0x0)
	/ext-go/1/src/ +0x473
*Txn).CommitWith(0xc031a14c00, 0xc01c5086a0)
	/ext-go/1/src/ +0x132
*Txn).CommitAt(0xc031a14c00, 0x455a1, 0xc01c5086a0, 0x20, 0xc044538b00)
	/ext-go/1/src/ +0x59
*TxnWriter).SetAt(0xc0133bd530, 0xc078972840, 0x13, 0x20, 0xc044538b00, 0xc, 0xc, 0x8, 0x455a1, 0x0, ...)
	/ext-go/1/src/ +0x1a7
*TxnWriter).Send(0xc0133bd530, 0xc04c86bce0, 0x8d14e8, 0x20e89784e29)
	/ext-go/1/src/ +0xc9
*node).rollupLists.func4(0xc037d63340, 0x1ee84a5c32f, 0x1fa2aa0)
	/ext-go/1/src/ +0x82
*Stream).streamKVs.func1(0xc037d63340, 0xc023471e6c, 0x3)
	/ext-go/1/src/ +0x2a8
*Stream).streamKVs(0xc07c6be380, 0x15ac8e0, 0xc0000b8010, 0x0, 0x0)
	/ext-go/1/src/ +0x567
*Stream).Orchestrate.func2(0xc032420480, 0xc07c6be380, 0x15ac8e0, 0xc0000b8010)
	/ext-go/1/src/ +0x3f
created by *Stream).Orchestrate
	/ext-go/1/src/ +0x206

manishrjain commented :

This would happen if a rollup is going on while snapshot retrieval does a drop all – relatively rare, but we can add some operation tracking to avoid these simultaneous executions.
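
For readers unfamiliar with this failure mode, here is a minimal sketch of how a "send on closed channel" panic can arise. It assumes, as the panic message suggests, that the drop-all path closes a channel that an in-flight writer (here, the rollup) is still sending to. The function `sendToWriteCh` and the channel below are hypothetical stand-ins written for illustration, not the actual Badger or Dgraph code.

```go
package main

import "fmt"

// sendToWriteCh is a hypothetical stand-in for a writer pushing a request
// onto a store's write channel. It converts a panic into an error so the
// failure can be observed without crashing the program.
func sendToWriteCh(writeCh chan<- int, v int) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	writeCh <- v // panics if writeCh has already been closed
	return nil
}

func main() {
	writeCh := make(chan int, 1)

	// While the channel is open, the send succeeds.
	fmt.Println(sendToWriteCh(writeCh, 1))

	// Stand-in for the drop-all path closing the channel (an assumption).
	close(writeCh)

	// A writer that is still running now hits the panic seen in the log.
	fmt.Println(sendToWriteCh(writeCh, 2))
}
```

In the real crash nothing recovers the panic, so the whole process goes down, which matches the restart described in the report.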

manishrjain commented :

Created a draft PR, "Track operations, so we can cancel rollup if needed" (Pull Request #3181, dgraph-io/dgraph), which keeps track of operations going on in Alphas and cancels them accordingly. It needs testing before I can submit it.