Alpha node gets restarted with: invalid memory address or nil pointer dereference

Moved from GitHub dgraph/4095

Posted by igormiletic:

Note: this might be a duplicate of #4053

Dgraph version : v1.1.0
Dgraph SHA-256 : 7d4294a80f74692695467e2cf17f74648c18087ed7057d798f40e1d3a31d2095
Commit SHA-1 : ef7cdb2
Commit timestamp : 2019-09-04 00:12:51 -0700
Branch : HEAD
Go version : go1.12.7


Kubernetes 1.13.
The setup follows Dgraph's HA documentation: 3 Alpha nodes and 3 Zero nodes.


Stack trace:

I0927 14:57:44.191379       1 draft.go:415] List rollup at Ts 9496769: OK.
E0927 14:57:44.963204       1 log.go:32] Failure while flushing memtable to disk: : open p/000003.sst: file exists. Retrying...
E0927 14:57:45.963365       1 log.go:32] Failure while flushing memtable to disk: : open p/000004.sst: file exists. Retrying...
E0927 14:57:46.963501       1 log.go:32] Failure while flushing memtable to disk: : open p/000005.sst: file exists. Retrying...
E0927 14:57:47.979158       1 log.go:32] Failure while flushing memtable to disk: : open p/000006.sst: file exists. Retrying...
E0927 14:57:48.979361       1 log.go:32] Failure while flushing memtable to disk: : open p/000007.sst: file exists. Retrying...
E0927 14:57:49.979543       1 log.go:32] Failure while flushing memtable to disk: : open p/000008.sst: file exists. Retrying...
E0927 14:57:50.981458       1 log.go:32] Failure while flushing memtable to disk: : open p/000009.sst: file exists. Retrying...
E0927 14:57:51.985523       1 log.go:32] Failure while flushing memtable to disk: : open p/000010.sst: file exists. Retrying...
E0927 14:57:52.985724       1 log.go:32] Failure while flushing memtable to disk: : open p/000011.sst: file exists. Retrying...
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
	panic: Unclosed iterator at time of Txn.Discard.
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x119b130]
goroutine 33068104 [running]:
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Txn).Discard(0xc0d68dae00)
	/tmp/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/txn.go:446 +0xde
panic(0x160b1a0, 0x21fa350)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Arena).reset(...)
	/tmp/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/arena.go:58
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Skiplist).DecrRef(...)

– The node gets restarted and is not able to join the cluster again

igormiletic commented :

After the restart, the error below appears constantly:

E0929 12:34:58.529130       1 groups.go:322] Error while proposing node removal: Node 0x4 not part of group
github.com/dgraph-io/dgraph/conn.(*Node).ProposePeerRemoval
	/tmp/go/src/github.com/dgraph-io/dgraph/conn/node.go:594
github.com/dgraph-io/dgraph/worker.(*groupi).applyState.func1
	/tmp/go/src/github.com/dgraph-io/dgraph/worker/groups.go:320
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337

mangalaman93 commented :

cc @jarifibrahim @manishrjain

jarifibrahim commented :

This looks related to https://github.com/dgraph-io/badger/pull/1034 . We have a nil memtable in badger.

jarifibrahim commented :

I see three panics in the logs:

  1. Segmentation fault panic (Originated in badger or dgraph)
  2. Nil pointer dereference panic (originated in badger because badger was being closed)
  3. Txn.Discard panic (originated in badger as a result of panic 2)

I don’t see any logs for the seg-fault panic.
Panics 2 and 3 are related: the memtable was nil, which happens only when badger is closed. This will be fixed via https://github.com/dgraph-io/badger/pull/1034

@igormiletic is this issue reproducible? Could you help me reproduce this?

igormiletic commented :

Hm, this was reproducible by running a heavy query that uses all of the node's memory.

We learned to be careful with the memory that Alpha nodes use.

At the moment I can only suggest you try something like:

  1. run Alpha nodes with less memory (e.g. 4 GB)
  2. write a query (e.g. with @recurse) that will use all available memory

In these cases our Kubernetes was not able to properly recover Dgraph.

This issue arose while we were testing with our realtime data; we realized that Dgraph is not stable when it hits memory limits for any reason. Now we are trying to avoid that.

Try this; it's the best help I can offer at the moment, since we have torn down the environment where we had this issue.

martinmr commented :

@jarifibrahim I am not even sure this stack trace was copied in its entirety.

Segmentation faults always print the stack, but there are two segmentation faults and only one stack trace. Also, for each element in the stack trace, the method and line are printed in that order, yet the last element has only the method name. So this stack trace doesn't look complete.

@igormiletic Would you happen to have kept the entire logs for this alpha? Thanks.

manishrjain commented :

This issue seems relevant: https://github.com/dgraph-io/badger/pull/1034