Alpha node gets restarted with: invalid memory address or nil pointer dereference

Moved from GitHub dgraph/4095

Posted by igormiletic:

Note: this might be a duplicate of #4053

Dgraph version : v1.1.0
Dgraph SHA-256 : 7d4294a80f74692695467e2cf17f74648c18087ed7057d798f40e1d3a31d2095
Commit SHA-1 : ef7cdb2
Commit timestamp : 2019-09-04 00:12:51 -0700
Branch : HEAD
Go version : go1.12.7


Kubernetes 1.13.
The setup follows Dgraph's HA documentation: 3 Alpha nodes and 3 Zero nodes.


Stack trace:

I0927 14:57:44.191379       1 draft.go:415] List rollup at Ts 9496769: OK.
E0927 14:57:44.963204       1 log.go:32] Failure while flushing memtable to disk: : open p/000003.sst: file exists. Retrying...
E0927 14:57:45.963365       1 log.go:32] Failure while flushing memtable to disk: : open p/000004.sst: file exists. Retrying...
E0927 14:57:46.963501       1 log.go:32] Failure while flushing memtable to disk: : open p/000005.sst: file exists. Retrying...
E0927 14:57:47.979158       1 log.go:32] Failure while flushing memtable to disk: : open p/000006.sst: file exists. Retrying...
E0927 14:57:48.979361       1 log.go:32] Failure while flushing memtable to disk: : open p/000007.sst: file exists. Retrying...
E0927 14:57:49.979543       1 log.go:32] Failure while flushing memtable to disk: : open p/000008.sst: file exists. Retrying...
E0927 14:57:50.981458       1 log.go:32] Failure while flushing memtable to disk: : open p/000009.sst: file exists. Retrying...
E0927 14:57:51.985523       1 log.go:32] Failure while flushing memtable to disk: : open p/000010.sst: file exists. Retrying...
E0927 14:57:52.985724       1 log.go:32] Failure while flushing memtable to disk: : open p/000011.sst: file exists. Retrying...
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
	panic: Unclosed iterator at time of Txn.Discard.
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x119b130]
goroutine 33068104 [running]:
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Txn).Discard(0xc0d68dae00)
	/tmp/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/txn.go:446 +0xde
panic(0x160b1a0, 0x21fa350)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Arena).reset(...)
	/tmp/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/arena.go:58
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Skiplist).DecrRef(...)

– The node gets restarted and is not able to join the cluster again

igormiletic commented :

After the restart, the error below appears constantly:

E0929 12:34:58.529130       1 groups.go:322] Error while proposing node removal: Node 0x4 not part of group
github.com/dgraph-io/dgraph/conn.(*Node).ProposePeerRemoval
	/tmp/go/src/github.com/dgraph-io/dgraph/conn/node.go:594
github.com/dgraph-io/dgraph/worker.(*groupi).applyState.func1
	/tmp/go/src/github.com/dgraph-io/dgraph/worker/groups.go:320
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337

mangalaman93 commented :

cc @jarifibrahim @manishrjain

jarifibrahim commented :

This looks related to https://github.com/dgraph-io/badger/pull/1034 . We have a nil memtable in badger.

jarifibrahim commented :

I see three panics in the logs:

  1. Segmentation fault panic (Originated in badger or dgraph)
  2. Nil pointer dereference panic (originated in badger because badger was being closed)
  3. Txn.Discard panic (originated in badger as a result of panic 2)

I don’t see any logs for the seg-fault panic.
Panics 2 and 3 are related: the memtable was nil, which happens only when badger is closed. This will be fixed via https://github.com/dgraph-io/badger/pull/1034

@igormiletic is this issue reproducible? Could you help me reproduce this?

igormiletic commented :

Hm, this was reproducible by running a heavy query that uses all of the node's memory.

We learned to be careful with the memory that Alpha nodes use.

At the moment I can only suggest you try something like:

  1. run Alpha nodes with less memory (e.g. 4 GB)
  2. write a query (e.g. with @recurse) that will use all available memory

In these cases our Kubernetes was not able to properly recover Dgraph.

This issue arose while we were testing with our realtime data; we realized that Dgraph is not stable when it hits memory limits for any reason. Now we are trying to avoid that.

Try this; it's the best help I can offer at the moment, since we have torn down the environment where we had this issue.

martinmr commented :

@jarifibrahim I am not even sure this stack trace was copied in its entirety.

Segmentation faults always print the stack, but there are two segmentation faults and only one stack trace. Also, for each element in the stack trace, the method and line are printed in that order, yet the last element has only the method name. So this stack trace doesn't look complete.

@igormiletic Would you happen to have kept the entire logs for this alpha? Thanks.

manishrjain commented :

This issue seems relevant: https://github.com/dgraph-io/badger/pull/1034