Unrecoverable Assert failed error in a container of a Kubernetes Pod

jzhu077 · December 10, 2017, 11:03pm

I set up a Dgraph (v0.9.3) cluster with 3 zero instances and 10 servers, all running in a single Kubernetes pod.
One of the servers is hitting an “Assert failed” error.

This particular container cannot restart properly; it runs into the same error every time.
Could it be that predicates were being reassigned when a mutation was received? The error shows up about 10 minutes after ingestion starts.
Is there a way to work around this issue?

janardhan · December 11, 2017, 12:39am

@jzhu077: Did you also see this error before the restart?

jzhu077 · December 11, 2017, 12:51am

Yes, it ran into this error and then crashed.

janardhan · December 11, 2017, 3:43am

Do you have the logs from zero before the crash?

janardhan · December 11, 2017, 7:05am

Can you please try the same on the latest master?

jzhu077 · December 11, 2017, 9:07pm

I don’t have the zero logs from before the crash: I reset the cluster and extended the rebalancing time so that I could carry on with my testing.

I will try the same on the latest master after I finish the test.

jzhu077 · December 13, 2017, 8:21pm

It’s strange that only 1 of the 10 Dgraph instances has this problem.

Dgraph zero log:

Groups sorted by size: [{gid:20 size:4554185} {gid:26 size:4657067} {gid:21 size:5071232} {gid:18 size:5787171} {gid:11 size:6685647} {gid:27 size:6792231} {gid:29 size:7302747} {gid:23 size:7506995} {gid:15 size:15740063} {gid:1 size:17982928}]

2017/12/13 08:15:32 tablet.go:170: size_diff 13428743
2017/12/13 08:15:32 tablet.go:87: Going to move predicate _dummy_ from 1 to 20
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgApp 1-->3
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgAppResp 3-->1
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgApp 1-->3
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgAppResp 3-->1
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgApp 1-->3
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgAppResp 3-->1
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgApp 1-->3
2017/12/13 08:15:32 tablet.go:91: Error while trying to move predicate _dummy_ from 1 to 20: rpc error: code = Unavailable desc = transport is closing
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgAppResp 3-->1
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:33 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 08:15:35 zero.go:293: Got connection request: id:1 addr:"localhost:7092" 
2017/12/13 08:15:35 zero.go:389: Connected
2017/12/13 08:15:38 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:38 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:42 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 08:15:43 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 08:15:50 zero.go:293: Got connection request: id:1 addr:"localhost:7092" 
2017/12/13 08:15:50 zero.go:389: Connected
2017/12/13 08:15:52 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 08:15:53 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 08:15:57 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:57 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:16:02 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 08:16:03 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

...(a lot of repetitive logs)

2017/12/13 20:10:02 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:03 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:12 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:13 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:22 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:23 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:32 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:33 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:38 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 20:10:38 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 20:10:42 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:43 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:52 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:53 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:57 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 20:10:57 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 20:11:00 zero.go:293: Got connection request: id:1 addr:"localhost:7092" 
2017/12/13 20:11:00 zero.go:389: Connected
2017/12/13 20:11:02 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:03 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:12 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:13 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:22 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:23 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:32 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:33 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:38 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 20:11:38 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 20:11:42 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:43 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:52 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:53 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:57 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 20:11:57 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 20:12:02 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:12:03 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:12:12 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:12:13 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:12:22 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:12:23 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:12:32 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:12:33 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
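
A note on reading the first lines of the log above: the reported `size_diff` matches the gap between the largest group (gid 1) and the smallest group (gid 20), and those are the two groups named in the "Going to move predicate" line. The sketch below only reproduces that arithmetic from the logged values. It is an interpretation of the log output, not Dgraph's actual `tablet.go` logic, and the helper names `parse_groups` and `pick_move` are made up for illustration.

```python
import re

# "Groups sorted by size" line copied from the zero log above.
GROUPS_LINE = (
    "Groups sorted by size: [{gid:20 size:4554185} {gid:26 size:4657067} "
    "{gid:21 size:5071232} {gid:18 size:5787171} {gid:11 size:6685647} "
    "{gid:27 size:6792231} {gid:29 size:7302747} {gid:23 size:7506995} "
    "{gid:15 size:15740063} {gid:1 size:17982928}]"
)


def parse_groups(line):
    """Return the (gid, size) pairs found in a 'Groups sorted by size' log line."""
    pairs = re.findall(r"\{gid:(\d+) size:(\d+)\}", line)
    return [(int(gid), int(size)) for gid, size in pairs]


def pick_move(groups):
    """Return (source gid, destination gid, size difference).

    Assumption: the source is the largest group and the destination is the
    smallest one, which is what the logged move (1 to 20) suggests.
    """
    smallest = min(groups, key=lambda g: g[1])
    largest = max(groups, key=lambda g: g[1])
    return largest[0], smallest[0], largest[1] - smallest[1]


src, dst, size_diff = pick_move(parse_groups(GROUPS_LINE))
print(f"move from group {src} to group {dst}, size_diff {size_diff}")
# prints: move from group 1 to group 20, size_diff 13428743
```

The printed difference equals the `size_diff 13428743` value in the log, which is consistent with zero trying to move a predicate from group 1 to group 20 right before the connection to `localhost:7092` started failing.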

janardhan · December 13, 2017, 10:35pm

Did you try the latest master branch? The issue shouldn’t occur on master.

jzhu077 · December 14, 2017, 3:12am

My mistake, I didn’t use the correct commit. Cheers.

system · January 13, 2018, 3:12am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
