Unrecoverable Assert failed error in a container of a Kubernetes Pod

I set up a Dgraph (v0.9.3) cluster with 3 zero instances and 10 servers running in a single Kubernetes pod.
One of the servers is hitting an “Assert failed” error.

This particular container cannot restart properly and always runs into the same error.
Could it be that predicates were being reassigned when a mutation was received? The error appears about 10 minutes after ingestion starts.
Is there a way to get around this issue?

@jzhu077: Did you also notice this error before the restart?

Yes, it ran into this error and then crashed.

Do you have the logs from zero before the crash?

Can you please try the same on the latest master?

I don’t have the logs from zero before the crash, since I have reset it and extended the rebalancing time so I can carry on with my testing.

I will try the same on the latest master after I finish the test.

It’s strange that only 1 of the 10 Dgraph instances has this problem.

dgraph zero log

Groups sorted by size: [{gid:20 size:4554185} {gid:26 size:4657067} {gid:21 size:5071232} {gid:18 size:5787171} {gid:11 size:6685647} {gid:27 size:6792231} {gid:29 size:7302747} {gid:23 size:7506995} {gid:15 size:15740063} {gid:1 size:17982928}]

2017/12/13 08:15:32 tablet.go:170: size_diff 13428743
2017/12/13 08:15:32 tablet.go:87: Going to move predicate _dummy_ from 1 to 20
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgApp 1-->3
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgAppResp 3-->1
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgApp 1-->3
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgAppResp 3-->1
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgApp 1-->3
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgAppResp 3-->1
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgApp 1-->3
2017/12/13 08:15:32 tablet.go:91: Error while trying to move predicate _dummy_ from 1 to 20: rpc error: code = Unavailable desc = transport is closing
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgAppResp 3-->1
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:32 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:32 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:33 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 08:15:35 zero.go:293: Got connection request: id:1 addr:"localhost:7092" 
2017/12/13 08:15:35 zero.go:389: Connected
2017/12/13 08:15:38 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:38 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:15:42 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 08:15:43 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 08:15:50 zero.go:293: Got connection request: id:1 addr:"localhost:7092" 
2017/12/13 08:15:50 zero.go:389: Connected
2017/12/13 08:15:52 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 08:15:53 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 08:15:57 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 08:15:57 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 08:16:02 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 08:16:03 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

...(a lot of repetitive logs)

2017/12/13 20:10:02 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:03 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:12 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:13 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:22 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:23 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:32 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:33 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:38 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 20:10:38 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 20:10:42 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:43 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:52 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:10:53 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:10:57 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 20:10:57 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 20:11:00 zero.go:293: Got connection request: id:1 addr:"localhost:7092" 
2017/12/13 20:11:00 zero.go:389: Connected
2017/12/13 20:11:02 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:03 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:12 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:13 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:22 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:23 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:32 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:33 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:38 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 20:11:38 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 20:11:42 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:43 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:52 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:11:53 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:11:57 node.go:485: RECEIVED: MsgReadIndex 3-->1
2017/12/13 20:11:57 node.go:162: 		SENDING: MsgReadIndexResp 1-->3
2017/12/13 20:12:02 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:12:03 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:12:12 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:12:13 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:12:22 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:12:23 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2017/12/13 20:12:32 oracle.go:372: No healthy connection found to leader of group 1
2017/12/13 20:12:33 pool.go:168: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
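
For anyone reading the zero log above: the reported `size_diff 13428743` is consistent with the gap between the largest group (gid 1) and the smallest group (gid 20), which is also the pair named in the "Going to move predicate" line. The sketch below only checks that arithmetic against the pasted log line. It is an inference from this log, not a description of Dgraph's rebalancing code, and the function names `parse_groups` and `size_spread` are made up for illustration.

```python
import re

# "Groups sorted by size" line copied verbatim from the zero log above.
GROUPS_LINE = (
    "Groups sorted by size: [{gid:20 size:4554185} {gid:26 size:4657067} "
    "{gid:21 size:5071232} {gid:18 size:5787171} {gid:11 size:6685647} "
    "{gid:27 size:6792231} {gid:29 size:7302747} {gid:23 size:7506995} "
    "{gid:15 size:15740063} {gid:1 size:17982928}]"
)


def parse_groups(line):
    """Return a list of (gid, size) tuples found in the log line."""
    pairs = re.findall(r"\{gid:(\d+) size:(\d+)\}", line)
    return [(int(gid), int(size)) for gid, size in pairs]


def size_spread(groups):
    """Return (smallest, largest, size difference) for the given groups."""
    smallest = min(groups, key=lambda g: g[1])
    largest = max(groups, key=lambda g: g[1])
    return smallest, largest, largest[1] - smallest[1]


smallest, largest, diff = size_spread(parse_groups(GROUPS_LINE))
print(f"largest gid={largest[0]}, smallest gid={smallest[0]}, diff={diff}")
```

Running this against the line from the log gives a difference of 13428743 between gid 1 and gid 20, which matches the `size_diff` value that zero printed just before attempting the move.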

Did you try on the latest master branch? The issue shouldn’t occur on master.

My mistake, I didn’t use the correct commit. Cheers.
