Hi team, I am running a Dgraph cluster with the following layout -
Machine a - 1 Zero, 3 Alphas
Machine b - 3 Alphas
with a replication factor of 3, which forms two groups. This can also be seen in the image below.
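For reference, the cluster is started roughly like this (a simplified sketch - hostnames, ports, and version-specific flags are placeholders, not the exact commands):

# On machine a (rough sketch; each Alpha also gets its own -p/-w directories, omitted here)
dgraph zero --my=machine-a:5080 --replicas 3
dgraph alpha --my=machine-a:7080 --zero=machine-a:5080 -o 0
dgraph alpha --my=machine-a:7081 --zero=machine-a:5080 -o 1
dgraph alpha --my=machine-a:7082 --zero=machine-a:5080 -o 2

# On machine b
dgraph alpha --my=machine-b:7080 --zero=machine-a:5080 -o 0
dgraph alpha --my=machine-b:7081 --zero=machine-a:5080 -o 1
dgraph alpha --my=machine-b:7082 --zero=machine-a:5080 -o 2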
Both machines are AWS EC2 c5.2xlarge instances. I am inserting data with the following schema -
entity_type: string .
entity_value: string @index(exact) @count .
phone: [uid] @count @reverse .
device: [uid] @count @reverse .
I am trying Dgraph for a use case where I create links between different users (entity_values) based on shared phones/devices. The number of links per user varies widely, anywhere from 10 to 500. Data is inserted through Kafka.

Everything worked fine when I started: all the predicates were initially in the first group of Alphas. After some time, Dgraph initiated rebalancing and moved the entity_type predicate to group 2, and everything was still fine then as well. But when Dgraph moved the phone predicate to group 2, the node crashed a minute or so later, and now both reads and writes have stopped. Only about half a million nodes had been inserted by that point.
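For context, each record from Kafka becomes a mutation roughly like the one below (a simplified sketch with placeholder blank nodes and values, not my exact payload); the link between two users comes from both of them pointing at the same phone/device node, and in practice existing phone/device nodes are looked up by entity_value first (hence the @index(exact)):

{
  set {
    # simplified example - blank nodes and values are placeholders
    _:user1 <entity_type> "user" .
    _:user1 <entity_value> "user_123" .
    _:user2 <entity_type> "user" .
    _:user2 <entity_value> "user_456" .
    _:ph <entity_type> "phone" .
    _:ph <entity_value> "9999999999" .
    _:user1 <phone> _:ph .
    _:user2 <phone> _:ph .
  }
}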
Here are some of the Zero logs -
Groups sorted by size: [{gid:1 size:0} {gid:2 size:0}]
I0818 07:00:45.461448 15 tablet.go:213] size_diff 0
I0818 07:08:45.461439 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:0} {gid:1 size:0}]
I0818 07:08:45.461462 15 tablet.go:213] size_diff 0
I0818 07:16:45.461506 15 tablet.go:208]
Groups sorted by size: [{gid:1 size:0} {gid:2 size:0}]
I0818 07:16:45.461532 15 tablet.go:213] size_diff 0
I0818 07:24:45.461429 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:0} {gid:1 size:49693467594}]
I0818 07:24:45.461454 15 tablet.go:213] size_diff 49693467594
I0818 07:24:45.461639 15 tablet.go:108] Going to move predicate: [entity_type], size: [22 GB] from group 1 to 2
I0818 07:24:45.461740 15 tablet.go:135] Starting move: predicate:"entity_type" source_gid:1 dest_gid:2 txn_ts:911263
I0818 07:24:47.536635 15 tablet.go:150] Move at Alpha done. Now proposing: tablet:<group_id:2 predicate:"entity_type" force:true space:22067073752 move_ts:911263 >
I0818 07:24:47.536874 15 tablet.go:156] Predicate move done for: [entity_type] from group 1 to 2
I0818 07:32:45.461442 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 07:32:45.461498 15 tablet.go:213] size_diff 20813563481
I0818 07:40:45.461432 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 07:40:45.461466 15 tablet.go:213] size_diff 20813563481
I0818 07:48:45.461428 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 07:48:45.461459 15 tablet.go:213] size_diff 20813563481
I0818 07:56:45.461443 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 07:56:45.461478 15 tablet.go:213] size_diff 20813563481
I0818 08:04:45.461445 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 08:04:45.461480 15 tablet.go:213] size_diff 20813563481
I0818 08:12:45.461462 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 08:12:45.462072 15 tablet.go:213] size_diff 20813563481
I0818 08:20:45.461425 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 08:20:45.461461 15 tablet.go:213] size_diff 20813563481
I0818 08:28:45.461469 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 08:28:45.461511 15 tablet.go:213] size_diff 20813563481
I0818 08:36:45.461452 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 08:36:45.461474 15 tablet.go:213] size_diff 20813563481
I0818 08:44:45.461428 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 08:44:45.461461 15 tablet.go:213] size_diff 20813563481
I0818 08:52:45.461413 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]
I0818 08:52:45.461435 15 tablet.go:213] size_diff 20813563481
I0818 09:00:45.461431 15 tablet.go:208]
Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:59877864994}]
I0818 09:00:45.461466 15 tablet.go:213] size_diff 37810791242
I0818 09:00:45.461658 15 tablet.go:108] Going to move predicate: [phone], size: [12 GB] from group 1 to 2
I0818 09:00:45.461742 15 tablet.go:135] Starting move: predicate:"phone" source_gid:1 dest_gid:2 txn_ts:1380036
I0818 09:00:53.291965 15 tablet.go:150] Move at Alpha done. Now proposing: tablet:<group_id:2 predicate:"phone" force:true space:12327615775 move_ts:1380036 >
I0818 09:00:53.292342 15 tablet.go:156] Predicate move done for: [phone] from group 1 to 2
I0818 09:07:28.926155 15 zero.go:417] Got connection request: cluster_info_only:true
I0818 09:07:28.926358 15 zero.go:435] Connected: cluster_info_only:true
I0818 09:08:45.461446 15 tablet.go:208]
Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]
I0818 09:08:45.461467 15 tablet.go:213] size_diff 23813934864
I0818 09:16:45.461453 15 tablet.go:208]
Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]
I0818 09:16:45.461498 15 tablet.go:213] size_diff 23813934864
I0818 09:24:45.461445 15 tablet.go:208]
Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]
I0818 09:24:45.461491 15 tablet.go:213] size_diff 23813934864
I0818 09:32:45.461436 15 tablet.go:208]
Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]
I0818 09:32:45.461487 15 tablet.go:213] size_diff 23813934864
I0818 09:40:45.461459 15 tablet.go:208]
Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]
I0818 09:40:45.461507 15 tablet.go:213] size_diff 23813934864
And here are the logs from one of the crashed Alphas -
I0818 08:50:58.293612 14 log.go:34] Compaction for level: 0 DONE
W0818 08:52:28.076360 14 draft.go:1183] Raft.Ready took too long to process: Timer Total: 227ms. Breakdown: [{disk 195ms} {proposals 31ms} {advance 0s}] Num entries: 0. MustSync: false
I0818 08:53:35.459517 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 08:53:35.459551 14 log.go:34] Running for level: 0
I0818 08:53:37.107415 14 log.go:34] LOG Compact 0->1, del 7 tables, add 2 tables, took 1.647849234s
I0818 08:53:37.107440 14 log.go:34] Compaction for level: 0 DONE
I0818 08:56:13.761489 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 08:56:13.761529 14 log.go:34] Running for level: 0
I0818 08:56:14.800691 14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.039142093s
I0818 08:56:14.800714 14 log.go:34] Compaction for level: 0 DONE
I0818 08:58:47.459494 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 08:58:47.459524 14 log.go:34] Running for level: 0
I0818 08:58:48.543593 14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.084055233s
I0818 08:58:48.543620 14 log.go:34] Compaction for level: 0 DONE
I0818 09:00:45.681933 14 index.go:1183] Dropping predicate: [phone]
I0818 09:00:45.681956 14 log.go:34] DropPrefix Called
I0818 09:00:45.681964 14 log.go:34] Writes flushed. Stopping compactions now...
I0818 09:00:45.909719 14 log.go:34] Got compaction priority: {level:0 score:1.74 dropPrefix:[0 0 5 112 104 111 110 101]}
I0818 09:00:45.909766 14 log.go:34] Running for level: 0
I0818 09:00:46.981479 14 log.go:34] LOG Compact 0->1, del 3 tables, add 1 tables, took 1.071693499s
I0818 09:00:46.981511 14 log.go:34] Compaction for level: 0 DONE
I0818 09:00:46.981536 14 log.go:34] DropPrefix done
I0818 09:00:46.981545 14 log.go:34] Resuming writes
I0818 09:00:46.981556 14 schema.go:103] Deleting schema for predicate: [phone]
W0818 09:00:48.403542 14 draft.go:1183] Raft.Ready took too long to process: Timer Total: 479ms. Breakdown: [{proposals 478ms} {disk 0s} {advance 0s}] Num entries: 0. MustSync: false
I0818 09:00:48.414585 14 predicate_move.go:46] Writing 381906 keys
I0818 09:00:50.246610 14 schema.go:452] Setting schema for attr phone: [uid], tokenizer: [], directive: REVERSE, count: true
I0818 09:00:52.886656 14 log.go:34] 6 is starting a new election at term 2
I0818 09:00:52.886671 14 log.go:34] 6 became pre-candidate at term 2
I0818 09:00:52.886674 14 log.go:34] 6 received MsgPreVoteResp from 6 at term 2
I0818 09:00:52.886936 14 log.go:34] 6 [logterm: 2, index: 1848399] sent MsgPreVote request to 4 at term 2
I0818 09:00:52.887441 14 log.go:34] 6 [logterm: 2, index: 1848399] sent MsgPreVote request to 5 at term 2
I0818 09:00:52.887652 14 log.go:34] raft.node: 6 lost leader 5 at term 2
I0818 09:00:53.478101 14 log.go:34] 6 became follower at term 2
I0818 09:00:53.501749 14 log.go:34] raft.node: 6 elected leader 5 at term 2
I0818 09:00:57.097364 14 log.go:34] 6 [logterm: 2, index: 1848399, vote: 0] cast MsgPreVote for 4 [logterm: 2, index: 1848399] at term 2
I0818 09:00:58.084532 14 predicate_move.go:46] Writing 78460 keys
I0818 09:00:58.719426 14 schema.go:452] Setting schema for attr phone: [uid], tokenizer: [], directive: REVERSE, count: true
I0818 09:00:58.828682 14 log.go:34] 6 is starting a new election at term 2
I0818 09:00:58.828697 14 log.go:34] 6 became pre-candidate at term 2
I0818 09:00:58.828701 14 log.go:34] 6 received MsgPreVoteResp from 6 at term 2
I0818 09:00:58.829111 14 log.go:34] 6 [logterm: 2, index: 1848399] sent MsgPreVote request to 5 at term 2
I0818 09:00:58.829546 14 log.go:34] 6 [logterm: 2, index: 1848399] sent MsgPreVote request to 4 at term 2
I0818 09:00:58.829757 14 log.go:34] raft.node: 6 lost leader 5 at term 2
I0818 09:00:58.829773 14 log.go:34] 6 became follower at term 2
I0818 09:00:58.829777 14 log.go:34] raft.node: 6 elected leader 5 at term 2
W0818 09:00:58.829902 14 draft.go:1183] Raft.Ready took too long to process: Timer Total: 791ms. Breakdown: [{advance 747ms} {proposals 43ms} {disk 0s}] Num entries: 0. MustSync: false
I0818 09:01:36.459544 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:01:36.459594 14 log.go:34] Running for level: 0
I0818 09:01:37.577325 14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.117714397s
I0818 09:01:37.577352 14 log.go:34] Compaction for level: 0 DONE
I0818 09:04:14.459463 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:04:14.459496 14 log.go:34] Running for level: 0
I0818 09:04:15.666420 14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.206907277s
I0818 09:04:15.666445 14 log.go:34] Compaction for level: 0 DONE
I0818 09:06:58.459514 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:06:58.459544 14 log.go:34] Running for level: 0
I0818 09:06:59.736871 14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.277311873s
I0818 09:06:59.736906 14 log.go:34] Compaction for level: 0 DONE
I0818 09:07:28.239184 14 log.go:34] 6 is starting a new election at term 2
I0818 09:07:28.239367 14 log.go:34] 6 became pre-candidate at term 2
I0818 09:07:28.239454 14 log.go:34] 6 received MsgPreVoteResp from 6 at term 2
I0818 09:07:28.241978 14 log.go:34] 6 [logterm: 2, index: 1909465] sent MsgPreVote request to 5 at term 2
I0818 09:07:28.243636 14 log.go:34] 6 [logterm: 2, index: 1909465] sent MsgPreVote request to 4 at term 2
I0818 09:07:28.246158 14 log.go:34] raft.node: 6 lost leader 5 at term 2
I0818 09:07:29.041093 14 log.go:34] 6 received MsgPreVoteResp from 5 at term 2
I0818 09:07:29.041114 14 log.go:34] 6 [quorum:2] has received 2 MsgPreVoteResp votes and 0 vote rejections
I0818 09:07:29.041125 14 log.go:34] 6 became candidate at term 3
I0818 09:07:29.041129 14 log.go:34] 6 received MsgVoteResp from 6 at term 3
I0818 09:07:29.041896 14 log.go:34] 6 [logterm: 2, index: 1909465] sent MsgVote request to 5 at term 3
I0818 09:07:29.042170 14 log.go:34] 6 [logterm: 2, index: 1909465] sent MsgVote request to 4 at term 3
I0818 09:07:29.043823 14 log.go:34] 6 received MsgVoteResp rejection from 5 at term 3
I0818 09:07:29.043842 14 log.go:34] 6 [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections
I0818 09:07:30.248547 14 log.go:34] 6 [term: 3] ignored a MsgPreVoteResp message with lower term from 4 [term: 2]
I0818 09:07:30.248593 14 log.go:34] 6 received MsgVoteResp rejection from 4 at term 3
I0818 09:07:30.248599 14 log.go:34] 6 [quorum:2] has received 1 MsgVoteResp votes and 2 vote rejections
I0818 09:07:30.248609 14 log.go:34] 6 became follower at term 3
I0818 09:07:32.758567 14 log.go:34] 6 [logterm: 2, index: 1909465, vote: 6] cast MsgPreVote for 5 [logterm: 2, index: 1909466] at term 3
I0818 09:07:32.759571 14 log.go:34] 6 [term: 3] received a MsgVote message with higher term from 5 [term: 4]
I0818 09:07:32.759733 14 log.go:34] 6 became follower at term 4
I0818 09:07:32.760040 14 log.go:34] 6 [logterm: 2, index: 1909465, vote: 0] cast MsgVote for 5 [logterm: 2, index: 1909466] at term 4
I0818 09:07:32.760681 14 log.go:34] raft.node: 6 elected leader 5 at term 4
I0818 09:09:38.459441 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:09:38.459475 14 log.go:34] Running for level: 0
I0818 09:09:40.010848 14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.551353146s
I0818 09:09:40.010876 14 log.go:34] Compaction for level: 0 DONE
I0818 09:12:17.459512 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:12:17.459545 14 log.go:34] Running for level: 0
I0818 09:12:18.799571 14 log.go:34] LOG Compact 0->1, del 6 tables, add 2 tables, took 1.340010232s
I0818 09:12:18.799607 14 log.go:34] Compaction for level: 0 DONE
W0818 09:14:21.846563 14 draft.go:1183] Raft.Ready took too long to process: Timer Total: 222ms. Breakdown: [{advance 148ms} {disk 71ms} {proposals 3ms}] Num entries: 0. MustSync: false
The logs look fairly normal to me. What could be going on here? How can I bring the crashed node back without losing data? And how should I proceed after crashes like this?