I have three Alpha groups in my Dgraph cluster, and I found that the leader node of one of them had become corrupted.
I shut down that node, wiped its data, and rejoined it to the cluster after the next round of leader election had completed.
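Roughly, the remove-and-rejoin procedure looked like the sketch below (the node ID, group, data paths, and Zero's HTTP port are from my setup and only illustrative):

# On the Zero leader (assuming its default HTTP port 6080): drop the corrupted member from group 1
curl "http://192.168.3.12:6080/removeNode?group=1&id=1"
# On the corrupted host: stop the Alpha (however the process is managed) and wipe its data
systemctl stop dgraph-alpha
rm -rf /dgraph/p /dgraph/w
# Start the Alpha again so it joins group 1 as a fresh, empty replica
dgraph alpha --my=192.168.3.12:7085 --zero=192.168.3.9:5080 -p /dgraph/p -w /dgraph/w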
After the node started, its logs showed it repeatedly trying to restore a snapshot from itself in an infinite loop, and it never came up properly:
I0507 05:20:52.998856 17 pool.go:162] CONNECTING to 192.168.3.12:7085,
W0507 05:20:52.999085 17 pool.go:204] Shutting down extra connection to 192.168.3.12:7085,
W0507 05:20:53.013977 17 pool.go:267] Connection lost with 192.168.3.12:7085. Error: rpc error: code = Canceled desc = context canceled,
I0507 05:20:53.528208 17 log.go:34] 11 [term: 1] received a MsgHeartbeat message with higher term from d [term: 10],
I0507 05:20:53.528253 17 log.go:34] raft.node: 11 elected leader d at term 10,
I0507 05:20:54.529615 17 log.go:34] 11 [commit: 0, lastindex: 0, lastterm: 0] starts to restore snapshot [index: 228689713, term: 9],
I0507 05:20:53.528241 17 log.go:34] 11 became follower at term 10,
I0507 05:20:54.529637 17 log.go:34] log [committed=0, applied=0, unstable.offset=1, len(unstable.Entries)=0] starts to restore snapshot [index: 228689713, term: 9],
I0507 05:20:54.529680 17 log.go:34] 11 restored progress of d [next = 228689714, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0],
I0507 05:20:54.529675 17 log.go:34] 11 restored progress of c [next = 228689714, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0],
I0507 05:20:54.529664 17 log.go:34] 11 restored progress of 1 [next = 228689714, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0],
I0507 05:20:54.529685 17 log.go:34] 11 [commit: 228689713] restored snapshot [index: 228689713, term: 9],
I0507 05:20:54.529723 17 draft.go:1151] Drain applyCh by reaching 0 before retrieving snapshot,
I0507 05:20:54.529732 17 draft.go:1048] Drained 0 proposals,
I0507 05:20:54.529748 17 draft.go:1169] ---> SNAPSHOT: {Context:id:1 group:1 addr:"192.168.3.12:7085" Index:228689713 ReadTs:750030278 Done:false SinceTs:0 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}. Group 1 from node id 0x1,
I0507 05:20:54.529858 17 draft.go:180] Operation started with id: opSnapshot,
I0507 05:20:54.529937 17 draft.go:124] Operation completed with id: opRollup,
I0507 05:20:54.529957 17 log.go:34] DropAll called. Blocking writes...,
I0507 05:20:54.529972 17 log.go:34] Writes flushed. Stopping compactions now...,
I0507 05:20:54.532769 17 log.go:34] Deleted 0 SSTables. Now deleting value logs...,
I0507 05:20:54.533062 17 log.go:34] Deleted 1 value log files. DropAll done.,
I0507 05:20:54.532942 17 log.go:34] Value logs deleted. Creating value log file: 1,
I0507 05:20:56.949787 17 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.,
I0507 05:20:57.061648 17 log.go:34] Resuming writes,
E0507 05:20:57.061766 17 draft.go:1178] While retrieving snapshot, error: cannot retrieve snapshot from peer: rpc error: code = Unknown desc = operation opSnapshot is already running. Retrying...
After rebooting the node again, it printed the following logs:
I0507 04:57:57.737019 18 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.,
I0507 04:57:22.853193 18 groups.go:838] Received first state update from Zero: counter:348627489 groups:<key:1 value:<members:<key:12 value:<id:12 group_id:1 addr:"192.168.3.11:7085" > > members:<key:13 value:<id:13 group_id:1 addr:"192.168.3.9:7085" leader:true last_update:1620362238 > > members:<key:16 value:<id:16 group_id:1 addr:"192.168.3.12:7085" > > tablets:<key:"Company.invest" value:<group_id:1 predicate:"Company.invest" on_disk_bytes:138243336 uncompressed_bytes:333793149 > > tablets:<key:"Company.invested" value:<group_id:1 predicate:"Company.invested" on_disk_bytes:282273860 uncompressed_bytes:732290986 > > tablets:<key:"Company.name" value:<group_id:1 predicate:"Company.name" on_disk_bytes:10847047160 uncompressed_bytes:21461890783 > > tablets:<key:"Company.partner" value:<group_id:1 predicate:"Company.partner" on_disk_bytes:191268 uncompressed_bytes:414448 > > tablets:<key:"Company.province" value:<group_id:1 predicate:"Company.province" on_disk_bytes:3454631328 uncompressed_bytes:13143947348 > > tablets:<key:"Company.status" value:<group_id:1 predicate:"Company.status" on_disk_bytes:3632408968 move_ts:158085785 uncompressed_bytes:15218552217 > > tablets:<key:"Company.wasPartner" value:<group_id:1 predicate:"Company.wasPartner" on_disk_bytes:1217919 uncompressed_bytes:2676903 > > tablets:<key:"Company.wasShareholder" value:<group_id:1 predicate:"Company.wasShareholder" on_disk_bytes:5518536484 uncompressed_bytes:11594288399 > > tablets:<key:"Person.manager" value:<group_id:1 predicate:"Person.manager" on_disk_bytes:10143516597 uncompressed_bytes:23933423087 > > tablets:<key:"Person.name" value:<group_id:1 predicate:"Person.name" on_disk_bytes:14040226627 uncompressed_bytes:31936846416 > > tablets:<key:"Person.shareholder" value:<group_id:1 predicate:"Person.shareholder" on_disk_bytes:7352679342 uncompressed_bytes:18451745525 > > tablets:<key:"dgraph.acl.rule" value:<group_id:1 predicate:"dgraph.acl.rule" > > tablets:<key:"dgraph.cors" value:<group_id:1 predicate:"dgraph.cors" > > tablets:<key:"dgraph.drop.op" value:<group_id:1 predicate:"dgraph.drop.op" > > tablets:<key:"dgraph.graphql.p_query" value:<group_id:1 predicate:"dgraph.graphql.p_query" > > tablets:<key:"dgraph.graphql.p_sha256hash" value:<group_id:1 predicate:"dgraph.graphql.p_sha256hash" > > tablets:<key:"dgraph.graphql.schema" value:<group_id:1 predicate:"dgraph.graphql.schema" on_disk_bytes:893 uncompressed_bytes:1353 > > tablets:<key:"dgraph.graphql.schema_created_at" value:<group_id:1 predicate:"dgraph.graphql.schema_created_at" > > tablets:<key:"dgraph.graphql.schema_history" value:<group_id:1 predicate:"dgraph.graphql.schema_history" > > tablets:<key:"dgraph.graphql.xid" value:<group_id:1 predicate:"dgraph.graphql.xid" on_disk_bytes:293 uncompressed_bytes:194 > > tablets:<key:"dgraph.password" value:<group_id:1 predicate:"dgraph.password" > > tablets:<key:"dgraph.rule.permission" value:<group_id:1 predicate:"dgraph.rule.permission" > > tablets:<key:"dgraph.rule.predicate" value:<group_id:1 predicate:"dgraph.rule.predicate" > > tablets:<key:"dgraph.type" value:<group_id:1 predicate:"dgraph.type" on_disk_bytes:10034128152 uncompressed_bytes:39512178498 > > tablets:<key:"dgraph.user.group" value:<group_id:1 predicate:"dgraph.user.group" > > tablets:<key:"dgraph.xid" value:<group_id:1 predicate:"dgraph.xid" > > snapshot_ts:750030278 checksum:8443503671001154758 > > groups:<key:2 value:<members:<key:4 value:<id:4 group_id:2 addr:"192.168.3.12:7086" leader:true last_update:1620350750 > > members:<key:5 value:<id:5 
group_id:2 addr:"192.168.3.11:7086" last_update:1614565210 > > members:<key:11 value:<id:11 group_id:2 addr:"192.168.3.9:7086" > > tablets:<key:"Company.wasManager" value:<group_id:2 predicate:"Company.wasManager" on_disk_bytes:8033516939 uncompressed_bytes:16687294035 > > tablets:<key:"Person.ownId" value:<group_id:2 predicate:"Person.ownId" on_disk_bytes:64700203072 uncompressed_bytes:80804274859 > > snapshot_ts:750030147 checksum:15550142297788716054 > > groups:<key:3 value:<members:<key:6 value:<id:6 group_id:3 addr:"192.168.3.12:7087" last_update:1614266507 > > members:<key:9 value:<id:9 group_id:3 addr:"192.168.3.9:7087" leader:true last_update:1614497767 > > members:<key:10 value:<id:10 group_id:3 addr:"192.168.3.11:7087" > > tablets:<key:"Company.wasLegal" value:<group_id:3 predicate:"Company.wasLegal" on_disk_bytes:7288173031 uncompressed_bytes:24292362318 > > tablets:<key:"Person.avatar" value:<group_id:3 predicate:"Person.avatar" on_disk_bytes:1042260269 uncompressed_bytes:1425349675 > > tablets:<key:"Person.introduction" value:<group_id:3 predicate:"Person.introduction" on_disk_bytes:1015856271 uncompressed_bytes:1385327083 > > tablets:<key:"Person.legal" value:<group_id:3 predicate:"Person.legal" on_disk_bytes:7956784737 uncompressed_bytes:24740111761 > > tablets:<key:"Person.pid" value:<group_id:3 predicate:"Person.pid" on_disk_bytes:677182082 uncompressed_bytes:1374559249 > > tablets:<key:"RlNode.rlid" value:<group_id:3 predicate:"RlNode.rlid" on_disk_bytes:47827793563 uncompressed_bytes:85103902097 > > snapshot_ts:750024824 checksum:718658416228190902 > > zeros:<key:1 value:<id:1 addr:"192.168.3.9:5080" > > zeros:<key:2 value:<id:2 addr:"192.168.3.11:5080" > > zeros:<key:3 value:<id:3 addr:"192.168.3.12:5080" leader:true > > maxLeaseId:557740000 maxTxnTs:750040000 maxRaftId:16 removed:<id:8 group_id:3 addr:"192.168.3.11:7087" last_update:1614497766 > removed:<id:7 group_id:2 addr:"192.168.3.9:7086" last_update:1619970799 > removed:<id:2 group_id:1 addr:"192.168.3.11:7085" last_update:1615292408 > removed:<id:3 group_id:1 addr:"192.168.3.9:7085" last_update:1618375477 > removed:<id:1 group_id:1 addr:"192.168.3.12:7085" last_update:1618375479 > removed:<id:14 group_id:1 addr:"192.168.3.12:7085" > removed:<id:15 group_id:1 addr:"192.168.3.12:7085" > cid:"ac2c3dcc-108d-4a2a-a763-51fb3f2c2dd6" license:<maxNodes:18446744073709551615 expiryTs:1616812311 > ,
I0507 04:57:23.828685 18 log.go:34] 10 [term: 0] received a MsgHeartbeat message with higher term from d [term: 10],
I0507 04:57:23.828739 18 log.go:34] 10 became follower at term 10,
I0507 04:57:23.828750 18 log.go:34] raft.node: 10 elected leader d at term 10,
I0507 04:57:27.735978 18 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.,
I0507 04:57:32.736129 18 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.,
I0507 04:57:37.736330 18 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.,
I0507 04:57:47.736611 18 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.,
I0507 04:57:52.736805 18 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.,
I0507 04:57:42.736481 18 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
What I want to do
Get the cluster to start properly again.
What I did
- Retried adding the node to the cluster
- Restarted the node
Dgraph metadata
dgraph version
jemalloc enabled : true
Go version : go1.15.5
Branch : HEAD
Commit timestamp : 2021-02-23 13:07:17 +0530
Commit SHA-1 : 94f3a0430
Dgraph SHA-256 : 0153cb8d3941ad5ad107e395b347e8d930a0b4ead6f4524521f7a525a9699167
Dgraph codename : tchalla-2
Dgraph version : v20.11.2
Update 1
I'm sure this is a bug. After changing the node's port, the cluster now works fine.
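For reference, the workaround was simply restarting the affected Alpha so it advertises a different internal port; something like the following (the offset and data paths are illustrative for my setup):

# -o shifts Alpha's default ports (7080/8080/9080) by 8, so the internal port becomes 7088,
# and --my advertises that new address to Zero and the rest of group 1
dgraph alpha -o 8 --my=192.168.3.12:7088 --zero=192.168.3.9:5080 -p /dgraph/p -w /dgraph/w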