What I want to do
In K8s, I am verifying the high availability of the cluster by removing the zero node. However, after the replacement node started, it did not use a new --idx, and --peer pointed to itself, so it failed to join the cluster. Looking at the existing configuration file, it does not appear to support setting a new --idx or pointing --peer at the new leader. Is there a solution for this?
https://console.cloud.baidu-int.com/devops/icode/repos/baidu/crm/bizcrm-devops-helm-charts/blob/master:vendor/dgraph/dgraph-ha_online.yaml
What I did
I first removed the zero master node, then deleted the corresponding PVC and PV, and finally created a new PVC and restarted the pod.
Dgraph metadata
dgraph version
Dgraph version : v20.11.2
Dgraph codename : tchalla-2
Commit timestamp : 2021-02-23 13:07:17 +0530
Branch : HEAD
Go version : go1.15.5
jemalloc enabled : true
At the same time, one of the three alpha nodes hit a service exception, which caused queries to fail. The log is as follows.
W0331 12:00:24.425274 20 pool.go:130] DISCONNECTING from dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
W0331 12:00:24.425284 20 pool.go:204] Shutting down extra connection to dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0331 12:00:24.426411 20 node.go:189] Setting conf state to nodes:2 nodes:6
I0331 12:00:25.423951 20 groups.go:159] Server is ready
I0331 12:00:25.423971 20 access_ee.go:390] ResetAcl closed
I0331 12:00:25.423976 20 access_ee.go:311] RefreshAcls closed
I0331 12:01:25.424175 20 graphql.go:75] Unable to upsert cors. Error: : context deadline exceeded
I0331 12:01:25.524283 20 graphql.go:41] ResetCors closed
After the unavailable alpha-0 node described above restarted, it used the new zero-0 node as the primary node because of the configuration, which caused it to fail to join the cluster. Does this configuration support automatically selecting the current primary node?
alpha-0
alpha-0 startup log:
Copyright 2015-2020 Dgraph Labs, Inc.
I0407 14:02:28.328429 22 run.go:696] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000 MutationsNQuadLimit:1000000 PollInterval:1s GraphqlExtension:true GraphqlDebug:false GraphqlLambdaUrl:}
I0407 14:02:28.328473 22 run.go:697] x.WorkerConfig: {TmpDir:t ExportPath:export NumPendingProposals:256 Tracing:0.01 MyAddr:dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080 ZeroAddr:[dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080] TLSClientConfig:<nil> TLSServerConfig:<nil> RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2021-04-07 14:02:27.714787499 +0000 UTC m=+0.014810086 LudicrousMode:false LudicrousConcurrency:2000 EncryptionKey:**** LogRequest:0 HardSync:false}
I0407 14:02:28.328528 22 run.go:698] worker.Config: {PostingDir:p PostingDirCompression:1 PostingDirCompressionLevel:0 WALDir:w MutationsMode:0 AuthToken: PBlockCacheSize:697932185 PIndexCacheSize:375809638 WalCache:0 HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s CachePercentage:0,65,35,0 CacheMb:0}
I0407 14:02:28.328687 22 log.go:295] Found file: 224 First Index: 297561
I0407 14:02:28.328720 22 log.go:295] Found file: 225 First Index: 327561
I0407 14:02:28.328790 22 storage.go:132] Init Raft Storage with snap: 327429, first: 327430, last: 328842
I0407 14:02:28.328805 22 server_state.go:76] Setting Posting Dir Compression Level: 0
I0407 14:02:28.328816 22 server_state.go:120] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x2e0fef8 Compression:1 InMemory:false MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0407 14:02:28.365759 22 log.go:34] All 53 tables opened in 32ms
I0407 14:02:28.366133 22 log.go:34] Discard stats nextEmptySlot: 9
I0407 14:02:28.366194 22 log.go:34] Set nextTxnTs to 380006
I0407 14:02:28.367083 22 groups.go:99] Current Raft Id: 0x1
E0407 14:02:28.367091 22 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\vdgraph.cors\x00": Unable to find any servers for group: 1. closer err: <nil>
I0407 14:02:28.367205 22 worker.go:104] Worker listening at address: [::]:7080
E0407 14:02:28.368459 22 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0407 14:02:28.368495 22 run.go:519] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0407 14:02:28.368512 22 run.go:520] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0407 14:02:28.368536 22 run.go:552] gRPC server started. Listening on port 9080
I0407 14:02:28.368548 22 run.go:553] HTTP server started. Listening on port 8080
I0407 14:02:28.467270 22 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0407 14:02:28.471466 22 groups.go:127] Connected to group zero. Assigned group: 1
I0407 14:02:28.471480 22 groups.go:129] Raft Id after connection to Zero: 0x1
I0407 14:02:28.471503 22 draft.go:230] Node ID: 0x1 with GroupID: 1
I0407 14:02:28.471556 22 node.go:152] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc0001120a0 Applied:327429 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2e0fef8 DisableProposalForwarding:false}
I0407 14:02:28.483161 22 node.go:310] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 6] Learners:[] XXX_unrecognized:[]} Index:327429 Term:27 XXX_unrecognized:[]}
I0407 14:02:28.483184 22 node.go:321] Found hardstate: {Term:27 Vote:6 Commit:328842 XXX_unrecognized:[]}
I0407 14:02:28.493713 22 node.go:326] Group 1 found 31282 entries
I0407 14:02:28.493739 22 draft.go:1689] Restarting node for group: 1
I0407 14:02:28.494192 22 node.go:189] Setting conf state to nodes:1 nodes:2 nodes:6
I0407 14:02:28.494267 22 log.go:34] 1 became follower at term 27
I0407 14:02:28.494279 22 log.go:34] newRaft 1 [peers: [1,2,6], term: 27, commit: 328842, applied: 327429, lastindex: 328842, lastterm: 27]
I0407 14:02:28.494316 22 draft.go:180] Operation started with id: opRollup
I0407 14:02:28.494382 22 draft.go:1084] Found Raft progress: 328841
I0407 14:02:28.494407 22 groups.go:807] Got address of a Zero leader: dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0407 14:02:28.494546 22 groups.go:821] Starting a new membership stream receive from dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080.
I0407 14:02:28.495096 22 groups.go:838] Received first state update from Zero: counter:6 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080" > > > > zeros:<key:1 value:<id:1 addr:"dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080" leader:true > > maxRaftId:1 cid:"1e6ff595-d4e5-4eef-b34a-e9a5deccc286" license:<maxNodes:18446744073709551615 expiryTs:1619785753 enabled:true >
I0407 14:02:28.496849 22 node.go:189] Setting conf state to nodes:2 nodes:6
I0407 14:02:29.367930 22 pool.go:162] CONNECTING to dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080
I0407 14:02:29.495391 22 groups.go:487] Serving tablet for: dgraph.graphql.xid
I0407 14:02:29.495976 22 groups.go:487] Serving tablet for: type
I0407 14:02:29.496413 22 groups.go:487] Serving tablet for: dgraph.cors
I0407 14:02:29.496797 22 groups.go:487] Serving tablet for: source_type
I0407 14:02:29.497184 22 groups.go:487] Serving tablet for: dgraph.type
I0407 14:02:29.497629 22 groups.go:487] Serving tablet for: source
I0407 14:02:29.497956 22 groups.go:487] Serving tablet for: primary
I0407 14:02:29.498414 22 groups.go:487] Serving tablet for: model_type
I0407 14:02:29.498816 22 groups.go:487] Serving tablet for: dgraph.graphql.schema_history
I0407 14:02:29.499188 22 groups.go:487] Serving tablet for: dgraph.graphql.schema_created_at
I0407 14:02:29.499554 22 groups.go:487] Serving tablet for: source_id
I0407 14:02:29.499918 22 groups.go:487] Serving tablet for: dgraph.graphql.schema
I0407 14:02:29.500289 22 groups.go:487] Serving tablet for: dgraph.graphql.p_sha256hash
I0407 14:02:29.500648 22 groups.go:487] Serving tablet for: create_time
I0407 14:02:29.500990 22 groups.go:487] Serving tablet for: identity_id
I0407 14:02:29.501364 22 groups.go:487] Serving tablet for: dgraph.drop.op
I0407 14:02:29.501726 22 groups.go:487] Serving tablet for: account_relation
I0407 14:02:29.502071 22 groups.go:487] Serving tablet for: dgraph.graphql.p_query
I0407 14:02:29.502520 22 groups.go:487] Serving tablet for: relation
I0407 14:02:29.502871 22 groups.go:487] Serving tablet for: namespace
I0407 14:02:29.503226 22 groups.go:487] Serving tablet for: tenant_id
I0407 14:02:29.503300 22 groups.go:159] Server is ready
I0407 14:02:29.503310 22 access_ee.go:390] ResetAcl closed
I0407 14:02:29.503315 22 access_ee.go:311] RefreshAcls closed
I0407 14:02:33.370544 22 admin.go:697] No GraphQL schema in Dgraph; serving empty GraphQL API
I0407 14:03:29.503654 22 graphql.go:75] Unable to upsert cors. Error: While proposing: context deadline exceeded
I0407 14:03:29.603740 22 graphql.go:41] ResetCors closed
E0407 14:03:29.842789 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:31.068010 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:32.293097 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:33.517424 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:34.741001 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:35.965153 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:37.190705 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:38.413894 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:39.641258 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:40.867069 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:42.090316 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:43.315024 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:44.551644 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:45.775282 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:46.998709 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:48.221983 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:49.458350 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:50.684947 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:51.909719 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:53.132926 22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
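The alpha-0 log above prints the Raft membership of group 1 twice: nodes 1, 2 and 6 on restart, then nodes 2 and 6 right after the first state update from Zero. A minimal sketch for pulling those membership lists out of a log dump, assuming the "Setting conf state to nodes:N ..." line format shown above; the function name `conf_states` is hypothetical and only used for illustration.

```python
import re
from typing import List

# Matches the node.go "Setting conf state to nodes:1 nodes:2 ..." lines
# as they appear in the pasted log (format assumed from this thread only).
CONF_STATE_RE = re.compile(r"Setting conf state to ((?:nodes:\d+\s*)+)")


def conf_states(log_text: str) -> List[List[int]]:
    """Return every Raft conf state found in the log, in order of appearance."""
    states = []
    for match in CONF_STATE_RE.finditer(log_text):
        ids = [int(n) for n in re.findall(r"nodes:(\d+)", match.group(1))]
        states.append(ids)
    return states


sample = (
    "I0407 14:02:28.494192 22 node.go:189] Setting conf state to nodes:1 nodes:2 nodes:6\n"
    "I0407 14:02:28.494267 22 log.go:34] 1 became follower at term 27\n"
    "I0407 14:02:28.496849 22 node.go:189] Setting conf state to nodes:2 nodes:6\n"
)
print(conf_states(sample))  # → [[1, 2, 6], [2, 6]]
```

Comparing consecutive entries makes it easy to spot the moment a node's view of the group membership changes.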
dmai
(Daniel Mai)
April 7, 2021, 3:35pm
As you’ve noted, when removing a Zero you’ll need to update the StatefulSet command so that the idx and the peer config are updated appropriately.
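A minimal sketch of what "updated appropriately" could look like, written as a small helper that assembles the Zero command line. It assumes the v20.11-style flags `--my`, `--idx`, `--peer` and `--replicas`; the helper name `zero_args`, the idx value 4, the replica count 3 and the choice of dgraph-zero-1 as the peer are illustrative assumptions, not values taken from the chart.

```python
from typing import List, Optional

ZERO_PORT = 5080  # Zero's internal port, as seen in the logs in this thread


def zero_args(my_host: str, idx: int, peer_host: Optional[str],
              replicas: int = 3) -> List[str]:
    """Assemble a `dgraph zero` command line (hypothetical helper).

    peer_host=None means this Zero bootstraps a new cluster; otherwise it
    should be the address of a Zero that is already a cluster member.
    """
    if peer_host == my_host:
        # The situation described in the question: --peer pointing at the
        # node itself, after which the node failed to join the cluster.
        raise ValueError("--peer must not point at the node itself")
    args = ["dgraph", "zero", f"--my={my_host}:{ZERO_PORT}",
            "--idx", str(idx), "--replicas", str(replicas)]
    if peer_host is not None:
        args += ["--peer", f"{peer_host}:{ZERO_PORT}"]
    return args


# Assumed example: the recreated zero-0 gets an unused idx and joins via zero-1.
host0 = "dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local"
host1 = "dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local"
print(" ".join(zero_args(host0, idx=4, peer_host=host1)))
```

The point of the sketch is only that the idx and the peer are chosen per situation rather than derived from the pod ordinal alone.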
dmai:
StatefulSet
Yes, but the startup command is in the K8s configuration file. Changing the StatefulSet command causes all pods to restart. Is there a recommended way to automate the configuration for this situation?
dmai
(Daniel Mai)
April 8, 2021, 4:06am
If you don’t want the RollingUpdate behavior when a StatefulSet config changes, you can opt for updateStrategy: OnDelete, so that config updates do not restart the pods automatically; each pod picks up the new config only when you delete it yourself. See StatefulSets | Kubernetes.
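The update strategy is a field on the StatefulSet spec (`spec.updateStrategy.type`). A minimal sketch of switching it with a merge patch, built here as a `kubectl patch` argument list; the StatefulSet name `dgraph-zero` and the namespace `crm-test` are inferred from the pod hostnames in the logs, and both helper names are hypothetical.

```python
import json
from typing import List


def on_delete_patch() -> str:
    """Return a JSON merge patch that sets the StatefulSet update strategy to OnDelete."""
    return json.dumps({"spec": {"updateStrategy": {"type": "OnDelete"}}})


def kubectl_patch_command(statefulset: str, namespace: str) -> List[str]:
    """Build the kubectl invocation; nothing is executed here."""
    return ["kubectl", "patch", "statefulset", statefulset,
            "-n", namespace, "--type", "merge", "-p", on_delete_patch()]


# Names below are assumptions inferred from the hostnames in the logs.
print(" ".join(kubectl_patch_command("dgraph-zero", "crm-test")))
```

After such a patch, a pod gets a changed command only once it is deleted and recreated, for example with `kubectl delete pod dgraph-zero-0 -n crm-test`.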
First of all, I am sorry for only replying now. Secondly, I did successfully add the node to the cluster. Thank you very much for your guidance.