Remove zero node

What I want to do

In Kubernetes, I am verifying the high availability of the cluster by removing a Zero node. However, after the replacement Zero started, it did not use a new --idx, and its --peer pointed to itself, so it failed to join the cluster. Looking at the existing configuration file, I found that it does not support setting a new --idx or pointing --peer at the new leader. Is there a solution for this?
https://console.cloud.baidu-int.com/devops/icode/repos/baidu/crm/bizcrm-devops-helm-charts/blob/master:vendor/dgraph/dgraph-ha_online.yaml

What I did

I first removed the Zero leader node, then deleted the corresponding PVC and PV, and finally created a new PVC and restarted the pod.

Dgraph metadata

dgraph version
Dgraph version   : v20.11.2
Dgraph codename  : tchalla-2
Commit timestamp : 2021-02-23 13:07:17 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

At the same time, one of the three Alpha nodes hit a service exception, which caused queries to fail. Its log is as follows.

W0331 12:00:24.425274      20 pool.go:130] DISCONNECTING from dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
W0331 12:00:24.425284      20 pool.go:204] Shutting down extra connection to dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0331 12:00:24.426411      20 node.go:189] Setting conf state to nodes:2 nodes:6 
I0331 12:00:25.423951      20 groups.go:159] Server is ready
I0331 12:00:25.423971      20 access_ee.go:390] ResetAcl closed
I0331 12:00:25.423976      20 access_ee.go:311] RefreshAcls closed
I0331 12:01:25.424175      20 graphql.go:75] Unable to upsert cors. Error: : context deadline exceeded
I0331 12:01:25.524283      20 graphql.go:41] ResetCors closed

After the unavailable alpha-0 node mentioned above restarts, it uses the new zero-0 node as the primary node because of the configuration, which causes it to fail to join the cluster. Does this configuration support automatically selecting the current primary node?


alpha-0
alpha-0 startup log:

Copyright 2015-2020 Dgraph Labs, Inc.


I0407 14:02:28.328429      22 run.go:696] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000 MutationsNQuadLimit:1000000 PollInterval:1s GraphqlExtension:true GraphqlDebug:false GraphqlLambdaUrl:}
I0407 14:02:28.328473      22 run.go:697] x.WorkerConfig: {TmpDir:t ExportPath:export NumPendingProposals:256 Tracing:0.01 MyAddr:dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080 ZeroAddr:[dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080] TLSClientConfig:<nil> TLSServerConfig:<nil> RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2021-04-07 14:02:27.714787499 +0000 UTC m=+0.014810086 LudicrousMode:false LudicrousConcurrency:2000 EncryptionKey:**** LogRequest:0 HardSync:false}
I0407 14:02:28.328528      22 run.go:698] worker.Config: {PostingDir:p PostingDirCompression:1 PostingDirCompressionLevel:0 WALDir:w MutationsMode:0 AuthToken: PBlockCacheSize:697932185 PIndexCacheSize:375809638 WalCache:0 HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s CachePercentage:0,65,35,0 CacheMb:0}
I0407 14:02:28.328687      22 log.go:295] Found file: 224 First Index: 297561
I0407 14:02:28.328720      22 log.go:295] Found file: 225 First Index: 327561
I0407 14:02:28.328790      22 storage.go:132] Init Raft Storage with snap: 327429, first: 327430, last: 328842
I0407 14:02:28.328805      22 server_state.go:76] Setting Posting Dir Compression Level: 0
I0407 14:02:28.328816      22 server_state.go:120] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x2e0fef8 Compression:1 InMemory:false MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0407 14:02:28.365759      22 log.go:34] All 53 tables opened in 32ms
I0407 14:02:28.366133      22 log.go:34] Discard stats nextEmptySlot: 9
I0407 14:02:28.366194      22 log.go:34] Set nextTxnTs to 380006
I0407 14:02:28.367083      22 groups.go:99] Current Raft Id: 0x1
E0407 14:02:28.367091      22 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\vdgraph.cors\x00": Unable to find any servers for group: 1. closer err: <nil>
I0407 14:02:28.367205      22 worker.go:104] Worker listening at address: [::]:7080
E0407 14:02:28.368459      22 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0407 14:02:28.368495      22 run.go:519] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0407 14:02:28.368512      22 run.go:520] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0407 14:02:28.368536      22 run.go:552] gRPC server started.  Listening on port 9080
I0407 14:02:28.368548      22 run.go:553] HTTP server started.  Listening on port 8080
I0407 14:02:28.467270      22 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0407 14:02:28.471466      22 groups.go:127] Connected to group zero. Assigned group: 1
I0407 14:02:28.471480      22 groups.go:129] Raft Id after connection to Zero: 0x1
I0407 14:02:28.471503      22 draft.go:230] Node ID: 0x1 with GroupID: 1
I0407 14:02:28.471556      22 node.go:152] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc0001120a0 Applied:327429 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2e0fef8 DisableProposalForwarding:false}
I0407 14:02:28.483161      22 node.go:310] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 6] Learners:[] XXX_unrecognized:[]} Index:327429 Term:27 XXX_unrecognized:[]}
I0407 14:02:28.483184      22 node.go:321] Found hardstate: {Term:27 Vote:6 Commit:328842 XXX_unrecognized:[]}
I0407 14:02:28.493713      22 node.go:326] Group 1 found 31282 entries
I0407 14:02:28.493739      22 draft.go:1689] Restarting node for group: 1
I0407 14:02:28.494192      22 node.go:189] Setting conf state to nodes:1 nodes:2 nodes:6 
I0407 14:02:28.494267      22 log.go:34] 1 became follower at term 27
I0407 14:02:28.494279      22 log.go:34] newRaft 1 [peers: [1,2,6], term: 27, commit: 328842, applied: 327429, lastindex: 328842, lastterm: 27]
I0407 14:02:28.494316      22 draft.go:180] Operation started with id: opRollup
I0407 14:02:28.494382      22 draft.go:1084] Found Raft progress: 328841
I0407 14:02:28.494407      22 groups.go:807] Got address of a Zero leader: dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0407 14:02:28.494546      22 groups.go:821] Starting a new membership stream receive from dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080.
I0407 14:02:28.495096      22 groups.go:838] Received first state update from Zero: counter:6 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080" > > > > zeros:<key:1 value:<id:1 addr:"dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080" leader:true > > maxRaftId:1 cid:"1e6ff595-d4e5-4eef-b34a-e9a5deccc286" license:<maxNodes:18446744073709551615 expiryTs:1619785753 enabled:true > 
I0407 14:02:28.496849      22 node.go:189] Setting conf state to nodes:2 nodes:6 
I0407 14:02:29.367930      22 pool.go:162] CONNECTING to dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080
I0407 14:02:29.495391      22 groups.go:487] Serving tablet for: dgraph.graphql.xid
I0407 14:02:29.495976      22 groups.go:487] Serving tablet for: type
I0407 14:02:29.496413      22 groups.go:487] Serving tablet for: dgraph.cors
I0407 14:02:29.496797      22 groups.go:487] Serving tablet for: source_type
I0407 14:02:29.497184      22 groups.go:487] Serving tablet for: dgraph.type
I0407 14:02:29.497629      22 groups.go:487] Serving tablet for: source
I0407 14:02:29.497956      22 groups.go:487] Serving tablet for: primary
I0407 14:02:29.498414      22 groups.go:487] Serving tablet for: model_type
I0407 14:02:29.498816      22 groups.go:487] Serving tablet for: dgraph.graphql.schema_history
I0407 14:02:29.499188      22 groups.go:487] Serving tablet for: dgraph.graphql.schema_created_at
I0407 14:02:29.499554      22 groups.go:487] Serving tablet for: source_id
I0407 14:02:29.499918      22 groups.go:487] Serving tablet for: dgraph.graphql.schema
I0407 14:02:29.500289      22 groups.go:487] Serving tablet for: dgraph.graphql.p_sha256hash
I0407 14:02:29.500648      22 groups.go:487] Serving tablet for: create_time
I0407 14:02:29.500990      22 groups.go:487] Serving tablet for: identity_id
I0407 14:02:29.501364      22 groups.go:487] Serving tablet for: dgraph.drop.op
I0407 14:02:29.501726      22 groups.go:487] Serving tablet for: account_relation
I0407 14:02:29.502071      22 groups.go:487] Serving tablet for: dgraph.graphql.p_query
I0407 14:02:29.502520      22 groups.go:487] Serving tablet for: relation
I0407 14:02:29.502871      22 groups.go:487] Serving tablet for: namespace
I0407 14:02:29.503226      22 groups.go:487] Serving tablet for: tenant_id
I0407 14:02:29.503300      22 groups.go:159] Server is ready
I0407 14:02:29.503310      22 access_ee.go:390] ResetAcl closed
I0407 14:02:29.503315      22 access_ee.go:311] RefreshAcls closed
I0407 14:02:33.370544      22 admin.go:697] No GraphQL schema in Dgraph; serving empty GraphQL API
I0407 14:03:29.503654      22 graphql.go:75] Unable to upsert cors. Error: While proposing: context deadline exceeded
I0407 14:03:29.603740      22 graphql.go:41] ResetCors closed
E0407 14:03:29.842789      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:31.068010      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:32.293097      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:33.517424      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:34.741001      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:35.965153      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:37.190705      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:38.413894      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:39.641258      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:40.867069      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:42.090316      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:43.315024      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:44.551644      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:45.775282      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:46.998709      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:48.221983      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:49.458350      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:50.684947      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:51.909719      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results
E0407 14:03:53.132926      22 run.go:767] Error while retrieving cors origins: GetCorsOrigins returned 0 results

As you’ve noted, when removing a Zero you’ll need to update the StatefulSet command so that the idx and the peer config are updated appropriately.
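
As a rough illustration of what the updated command has to work out, here is a minimal Python sketch. It is not the chart's actual startup script. The helper names `pick_peer` and `zero_args` and the shape of the `zeros` mapping are assumptions made for this example, and the idx value 4 is only a placeholder for "an idx that no Zero has used before". The flags `--my`, `--idx`, `--peer` and `--replicas` are the ones `dgraph zero` accepts.

```python
from typing import Dict, List, Optional


def pick_peer(zeros: Dict[str, dict], my_addr: str) -> Optional[str]:
    """Choose an address for --peer, preferring the current Zero leader.

    `zeros` is a hypothetical mapping of Raft id -> {"addr": str, "leader": bool}.
    The pod's own address is never returned, so --peer cannot point to itself.
    """
    others = [m for m in zeros.values() if m.get("addr") != my_addr]
    if not others:
        return None
    leaders = [m for m in others if m.get("leader")]
    return (leaders or others)[0]["addr"]


def zero_args(my_addr: str, idx: int, peer: Optional[str], replicas: int = 3) -> List[str]:
    """Build the argument list for a Zero that joins an existing cluster."""
    args = ["dgraph", "zero", f"--my={my_addr}", f"--idx={idx}", f"--replicas={replicas}"]
    if peer is not None:
        # Only the very first Zero of a brand-new cluster starts without --peer.
        args.append(f"--peer={peer}")
    return args


# Example: zero-0 was removed; zero-1 (leader) and zero-2 are still healthy.
suffix = "dgraph-zero.crm-test.svc.cluster.local:5080"
me = f"dgraph-zero-0.{suffix}"
zeros = {
    "2": {"addr": f"dgraph-zero-1.{suffix}", "leader": True},
    "3": {"addr": f"dgraph-zero-2.{suffix}", "leader": False},
}
print(" ".join(zero_args(me, idx=4, peer=pick_peer(zeros, me))))
```

The point of the sketch is the two conditions the replacement pod needs: an idx different from the one the removed Zero used, and a --peer that is a live member of the existing cluster rather than the pod itself.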

Yes, but the startup command is in the configuration file of k8s. Changing the StatefulSet command will cause all pods to restart. Is there any recommended automated configuration for this situation?

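If you don’t want the RollingUpdate behavior when a StatefulSet config changes, you can set updateStrategy to OnDelete. With OnDelete, the controller does not restart pods automatically after a config change; each pod picks up the new config only when you delete it, so you can restart the pods one by one yourself. Reference: StatefulSets | Kubernetes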
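
One possible way to switch an existing StatefulSet to that strategy is a merge patch. The sketch below only builds the `kubectl patch` command and does not run it. The StatefulSet name `dgraph-zero` and namespace `crm-test` are inferred from the pod hostnames in the logs above and may differ in your setup, and `on_delete_patch_cmd` is a hypothetical helper name.

```python
import json
from typing import List


def on_delete_patch_cmd(statefulset: str, namespace: str) -> List[str]:
    """Build a kubectl command that sets updateStrategy.type to OnDelete."""
    patch = {"spec": {"updateStrategy": {"type": "OnDelete"}}}
    return [
        "kubectl", "patch", "statefulset", statefulset,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


# Names below are assumptions taken from the hostnames in the logs.
cmd = on_delete_patch_cmd("dgraph-zero", "crm-test")
print(" ".join(cmd))
```

After the strategy is changed, editing the startup command in the pod template leaves running pods untouched; only a pod that is deleted comes back with the new command.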

First of all, I am sorry for only replying now. Secondly, I did successfully add the node to the cluster. Thank you very much for your guidance.