Alpha node restart failed


Report a Dgraph Bug

Dgraph is deployed in Kubernetes. I needed to verify whether the Alpha node's data stays intact after another machine restarts. Because the Alpha node's data is stored in a PVC, I manually deleted the PVC, but the corresponding pod failed to restart after the deletion. Please take a look, thank you!
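For reference, the pod status and the log below can be gathered with something like the following (a sketch; the pod name, label, and namespace are taken from elsewhere in this thread):

kubectl -n crm-test get pods -l app=dgraph-alpha        # pod keeps failing to come back up
kubectl -n crm-test describe pod dgraph-alpha-0         # events for the failed restart
kubectl -n crm-test logs dgraph-alpha-0 --previous      # produces the output below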

++ hostname -f
+ dgraph alpha --my=dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080 --zero dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080,dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080,dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080
[Decoder]: Using assembly version of decoder
Page Size: 4096
[Sentry] 2021/03/16 14:45:26 Integration installed: ContextifyFrames
[Sentry] 2021/03/16 14:45:26 Integration installed: Environment
[Sentry] 2021/03/16 14:45:26 Integration installed: Modules
[Sentry] 2021/03/16 14:45:26 Integration installed: IgnoreErrors
[Decoder]: Using assembly version of decoder
Page Size: 4096
[Sentry] 2021/03/16 14:45:26 Integration installed: ContextifyFrames
[Sentry] 2021/03/16 14:45:26 Integration installed: Environment
[Sentry] 2021/03/16 14:45:26 Integration installed: Modules
[Sentry] 2021/03/16 14:45:26 Integration installed: IgnoreErrors
I0316 14:45:26.815633      18 sentry_integration.go:48] This instance of Dgraph will send anonymous reports of panics back to Dgraph Labs via Sentry. No confidential information is sent. These reports help improve Dgraph. To opt-out, restart your instance with the --enable_sentry=false flag. For more info, see https://dgraph.io/docs/howto/#data-handling.
I0316 14:45:26.816333      18 util_ee.go:126] KeyReader instantiated of type <nil>
I0316 14:45:27.021901      18 init.go:107] 

Dgraph version   : v20.11.0
Dgraph codename  : tchalla
Dgraph SHA-256   : 8acb886b24556691d7d74929817a4ac7d9db76bb8b77de00f44650931a16b6ac
Commit SHA-1     : c4245ad55
Commit timestamp : 2020-12-16 15:55:40 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs/.
For discussions about Dgraph     , visit http://discuss.dgraph.io.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


I0316 14:45:27.021924      18 run.go:696] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000 MutationsNQuadLimit:1000000 PollInterval:1s GraphqlExtension:true GraphqlDebug:false GraphqlLambdaUrl:}
I0316 14:45:27.021966      18 run.go:697] x.WorkerConfig: {TmpDir:t ExportPath:export NumPendingProposals:256 Tracing:0.01 MyAddr:dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080 ZeroAddr:[dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080] TLSClientConfig:<nil> TLSServerConfig:<nil> RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2021-03-16 14:45:26.398492004 +0000 UTC m=+0.019960014 LudicrousMode:false LudicrousConcurrency:2000 EncryptionKey:**** LogRequest:0 HardSync:false}
I0316 14:45:27.022019      18 run.go:698] worker.Config: {PostingDir:p PostingDirCompression:1 PostingDirCompressionLevel:0 WALDir:w MutationsMode:0 AuthToken: PBlockCacheSize:697932185 PIndexCacheSize:375809638 WalCache:0 HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s CachePercentage:0,65,35,0 CacheMb:0}
I0316 14:45:27.023020      18 log.go:295] Found file: 12 First Index: 0
I0316 14:45:27.024088      18 storage.go:132] Init Raft Storage with snap: 0, first: 1, last: 0
I0316 14:45:27.024113      18 server_state.go:76] Setting Posting Dir Compression Level: 0
I0316 14:45:27.024126      18 server_state.go:120] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x2dffcb8 Compression:1 InMemory:false MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0316 14:45:27.030413      18 log.go:34] All 0 tables opened in 0s
I0316 14:45:27.030569      18 log.go:34] Discard stats nextEmptySlot: 0
I0316 14:45:27.030593      18 log.go:34] Set nextTxnTs to 0
I0316 14:45:27.030765      18 log.go:34] Deleting empty file: p/000011.vlog
E0316 14:45:27.031424      18 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\vdgraph.cors\x00": Unable to find any servers for group: 1. closer err: <nil>
I0316 14:45:27.031473      18 groups.go:99] Current Raft Id: 0x1
I0316 14:45:27.031654      18 worker.go:104] Worker listening at address: [::]:7080
I0316 14:45:27.032811      18 run.go:519] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0316 14:45:27.032827      18 run.go:520] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
E0316 14:45:27.032801      18 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0316 14:45:27.032851      18 run.go:552] gRPC server started.  Listening on port 9080
I0316 14:45:27.032863      18 run.go:553] HTTP server started.  Listening on port 8080
I0316 14:45:27.131924      18 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0316 14:45:27.136471      18 groups.go:127] Connected to group zero. Assigned group: 0
I0316 14:45:27.136485      18 groups.go:129] Raft Id after connection to Zero: 0x1
I0316 14:45:27.136532      18 pool.go:162] CONNECTING to dgraph-alpha-1.dgraph-alpha.crm-test.svc.cluster.local:7080
I0316 14:45:27.136553      18 pool.go:162] CONNECTING to dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080
I0316 14:45:27.136606      18 pool.go:162] CONNECTING to dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080
I0316 14:45:27.136632      18 pool.go:162] CONNECTING to dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080
I0316 14:45:27.136649      18 draft.go:230] Node ID: 0x1 with GroupID: 1
I0316 14:45:27.136708      18 node.go:152] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00003e500 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2dffcb8 DisableProposalForwarding:false}
I0316 14:45:27.136778      18 node.go:326] Group 1 found 0 entries
I0316 14:45:27.136787      18 draft.go:1650] Calling IsPeer
I0316 14:45:27.138466      18 draft.go:1655] Done with IsPeer call
I0316 14:45:27.138497      18 draft.go:1689] Restarting node for group: 1
I0316 14:45:27.138544      18 log.go:34] 1 became follower at term 0
I0316 14:45:27.138556      18 log.go:34] newRaft 1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
I0316 14:45:27.138577      18 draft.go:180] Operation started with id: opRollup
I0316 14:45:27.138628      18 groups.go:807] Got address of a Zero leader: dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0316 14:45:27.138634      18 draft.go:1084] Found Raft progress: 0
I0316 14:45:27.138845      18 groups.go:821] Starting a new membership stream receive from dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080.
I0316 14:45:27.140563      18 groups.go:838] Received first state update from Zero: counter:75876 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080" last_update:1615898699 > > members:<key:2 value:<id:2 group_id:1 addr:"dgraph-alpha-1.dgraph-alpha.crm-test.svc.cluster.local:7080" last_update:1615898476 > > members:<key:3 value:<id:3 group_id:1 addr:"dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080" leader:true last_update:1615898821 > > tablets:<key:"account_relation" value:<group_id:1 predicate:"account_relation" > > tablets:<key:"create_time" value:<group_id:1 predicate:"create_time" on_disk_bytes:209158072 uncompressed_bytes:909846316 > > tablets:<key:"dgraph.cors" value:<group_id:1 predicate:"dgraph.cors" on_disk_bytes:199 uncompressed_bytes:75 > > tablets:<key:"dgraph.drop.op" value:<group_id:1 predicate:"dgraph.drop.op" > > tablets:<key:"dgraph.graphql.p_query" value:<group_id:1 predicate:"dgraph.graphql.p_query" > > tablets:<key:"dgraph.graphql.p_sha256hash" value:<group_id:1 predicate:"dgraph.graphql.p_sha256hash" > > tablets:<key:"dgraph.graphql.schema" value:<group_id:1 predicate:"dgraph.graphql.schema" > > tablets:<key:"dgraph.graphql.schema_created_at" value:<group_id:1 predicate:"dgraph.graphql.schema_created_at" > > tablets:<key:"dgraph.graphql.schema_history" value:<group_id:1 predicate:"dgraph.graphql.schema_history" > > tablets:<key:"dgraph.graphql.xid" value:<group_id:1 predicate:"dgraph.graphql.xid" > > tablets:<key:"dgraph.type" value:<group_id:1 predicate:"dgraph.type" on_disk_bytes:181031655 uncompressed_bytes:652629421 > > tablets:<key:"identity_id" value:<group_id:1 predicate:"identity_id" on_disk_bytes:668112128 uncompressed_bytes:1123320948 > > tablets:<key:"model_type" value:<group_id:1 predicate:"model_type" > > tablets:<key:"namespace" value:<group_id:1 predicate:"namespace" on_disk_bytes:118698642 uncompressed_bytes:441263005 > > tablets:<key:"primary" value:<group_id:1 predicate:"primary" on_disk_bytes:51624206 uncompressed_bytes:189042002 > > tablets:<key:"relation" value:<group_id:1 predicate:"relation" on_disk_bytes:141790681 uncompressed_bytes:303153262 > > tablets:<key:"source" value:<group_id:1 predicate:"source" on_disk_bytes:150636664 uncompressed_bytes:321591343 > > tablets:<key:"source_id" value:<group_id:1 predicate:"source_id" on_disk_bytes:326471204 uncompressed_bytes:564324005 > > tablets:<key:"source_type" value:<group_id:1 predicate:"source_type" on_disk_bytes:41291249 uncompressed_bytes:165319164 > > tablets:<key:"tenant_id" value:<group_id:1 predicate:"tenant_id" on_disk_bytes:478459823 uncompressed_bytes:870357015 > > tablets:<key:"type" value:<group_id:1 predicate:"type" on_disk_bytes:110518468 uncompressed_bytes:384008394 > > snapshot_ts:278852 checksum:12249075648478441628 > > zeros:<key:1 value:<id:1 addr:"dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080" > > zeros:<key:3 value:<id:3 addr:"dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080" > > maxLeaseId:15190000 maxTxnTs:290000 maxRaftId:3 cid:"fec6a030-0c07-4fe9-a775-219482e41177" license:<maxNodes:18446744073709551615 expiryTs:1617290026 enabled:true > 
I0316 14:45:29.533598      18 log.go:34] 1 [term: 0] received a MsgHeartbeat message with higher term from 3 [term: 18]
I0316 14:45:29.533633      18 log.go:34] 1 became follower at term 18
2021/03/16 14:45:29 tocommit(266103) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(266103) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

goroutine 248 [running]:
log.Panicf(0x1dd447a, 0x5d, 0xc00043d860, 0x2, 0x2)
	/usr/local/go/src/log/log.go:358 +0xc5
github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x2dffcb8, 0x1dd447a, 0x5d, 0xc00043d860, 0x2, 0x2)
	/ext-go/1/src/github.com/dgraph-io/dgraph/x/log.go:40 +0x53
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc00021e380, 0x40f77)
	/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/log.go:203 +0x135
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc00b870000, 0x8, 0x1, 0x3, 0x12, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1324 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc00b870000, 0x8, 0x1, 0x3, 0x12, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:1269 +0x439
go.etcd.io/etcd/raft.(*raft).Step(0xc00b870000, 0x8, 0x1, 0x3, 0x12, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/raft.go:971 +0x1218
go.etcd.io/etcd/raft.(*node).run(0xc0006d6e40, 0xc00b870000)
	/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:357 +0x1178
created by go.etcd.io/etcd/raft.RestartNode
	/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190228193606-a943ad0ee4c9/raft/node.go:246 +0x346
[Sentry] 2021/03/16 14:45:29 Sending fatal event [1487316add634967ad1d228b5b49d3cb] to o318308.ingest.sentry.io project: 1805390
[Sentry] 2021/03/16 14:45:31 Buffer flushing reached the timeout.

What version of Dgraph are you using?

Dgraph Version
$ dgraph version
 Dgraph version   : v20.11.0
Dgraph codename  : tchalla
Dgraph SHA-256   : 8acb886b24556691d7d74929817a4ac7d9db76bb8b77de00f44650931a16b6ac
Commit SHA-1     : c4245ad55
Commit timestamp : 2020-12-16 15:55:40 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

Have you tried reproducing the issue with the latest release?

What is the hardware spec (RAM, OS)?

Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-dgraph-alpha-0
    ReadOnly:   false
  default-token-w2qj9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-w2qj9
    Optional:    false

Steps to reproduce the issue (command/config used to run Dgraph).

Manually delete the PVC corresponding to alpha pod
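Concretely, the reproduction amounts to something like this (a sketch; the claim name follows the datadir-<pod> convention shown in the Volumes section above):

kubectl -n crm-test delete pvc datadir-dgraph-alpha-0   # drop the Alpha's data volume claim
kubectl -n crm-test delete pod dgraph-alpha-0           # the StatefulSet recreates the pod, which then panics as above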


After I tried to remove the failed node, the restarted node reported a duplicate Raft ID:
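The removal was presumably done through Dgraph Zero's /removeNode endpoint on port 6080, along these lines (a sketch; the id and group values match the removed:<id:3 group_id:1 ...> entry that shows up in the membership state later in this thread):

curl "localhost:6080/removeNode?group=1&id=3"   # tell Zero to drop the dead Alpha with Raft ID 3 from group 1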

++ hostname -f
+ dgraph alpha --my=dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080 --zero dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080,dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080,dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080
[Decoder]: Using assembly version of decoder
Page Size: 4096
[Sentry] 2021/03/18 11:38:34 Integration installed: ContextifyFrames
[Sentry] 2021/03/18 11:38:34 Integration installed: Environment
[Sentry] 2021/03/18 11:38:34 Integration installed: Modules
[Sentry] 2021/03/18 11:38:34 Integration installed: IgnoreErrors
[Decoder]: Using assembly version of decoder
Page Size: 4096
[Sentry] 2021/03/18 11:38:34 Integration installed: ContextifyFrames
[Sentry] 2021/03/18 11:38:34 Integration installed: Environment
[Sentry] 2021/03/18 11:38:34 Integration installed: Modules
[Sentry] 2021/03/18 11:38:34 Integration installed: IgnoreErrors
I0318 11:38:35.172320      19 sentry_integration.go:48] This instance of Dgraph will send anonymous reports of panics back to Dgraph Labs via Sentry. No confidential information is sent. These reports help improve Dgraph. To opt-out, restart your instance with the --enable_sentry=false flag. For more info, see https://dgraph.io/docs/howto/#data-handling.
I0318 11:38:35.379017      19 init.go:107] 

Dgraph version   : v20.11.2
Dgraph codename  : tchalla-2
Dgraph SHA-256   : 0153cb8d3941ad5ad107e395b347e8d930a0b4ead6f4524521f7a525a9699167
Commit SHA-1     : 94f3a0430
Commit timestamp : 2021-02-23 13:07:17 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs/.
For discussions about Dgraph     , visit http://discuss.dgraph.io.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


I0318 11:38:35.379045      19 run.go:696] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000 MutationsNQuadLimit:1000000 PollInterval:1s GraphqlExtension:true GraphqlDebug:false GraphqlLambdaUrl:}
I0318 11:38:35.379086      19 run.go:697] x.WorkerConfig: {TmpDir:t ExportPath:export NumPendingProposals:256 Tracing:0.01 MyAddr:dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080 ZeroAddr:[dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080] TLSClientConfig:<nil> TLSServerConfig:<nil> RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2021-03-18 11:38:34.735965472 +0000 UTC m=+0.014797633 LudicrousMode:false LudicrousConcurrency:2000 EncryptionKey:**** LogRequest:0 HardSync:false}
I0318 11:38:35.379133      19 run.go:698] worker.Config: {PostingDir:p PostingDirCompression:1 PostingDirCompressionLevel:0 WALDir:w MutationsMode:0 AuthToken: PBlockCacheSize:697932185 PIndexCacheSize:375809638 WalCache:0 HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s CachePercentage:0,65,35,0 CacheMb:0}
I0318 11:38:35.379282      19 log.go:295] Found file: 37 First Index: 0
I0318 11:38:35.380563      19 storage.go:132] Init Raft Storage with snap: 0, first: 1, last: 0
I0318 11:38:35.380585      19 server_state.go:76] Setting Posting Dir Compression Level: 0
I0318 11:38:35.380591      19 server_state.go:120] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x2e0fef8 Compression:1 InMemory:false MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0318 11:38:35.387582      19 log.go:34] All 0 tables opened in 0s
I0318 11:38:35.387922      19 log.go:34] Discard stats nextEmptySlot: 0
I0318 11:38:35.387944      19 log.go:34] Set nextTxnTs to 0
I0318 11:38:35.388016      19 log.go:34] Deleting empty file: p/000036.vlog
E0318 11:38:35.388564      19 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\vdgraph.cors\x00": Unable to find any servers for group: 1. closer err: <nil>
I0318 11:38:35.388580      19 groups.go:99] Current Raft Id: 0x3
I0318 11:38:35.388783      19 worker.go:104] Worker listening at address: [::]:7080
E0318 11:38:35.389923      19 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0318 11:38:35.389938      19 run.go:519] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0318 11:38:35.389952      19 run.go:520] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0318 11:38:35.389983      19 run.go:552] gRPC server started.  Listening on port 9080
I0318 11:38:35.389996      19 run.go:553] HTTP server started.  Listening on port 8080
I0318 11:38:35.489457      19 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0318 11:38:35.493871      19 pool.go:162] CONNECTING to dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080
[Sentry] 2021/03/18 11:38:35 Sending fatal event [bf638e8034de402888a1bd942a576349] to o318308.ingest.sentry.io project: 1805390
2021/03/18 11:38:35 rpc error: code = Unknown desc = REUSE_RAFTID: Duplicate Raft ID 3 to removed member: id:3 group_id:1 addr:"dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080" last_update:1615991006 

This looks like bad management of your cluster. You should not reuse the same Raft ID. If you are trying to create a new Alpha, you should use a new volume or clean up the path that the Alpha will run in.
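In the k8s setup here, "use a new volume" roughly means deleting the stale PVC together with its pod so the StatefulSet provisions a fresh, empty one (a sketch; names assume the dgraph-alpha-2 pod from the log above, and this discards whatever was left on that volume):

kubectl -n crm-test delete pvc datadir-dgraph-alpha-2
kubectl -n crm-test delete pod dgraph-alpha-2
# the recreated pod starts with an empty /dgraph and is assigned a brand-new Raft ID by Zero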

I'm not sure about the other error above, but it could be related to the same mismanagement and mixing of paths across new Alphas.


Yes, thanks for the reminder. After restarting the pod with a new PVC, it did get a new Raft ID and joined the cluster.

Sorry, I just noticed that although the node starts normally, the service is not available. Can you help me take another look?

[Sentry] 2021/03/19 05:54:02 Integration installed: ContextifyFrames
[Sentry] 2021/03/19 05:54:02 Integration installed: Environment
[Sentry] 2021/03/19 05:54:02 Integration installed: Modules
[Sentry] 2021/03/19 05:54:02 Integration installed: IgnoreErrors
[Decoder]: Using assembly version of decoder
Page Size: 4096
[Sentry] 2021/03/19 05:54:02 Integration installed: ContextifyFrames
[Sentry] 2021/03/19 05:54:02 Integration installed: Environment
[Sentry] 2021/03/19 05:54:02 Integration installed: Modules
[Sentry] 2021/03/19 05:54:02 Integration installed: IgnoreErrors
I0319 05:54:03.086588      19 sentry_integration.go:48] This instance of Dgraph will send anonymous reports of panics back to Dgraph Labs via Sentry. No confidential information is sent. These reports help improve Dgraph. To opt-out, restart your instance with the --enable_sentry=false flag. For more info, see https://dgraph.io/docs/howto/#data-handling.
I0319 05:54:03.300010      19 init.go:107] 

Dgraph version   : v20.11.2
Dgraph codename  : tchalla-2
Dgraph SHA-256   : 0153cb8d3941ad5ad107e395b347e8d930a0b4ead6f4524521f7a525a9699167
Commit SHA-1     : 94f3a0430
Commit timestamp : 2021-02-23 13:07:17 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs/.
For discussions about Dgraph     , visit http://discuss.dgraph.io.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


I0319 05:54:03.300045      19 run.go:696] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000 MutationsNQuadLimit:1000000 PollInterval:1s GraphqlExtension:true GraphqlDebug:false GraphqlLambdaUrl:}
I0319 05:54:03.300082      19 run.go:697] x.WorkerConfig: {TmpDir:t ExportPath:export NumPendingProposals:256 Tracing:0.01 MyAddr:dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080 ZeroAddr:[dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080 dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080] TLSClientConfig:<nil> TLSServerConfig:<nil> RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2021-03-19 05:54:02.664257482 +0000 UTC m=+0.014777776 LudicrousMode:false LudicrousConcurrency:2000 EncryptionKey:**** LogRequest:0 HardSync:false}
I0319 05:54:03.300131      19 run.go:698] worker.Config: {PostingDir:p PostingDirCompression:1 PostingDirCompressionLevel:0 WALDir:w MutationsMode:0 AuthToken: PBlockCacheSize:697932185 PIndexCacheSize:375809638 WalCache:0 HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s CachePercentage:0,65,35,0 CacheMb:0}
I0319 05:54:03.300288      19 log.go:295] Found file: 2 First Index: 0
I0319 05:54:03.301570      19 storage.go:132] Init Raft Storage with snap: 0, first: 1, last: 0
I0319 05:54:03.301592      19 server_state.go:76] Setting Posting Dir Compression Level: 0
I0319 05:54:03.301602      19 server_state.go:120] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x2e0fef8 Compression:1 InMemory:false MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0319 05:54:03.308202      19 log.go:34] All 0 tables opened in 0s
I0319 05:54:03.308555      19 log.go:34] Discard stats nextEmptySlot: 0
I0319 05:54:03.308579      19 log.go:34] Set nextTxnTs to 0
I0319 05:54:03.308680      19 log.go:34] Deleting empty file: p/000001.vlog
E0319 05:54:03.309366      19 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\vdgraph.cors\x00": Unable to find any servers for group: 1. closer err: <nil>
I0319 05:54:03.309395      19 groups.go:99] Current Raft Id: 0x5
I0319 05:54:03.309469      19 worker.go:104] Worker listening at address: [::]:7080
I0319 05:54:03.310726      19 run.go:519] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
E0319 05:54:03.310723      19 groups.go:1143] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0319 05:54:03.310747      19 run.go:520] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0319 05:54:03.310774      19 run.go:552] gRPC server started.  Listening on port 9080
I0319 05:54:03.310783      19 run.go:553] HTTP server started.  Listening on port 8080
I0319 05:54:03.409805      19 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080
I0319 05:54:03.413999      19 pool.go:162] CONNECTING to dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080
I0319 05:54:03.417395      19 groups.go:127] Connected to group zero. Assigned group: 0
I0319 05:54:03.417410      19 groups.go:129] Raft Id after connection to Zero: 0x5
I0319 05:54:03.417461      19 pool.go:162] CONNECTING to dgraph-alpha-1.dgraph-alpha.crm-test.svc.cluster.local:7080
I0319 05:54:03.417484      19 pool.go:162] CONNECTING to dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080
I0319 05:54:03.417517      19 pool.go:162] CONNECTING to dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080
I0319 05:54:03.417543      19 draft.go:230] Node ID: 0x5 with GroupID: 1
I0319 05:54:03.417602      19 node.go:152] Setting raft.Config to: &{ID:5 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00031ab40 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2e0fef8 DisableProposalForwarding:false}
I0319 05:54:03.417720      19 node.go:321] Found hardstate: {Term:26 Vote:0 Commit:0 XXX_unrecognized:[]}
I0319 05:54:03.417736      19 node.go:326] Group 1 found 0 entries
I0319 05:54:03.417741      19 draft.go:1689] Restarting node for group: 1
I0319 05:54:03.417757      19 log.go:34] 5 became follower at term 26
I0319 05:54:03.417765      19 log.go:34] newRaft 5 [peers: [], term: 26, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
I0319 05:54:03.417782      19 draft.go:180] Operation started with id: opRollup
I0319 05:54:03.417866      19 groups.go:807] Got address of a Zero leader: dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080
I0319 05:54:03.417864      19 draft.go:1084] Found Raft progress: 0
I0319 05:54:03.417979      19 groups.go:821] Starting a new membership stream receive from dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080.
I0319 05:54:03.418953      19 groups.go:838] Received first state update from Zero: counter:76883 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-alpha-0.dgraph-alpha.crm-test.svc.cluster.local:7080" last_update:1615990975 > > members:<key:2 value:<id:2 group_id:1 addr:"dgraph-alpha-1.dgraph-alpha.crm-test.svc.cluster.local:7080" leader:true last_update:1616058242 > > members:<key:5 value:<id:5 group_id:1 addr:"dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080" > > tablets:<key:"account_relation" value:<group_id:1 predicate:"account_relation" > > tablets:<key:"create_time" value:<group_id:1 predicate:"create_time" on_disk_bytes:192143375 uncompressed_bytes:840031838 > > tablets:<key:"dgraph.cors" value:<group_id:1 predicate:"dgraph.cors" on_disk_bytes:199 uncompressed_bytes:75 > > tablets:<key:"dgraph.drop.op" value:<group_id:1 predicate:"dgraph.drop.op" > > tablets:<key:"dgraph.graphql.p_query" value:<group_id:1 predicate:"dgraph.graphql.p_query" > > tablets:<key:"dgraph.graphql.p_sha256hash" value:<group_id:1 predicate:"dgraph.graphql.p_sha256hash" > > tablets:<key:"dgraph.graphql.schema" value:<group_id:1 predicate:"dgraph.graphql.schema" > > tablets:<key:"dgraph.graphql.schema_created_at" value:<group_id:1 predicate:"dgraph.graphql.schema_created_at" > > tablets:<key:"dgraph.graphql.schema_history" value:<group_id:1 predicate:"dgraph.graphql.schema_history" > > tablets:<key:"dgraph.graphql.xid" value:<group_id:1 predicate:"dgraph.graphql.xid" > > tablets:<key:"dgraph.type" value:<group_id:1 predicate:"dgraph.type" on_disk_bytes:183491132 uncompressed_bytes:661563607 > > tablets:<key:"identity_id" value:<group_id:1 predicate:"identity_id" on_disk_bytes:668112128 uncompressed_bytes:1123320948 > > tablets:<key:"model_type" value:<group_id:1 predicate:"model_type" > > tablets:<key:"namespace" value:<group_id:1 predicate:"namespace" on_disk_bytes:118698642 uncompressed_bytes:441263005 > > tablets:<key:"primary" value:<group_id:1 predicate:"primary" on_disk_bytes:34194501 uncompressed_bytes:125213558 > > tablets:<key:"relation" value:<group_id:1 predicate:"relation" on_disk_bytes:141790681 uncompressed_bytes:303153262 > > tablets:<key:"source" value:<group_id:1 predicate:"source" on_disk_bytes:133531354 uncompressed_bytes:286461038 > > tablets:<key:"source_id" value:<group_id:1 predicate:"source_id" on_disk_bytes:326471204 uncompressed_bytes:564324005 > > tablets:<key:"source_type" value:<group_id:1 predicate:"source_type" on_disk_bytes:28049557 uncompressed_bytes:112303541 > > tablets:<key:"tenant_id" value:<group_id:1 predicate:"tenant_id" on_disk_bytes:366011308 uncompressed_bytes:793126415 > > tablets:<key:"type" value:<group_id:1 predicate:"type" on_disk_bytes:130700470 uncompressed_bytes:449464903 > > snapshot_ts:328729 checksum:12249075648478441628 > > zeros:<key:1 value:<id:1 addr:"dgraph-zero-0.dgraph-zero.crm-test.svc.cluster.local:5080" > > zeros:<key:2 value:<id:2 addr:"dgraph-zero-1.dgraph-zero.crm-test.svc.cluster.local:5080" leader:true > > zeros:<key:3 value:<id:3 addr:"dgraph-zero-2.dgraph-zero.crm-test.svc.cluster.local:5080" > > maxLeaseId:15200000 maxTxnTs:350000 maxRaftId:5 removed:<id:3 group_id:1 addr:"dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080" last_update:1615991006 > removed:<id:4 group_id:1 addr:"dgraph-alpha-2.dgraph-alpha.crm-test.svc.cluster.local:7080" > cid:"fec6a030-0c07-4fe9-a775-219482e41177" license:<maxNodes:18446744073709551615 expiryTs:1617290026 enabled:true > 
I0319 05:54:08.310892      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
I0319 05:54:13.311046      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
I0319 05:54:18.311204      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
I0319 05:54:23.311344      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
I0319 05:54:28.311539      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
I0319 05:54:33.311694      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
I0319 05:54:36.575308      19 log.go:34] raft.node: 5 elected leader 2 at term 26
I0319 05:54:38.311822      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
I0319 05:54:43.311980      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
I0319 05:54:48.312119      19 admin.go:686] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.

After the restart, schema synchronization failed, and the rejoined node was restarted several times. Watching the Alpha leader's log, I found that every time the leader sends a synchronization message, it goes to the node's IP from before the restart, which results in a timeout.
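One way to check which address the cluster has registered for the rejoined member (and whether it is the svc DNS name or a raw IP) is Zero's /state endpoint, roughly like this (a sketch; it assumes the dgraph-zero-public service from the manifest shared below and that jq is installed, and the jq filter is illustrative):

kubectl -n crm-test port-forward svc/dgraph-zero-public 6080:6080
# in a second terminal:
curl -s localhost:6080/state | jq '.groups."1".members'   # shows each member's id and addr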
Before restart:

Name:           dgraph-alpha-2
Namespace:      crm-test
Priority:       0
Node:           yq01-qianmo-f12-ssd-com-61-169-35.yq01.baidu.com/10.61.169.35
Start Time:     Fri, 19 Mar 2021 14:48:21 +0800
Labels:         app=dgraph-alpha
                controller-revision-hash=dgraph-alpha-7759bb686f
                statefulset.kubernetes.io/pod-name=dgraph-alpha-2
Annotations:    cni.projectcalico.org/podIP: 192.168.177.30/32
                cni.projectcalico.org/podIPs: 192.168.177.30/32
                kubectl.kubernetes.io/restartedAt: 2021-03-16T22:41:01+08:00
                sidecar.istio.io/inject: false
Status:         Running
IP:             192.168.177.30
Controlled By:  StatefulSet/dgraph-alpha

leader:

W0319 06:48:30.798970      19 node.go:420] Unable to send message to peer: 0x5. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.177.28:7080: i/o timeout"
W0319 06:48:31.874221      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:48:41.974243      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:48:52.074262      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection

After restart:

Name:           dgraph-alpha-2
Namespace:      crm-test
Priority:       0
Node:           yq01-qianmo-f12-ssd-com-61-169-35.yq01.baidu.com/10.61.169.35
Start Time:     Fri, 19 Mar 2021 14:56:41 +0800
Labels:         app=dgraph-alpha
                controller-revision-hash=dgraph-alpha-7759bb686f
                statefulset.kubernetes.io/pod-name=dgraph-alpha-2
Annotations:    cni.projectcalico.org/podIP: 192.168.177.36/32
                cni.projectcalico.org/podIPs: 192.168.177.36/32
                kubectl.kubernetes.io/restartedAt: 2021-03-16T22:41:01+08:00
                sidecar.istio.io/inject: false
Status:         Running
IP:             192.168.177.36
Controlled By:  StatefulSet/dgraph-alpha

leader:

W0319 06:48:30.798970      19 node.go:420] Unable to send message to peer: 0x5. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.177.28:7080: i/o timeout"
W0319 06:48:31.874221      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:48:41.974243      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:48:52.074262      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
I0319 06:49:03.885935      19 log.go:34] Block cache metrics: hit: 58342 miss: 1581724 keys-added: 340853 keys-updated: 15 keys-evicted: 184922 cost-added: 1510751754 cost-evicted: 812820598 sets-dropped: 0 sets-rejected: 1240578 gets-dropped: 21568 gets-kept: 1584064 gets-total: 1640066 hit-ratio: 0.04
I0319 06:54:03.885941      19 log.go:34] Block cache metrics: hit: 58342 miss: 1581724 keys-added: 340853 keys-updated: 15 keys-evicted: 184922 cost-added: 1510751754 cost-evicted: 812820598 sets-dropped: 0 sets-rejected: 1240578 gets-dropped: 21568 gets-kept: 1584064 gets-total: 1640066 hit-ratio: 0.04
W0319 06:56:55.091999      19 node.go:420] Unable to send message to peer: 0x5. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.177.30:7080: i/o timeout"
W0319 06:56:56.174319      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:57:06.274300      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:57:16.374332      19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection

At this point, I would recommend starting from scratch instead of troubleshooting every single step. For example, you said you were using K8s, but your logs now show it configured with a "LAN IP", which is not how the K8s setup works. That means the configs have been changed here and there. The best approach is to clean up the whole thing and start from zero, because fixing config mistakes piecemeal isn't easy.

Okay, thank you for the pointer. I will try removing the node again. By the way, what kind of configuration would cause it to communicate via a "LAN IP"?

I don't understand why you have a log like that. If you are using one of the YAML files we share in our repos, you should not change anything.

Yes, I only changed some storage configuration. This is the YAML we are using; please see if there is any problem with it.

# This highly available config creates 3 Dgraph Zeros, 3 Dgraph
# Alphas with 3 replicas, and 1 Ratel UI client. The Dgraph cluster
# will still be available to service requests even when one Zero
# and/or one Alpha are down.
#
# There are 3 services that can be exposed outside the cluster as needed:
#       dgraph-zero-public - To load data using Live & Bulk Loaders
#       dgraph-alpha-public - To connect clients and for HTTP APIs
#       dgraph-ratel-public - For Dgraph UI
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero-public
  labels:
    app: dgraph-zero
    monitor: zero-dgraph-io
spec:
  type: ClusterIP
  ports:
  - port: 5080
    targetPort: 5080
    name: grpc-zero
  - port: 6080
    targetPort: 6080
    name: http-zero
  selector:
    app: dgraph-zero
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha-public
  labels:
    app: dgraph-alpha
    monitor: alpha-dgraph-io
spec:
  type: ClusterIP
  ports:
  - port: 8080
    targetPort: 8080
    name: http-alpha
  - port: 9080
    targetPort: 9080
    name: grpc-alpha
  selector:
    app: dgraph-alpha
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-ratel-public
  labels:
    app: dgraph-ratel
spec:
  type: ClusterIP
  ports:
  - port: 8000
    targetPort: 8000
    name: http-ratel
  selector:
    app: dgraph-ratel
---
# This is a headless service which is necessary for discovery for a dgraph-zero StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero
  labels:
    app: dgraph-zero
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: grpc-zero
  clusterIP: None
  # We want all pods in the StatefulSet to have their addresses published for
  # the sake of the other Dgraph Zero pods even before they're ready, since they
  # have to be able to talk to each other in order to become ready.
  publishNotReadyAddresses: true
  selector:
    app: dgraph-zero
---
# This is a headless service which is necessary for discovery for a dgraph-alpha StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha
  labels:
    app: dgraph-alpha
spec:
  ports:
  - port: 7080
    targetPort: 7080
    name: grpc-alpha-int
  clusterIP: None
  # We want all pods in the StatefulSet to have their addresses published for
  # the sake of the other Dgraph alpha pods even before they're ready, since they
  # have to be able to talk to each other in order to become ready.
  publishNotReadyAddresses: true
  selector:
    app: dgraph-alpha
---
# This StatefulSet runs 3 Dgraph Zero.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph-zero
spec:
  serviceName: "dgraph-zero"
  replicas: 3
  selector:
    matchLabels:
      app: dgraph-zero
  
  template:
    metadata:
      labels:
        app: dgraph-zero
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-zero
              topologyKey: kubernetes.io/hostname
      containers:
      - name: zero
        image: iregistry.baidu-int.com/bizcrm/dgraph/dgraph:v20.11.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5080
          name: grpc-zero
        - containerPort: 6080
          name: http-zero
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        env:
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - bash
          - "-c"
          - |
            set -ex
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            idx=$(($ordinal + 1))
            if [[ $ordinal -eq 0 ]]; then
              exec dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 3
            else
              exec dgraph zero --my=$(hostname -f):5080 --peer dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 3
            fi
        livenessProbe:
          httpGet:
            path: /health
            port: 6080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
          successThreshold: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 6080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
          successThreshold: 1
      terminationGracePeriodSeconds: 60
      nodeSelector:
        disk_type: ssd
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: local-path-extra
      volumeMode: Filesystem
---
# This StatefulSet runs 3 replicas of Dgraph Alpha.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph-alpha
spec:
  serviceName: "dgraph-alpha"
  replicas: 3
  selector:
    matchLabels:
      app: dgraph-alpha
  template:
    metadata:
      labels:
        app: dgraph-alpha
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-alpha
              topologyKey: kubernetes.io/hostname
      # Initializing the Alphas:
      #
      # You may want to initialize the Alphas with data before starting, e.g.
      # with data from the Dgraph Bulk Loader: https://dgraph.io/docs/deploy/#bulk-loader.
      # You can accomplish by uncommenting this initContainers config. This
      # starts a container with the same /dgraph volume used by Alpha and runs
      # before Alpha starts.
      #
      # You can copy your local p directory to the pod's /dgraph/p directory
      # with this command:
      #
      #    kubectl cp path/to/p dgraph-alpha-0:/dgraph/ -c init-alpha
      #    (repeat for each alpha pod)
      #
      # When you're finished initializing each Alpha data directory, you can signal
      # it to terminate successfully by creating a /dgraph/doneinit file:
      #
      #    kubectl exec dgraph-alpha-0 -c init-alpha touch /dgraph/doneinit
      #
      # Note that pod restarts cause re-execution of Init Containers. Since
      # /dgraph is persisted across pod restarts, the Init Container will exit
      # automatically when /dgraph/doneinit is present and proceed with starting
      # the Alpha process.
      #
      # Tip: StatefulSet pods can start in parallel by configuring
      # .spec.podManagementPolicy to Parallel:
      #
      #     https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees
      #
      # initContainers:
      #   - name: init-alpha
      #     image: dgraph/dgraph:latest
      #     command:
      #       - bash
      #       - "-c"
      #       - |
      #         trap "exit" SIGINT SIGTERM
      #         echo "Write to /dgraph/doneinit when ready."
      #         until [ -f /dgraph/doneinit ]; do sleep 2; done
      #     volumeMounts:
      #       - name: datadir
      #         mountPath: /dgraph
      containers:
      - name: alpha
        image: iregistry.baidu-int.com/bizcrm/dgraph/dgraph:v20.11.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7080
          name: grpc-alpha-int
        - containerPort: 8080
          name: http-alpha
        - containerPort: 9080
          name: grpc-alpha
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        env:
          # This should be the same namespace as the dgraph-zero
          # StatefulSet to resolve a Dgraph Zero's DNS name for
          # Alpha's --zero flag.
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        # dgraph versions earlier than v1.2.3 and v20.03.0 can only support one zero:
        #  `dgraph alpha --zero dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080`
        # dgraph-alpha versions greater than or equal to v1.2.3 or v20.03.1 can support
        #  a comma-separated list of zeros.  The value below supports 3 zeros
        #  (set according to number of replicas)
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph alpha --my=$(hostname -f):7080 --zero dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-zero-1.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-zero-2.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080
        livenessProbe:
          httpGet:
            path: /health?live=1
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
          successThreshold: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
          successThreshold: 1
      terminationGracePeriodSeconds: 600
      nodeSelector:
        disk_type: ssd
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: local-path-extra
      volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dgraph-ratel
  labels:
    app: dgraph-ratel
spec:
  selector:
    matchLabels:
      app: dgraph-ratel
  template:
    metadata:
      labels:
        app: dgraph-ratel
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
      - name: ratel
        image: iregistry.baidu-int.com/bizcrm/dgraph/dgraph:v20.11.2
        ports:
        - containerPort: 8000
        command:
          - dgraph-ratel
      nodeSelector:
        disk_type: ssd

It looks fine; changing the storage doesn't do any harm. But it is odd that you end up with a local address instead of the svc one.

Try the following: start from scratch. Clean the whole K8s node and any bound host paths, and clean the storage itself. If the issue continues, share what else you are doing.
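"Starting from scratch" for this deployment would look roughly like the following (a sketch only; it deletes all Dgraph data, the PVC names follow the StatefulSet naming convention seen earlier, and dgraph-ha.yaml is a placeholder name for the manifest shared above):

kubectl -n crm-test delete statefulset dgraph-alpha dgraph-zero
kubectl -n crm-test delete pvc datadir-dgraph-alpha-0 datadir-dgraph-alpha-1 datadir-dgraph-alpha-2
kubectl -n crm-test delete pvc datadir-dgraph-zero-0 datadir-dgraph-zero-1 datadir-dgraph-zero-2
# if the local-path provisioner left data behind on the node, clean those host paths too
kubectl -n crm-test apply -f dgraph-ha.yaml             # then re-apply the manifest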

Don't delete it. It should recover with time. Are you removing the Alpha from the cluster? Don't do that. In the other ticket you said you had lost Alphas due to low resources. Fix that first and then start from scratch. Prevention is better than focusing on repairs.


Thank you. After I cleaned up the node again, it started normally, and data synchronization does indeed go through the svc. Cheers!

I recently saw this error when the zero and alpha are out of sync.

Specifically, I started the zero process in the wrong directory, so it created a new zw directory. When the alpha, whose existing (correct) p and w directories still held metadata about the old zero, tried to connect to that zero, this error was returned.

To fix it, I had to reset the metadata by deleting the zero's zw directory and the alpha's w directory (leaving just the p dir), and then manually set the max timestamp and UID via curl:

curl "localhost:6080/assign?what=uids&num=2000000"
curl "localhost:6080/assign?what=timestamps&num=2000000"

For my DB, 2000000 was higher than both the largest UID already in use and the max timestamp already assigned. If you don't know a safe number, there are ways to get these values from the p directory using dgraph debug.
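Another option, before you delete the old zw directory, is to read the ceilings the existing Zero has already handed out from its /state endpoint and pick numbers comfortably above them (a sketch; the field names match the v20.11 log output earlier in this thread and may differ across versions, and jq is assumed to be installed):

curl -s localhost:6080/state | jq '{maxLeaseId, maxTxnTs}'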

Full steps (a consolidated shell sketch follows the list):

  • shut down alpha(s) and zero(s).
  • delete everything but the p dir. (this may lose some in-flight transaction info)
  • start a zero (it will create an empty zw dir, with timestamp and maxUID set to 0 or a min value)
  • run the two curl commands (plus one more if you use namespaces to set max namespace id)
  • start the alpha
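Put together, and assuming a single local zero and alpha with default ports and data directories (zw, w, p) in the current working directory, the sequence looks roughly like this (a sketch, not an exact transcript):

# 1. stop all alpha and zero processes first
# 2. keep only the posting directory
rm -rf zw w
# 3. start a fresh zero (creates an empty zw with the counters at their minimums)
dgraph zero --my=localhost:5080 &
# 4. bump the UID and timestamp leases past anything already used in p
curl "localhost:6080/assign?what=uids&num=2000000"
curl "localhost:6080/assign?what=timestamps&num=2000000"
# 5. start the alpha against the existing p directory
dgraph alpha --my=localhost:7080 --zero localhost:5080 &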