Power outage and now the cluster doesn't start again

Hi,

I’m running a cluster on my local machine. The setup is 3 Alphas and 1 Zero.

After a power outage, my cluster didn’t start up again.

See the error log below.

In the past I already followed the recommendation to clean the Alpha directory, and that worked.
At other times I reused the same directories while upgrading the Dgraph version.
This time the cluster had been running for 2 months, without any upgrades and without any issues.
Until now.

As there was no version upgrade, just a power outage (which means a hard shutdown), how do I recover from this without deleting all my data again?

Dgraph version   : v22.0.2
Dgraph codename  : dgraph
Dgraph SHA-256   : a11258bf3352eff0521bc68983a5aedb9316414947719920d75f12143dd368bd
Commit SHA-1     : 55697a4
Commit timestamp : 2022-12-16 23:03:35 +0000
Branch           : release/v22.0.2
Go version       : go1.18.5
jemalloc enabled : true
For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud   , visit https://dgraph.io/cloud.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2022 Dgraph Labs, Inc.
I0221 18:36:36.591444      15 run.go:754] x.Config: {PortOffset:0 Limit:mutations=allow; normalize-node=10000; disallow-drop=false; query-timeout=0ms; query-edge=1000000; mutations-nquad=1000000; txn-abort-after=5m; max-retries=-1; max-pending-queries=10000 LimitMutationsNquad:1000000 LimitQueryEdge:1000000 BlockClusterWideDrop:false LimitNormalizeNode:10000 QueryTimeout:0s MaxRetries:-1 GraphQL:introspection=true; debug=false; extensions=true; poll-interval=1s; lambda-url= GraphQLDebug:false}
[Sentry] 2023/02/21 18:36:32 Sending fatal event [57586dcf4137421587f662673d10ae2e] to o318308.ingest.sentry.io project: 1805390
[Sentry] 2023/02/21 18:36:32 Too many requests, backing off till: 2023-02-21 18:36:51.15722936 +0000 UTC m=+20.210434431
[Sentry] 2023/02/21 18:36:32 Buffer flushed successfully.
[Sentry] 2023/02/21 18:36:35 Integration installed: ContextifyFrames
[Sentry] 2023/02/21 18:36:35 Integration installed: Environment
[Sentry] 2023/02/21 18:36:35 Integration installed: Modules
[Sentry] 2023/02/21 18:36:35 Integration installed: IgnoreErrors
[Sentry] 2023/02/21 18:36:36 Integration installed: ContextifyFrames
[Sentry] 2023/02/21 18:36:36 Integration installed: Environment
[Sentry] 2023/02/21 18:36:36 Integration installed: Modules
[Sentry] 2023/02/21 18:36:36 Integration installed: IgnoreErrors
I0221 18:36:36.591489      15 run.go:755] x.WorkerConfig: {TmpDir:t ExportPath:export Trace:ratio=0.01; jaeger=; datadog= MyAddr:alpha2:7080 ZeroAddr:[zero:5080] TLSClientConfig:<nil> TLSServerConfig:<nil> Raft:snapshot-after-entries=10000; snapshot-after-duration=30m; pending-proposals=256; idx=; group=; learner=false Badger:{Dir: ValueDir: SyncWrites:false NumVersionsToKeep:1 ReadOnly:false Logger:0xc00037a740 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:true NamespaceOffset:-1 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0} WhiteListedIPRanges:[{Lower:10.0.0.0 Upper:10.255.255.255} {Lower:172.0.0.0 Upper:172.255.255.255} {Lower:192.168.0.0 Upper:192.168.255.255}] StrictMutations:false AclEnabled:false HmacSecret:**** AbortOlderThan:5m0s ProposedGroupId:0 StartTime:2023-02-21 18:36:36.190396318 +0000 UTC m=+0.151868611 Ludicrous:enabled=false; concurrency=2000 LudicrousEnabled:false Security:whitelist=10.0.0.0/8,172.0.0.0/8,192.168.0.0/16; token= EncryptionKey:**** LogDQLRequest:0 HardSync:false Audit:false}
I0221 18:36:36.591554      15 run.go:756] worker.Config: {PostingDir:p WALDir:w MutationsMode:0 AuthToken: HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s CachePercentage:0,65,35 CacheMb:1024 Audit:<nil> ChangeDataConf:file=; kafka=; sasl_user=; sasl_password=; ca_cert=; client_cert=; client_key=; sasl-mechanism=PLAIN;}
I0221 18:36:36.594677      15 log.go:295] Found file: 1 First Index: 1
I0221 18:36:36.595466      15 storage.go:125] Init Raft Storage with snap: 150, first: 151, last: 6885
I0221 18:36:36.595968      15 server_state.go:140] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x3113340 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false NamespaceOffset:1 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0}
I0221 18:36:36.616155      15 log.go:34] All 5 tables opened in 2ms
I0221 18:36:36.617050      15 log.go:34] Discard stats nextEmptySlot: 0
I0221 18:36:36.617105      15 log.go:34] Set nextTxnTs to 230127
I0221 18:36:36.620390      15 log.go:34] Deleting empty file: p/000058.vlog
I0221 18:36:36.624162      15 groups.go:100] Current Raft Id: 0x3
I0221 18:36:36.624241      15 groups.go:116] Sending member request to Zero: id:3 addr:"alpha2:7080" 
I0221 18:36:36.624363      15 worker.go:114] Worker listening at address: [::]:7080
I0221 18:36:36.625050      15 run.go:567] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0221 18:36:36.625060      15 run.go:568] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0221 18:36:36.625066      15 run.go:595] gRPC server started.  Listening on port 9080
I0221 18:36:36.625070      15 run.go:596] HTTP server started.  Listening on port 8080
E0221 18:36:36.625087      15 groups.go:1224] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0221 18:36:36.725500      15 pool.go:164] CONN: Connecting to zero:5080
I0221 18:36:36.728163      15 groups.go:135] Connected to group zero. Assigned group: 0
I0221 18:36:36.728190      15 groups.go:137] Raft Id after connection to Zero: 0x3
I0221 18:36:36.728243      15 pool.go:164] CONN: Connecting to alpha1:7080
I0221 18:36:36.728263      15 draft.go:270] Node ID: 0x3 with GroupID: 1
I0221 18:36:36.728267      15 draft.go:279] RaftContext: id:3 group:1 addr:"alpha2:7080" 
I0221 18:36:36.728315      15 node.go:153] Setting raft.Config to: &{ID:3 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00044e700 Applied:150 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x3113340 DisableProposalForwarding:false}
I0221 18:36:36.732598      15 node.go:312] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:150 Term:5 XXX_unrecognized:[]}
I0221 18:36:36.732674      15 node.go:323] Found hardstate: {Term:30 Vote:3 Commit:6886 XXX_unrecognized:[]}
I0221 18:36:36.736962      15 node.go:328] Group 1 found 6885 entries
I0221 18:36:36.737011      15 draft.go:1828] Restarting node for group: 1
I0221 18:36:36.737031      15 node.go:190] Setting conf state to nodes:1 nodes:2 nodes:3 
2023/02/21 18:36:36 3 state.commit 6886 is out of range [150, 6885]
panic: 3 state.commit 6886 is out of range [150, 6885]
goroutine 203 [running]:
log.Panicf({0x1f5c0b6?, 0xaf3305?}, {0xc0004b1580?, 0x1c949c0?, 0x7fc2dcb3bf00?})
/opt/hostedtoolcache/go/1.18.5/x64/src/log/log.go:392 +0x67
github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x1ae5?, {0x1f5c0b6?, 0xaf9ff4?}, {0xc0004b1580?, 0xaf5a87?, 0x40?})
/home/runner/work/dgraph/dgraph/x/log.go:40 +0x32
go.etcd.io/etcd/raft.(*raft).loadState(0xc000376780, {0x1e, 0x3, 0x1ae6, {0x0, 0x0, 0x0}})
/home/runner/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20190108173120-83c051b701d3/raft/raft.go:1475 +0x19a
go.etcd.io/etcd/raft.newRaft(0xc000336f20)
/home/runner/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20190108173120-83c051b701d3/raft/raft.go:377 +0x791
go.etcd.io/etcd/raft.RestartNode(0xc000336f20)
/home/runner/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20190108173120-83c051b701d3/raft/node.go:242 +0x33
github.com/dgraph-io/dgraph/worker.(*node).InitAndStartNode(0xc000422500)
/home/runner/work/dgraph/dgraph/worker/draft.go:1854 +0x6f2
github.com/dgraph-io/dgraph/worker.StartRaftNodes(0xc00044e700, 0x1?)
/home/runner/work/dgraph/dgraph/worker/groups.go:155 +0x635
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run.func4()
/home/runner/work/dgraph/dgraph/dgraph/cmd/alpha/run.go:814 +0x34
created by github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
/home/runner/work/dgraph/dgraph/dgraph/cmd/alpha/run.go:813 +0x14b1
W0221 18:36:36.747097       1 sentry_integration.go:140] unable to read CID from file /tmp/dgraph-alpha-cid-sentry open /tmp/dgraph-alpha-cid-sentry: no such file or directory. Skip
[Sentry] 2023/02/21 18:36:36 Sending fatal event [f5a6939fde6a49c28445d4dd5a95bcb8] to o318308.ingest.sentry.io project: 1805390
[Sentry] 2023/02/21 18:36:36 Buffer flushed successfully.

Just downgrade to your previous version. By the way, if you are using the tag “latest” in your Docker Compose file or similar, remove it now; it is not recommended for production.

In general you have to export your data if you jump, for example, from 21.0.1 to 22.0.2. If it is a minor release it won’t hurt to upgrade in place, but for medium or major upgrades you have to export and reimport.
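For reference, a full export can be triggered through the /admin GraphQL endpoint before upgrading. A sketch based on the documented `export` mutation (double-check the docs for your Dgraph version):

```graphql
mutation {
  # Export all data from the cluster; format can be "rdf" or "json".
  export(input: { format: "rdf" }) {
    response {
      message
      code
    }
  }
}
```

Send it to the Alpha’s HTTP port (8080 by default) at `/admin`; the exported files land in the Alpha’s export directory.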

Sorry for any misspellings.

I had the issue without any version change of Dgraph.
I started version 22.0.2 with a completely clean/empty set.

All worked fine, and then after the outage I got the error above. The version was untouched.

@mko0815, would it be possible for you to share one of the p directories, if it is not proprietary? We would be happy to take a look. You could share it at aman@dgraph.io for now.

I sent you an email.
Thank you.


This implies that there is a conflict: one node states that the Raft state is X, while another node has a different Commit state than the one known. This is usually caused by reusing the same path containing files from another cluster that was executed there previously. That is, a new cluster was created without proper cleaning.

But if this happened purely because of an outage, it really could be a problem in Dgraph, which may not have handled the event well.
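For context, the panic comes from a sanity check in etcd’s raft library when it reloads the persisted HardState: the commit index recorded there must lie within the range of entries the write-ahead log can actually serve. A simplified sketch of that invariant (not the actual etcd code), plugged with the numbers from the log above:

```python
# Simplified sketch of the sanity check that panics in etcd's raft
# library when loading the persisted HardState. This is NOT the real
# etcd code; it only illustrates the invariant that was violated here.

def check_hard_state(commit: int, first_index: int, last_index: int) -> None:
    """The persisted commit index must lie within the range of entries
    the Raft WAL can serve: [first_index - 1, last_index]."""
    lo, hi = first_index - 1, last_index
    if not (lo <= commit <= hi):
        raise RuntimeError(f"state.commit {commit} is out of range [{lo}, {hi}]")

# Numbers from the log above: snapshot at index 150, WAL entries
# 151..6885, but the HardState recorded Commit=6886. The committed
# entry itself never reached disk before the hard shutdown.
try:
    check_hard_state(commit=6886, first_index=151, last_index=6885)
except RuntimeError as e:
    print(e)  # state.commit 6886 is out of range [150, 6885]
```

In other words, the HardState file and the entry log got out of sync on disk, which is consistent with an unclean shutdown rather than directory reuse.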

All 3 nodes have this error.
No node was able to start after the power outage.

I now have the issue again.

I upgraded on Friday to 23-beta-1 and all worked fine.
Then during the weekend there was a power outage again, and now 2 out of 3 Alphas didn’t start because of the same error.

Can you open a ticket on GitHub? Then I can assign someone to take a look.

Created a ticket: