Power outage and now the cluster didn't start again

mko0815 · February 21, 2023, 6:46pm

Hi,

I’m running a cluster on my locale machine. the setup is 3 alpha and 1 zero.

I have the issue that after a power outage, my cluster didn’t start up.

See the error log below.

I already followed the recommendation in the past the clean the alpha directory. This worked.
But at other times I used the same directories while upgrading dgraph version.
This time the cluster was running for 2 months, without any upgrades and without any issues.
Until now.

As there was no version upgrade, just a power outage, which means a hard shutdown. how to recover from this without deleting all my data again?

Dgraph version   : v22.0.2
Dgraph codename  : dgraph
Dgraph SHA-256   : a11258bf3352eff0521bc68983a5aedb9316414947719920d75f12143dd368bd
Commit SHA-1     : 55697a4
Commit timestamp : 2022-12-16 23:03:35 +0000
Branch           : release/v22.0.2
Go version       : go1.18.5
jemalloc enabled : true
For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud   , visit https://dgraph.io/cloud.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2022 Dgraph Labs, Inc.
I0221 18:36:36.591444      15 run.go:754] x.Config: {PortOffset:0 Limit:mutations=allow; normalize-node=10000; disallow-drop=false; query-timeout=0ms; query-edge=1000000; mutations-nquad=1000000; txn-abort-after=5m; max-retries=-1; max-pending-queries=10000 LimitMutationsNquad:1000000 LimitQueryEdge:1000000 BlockClusterWideDrop:false LimitNormalizeNode:10000 QueryTimeout:0s MaxRetries:-1 GraphQL:introspection=true; debug=false; extensions=true; poll-interval=1s; lambda-url= GraphQLDebug:false}
[Sentry] 2023/02/21 18:36:32 Sending fatal event [57586dcf4137421587f662673d10ae2e] to o318308.ingest.sentry.io project: 1805390
[Sentry] 2023/02/21 18:36:32 Too many requests, backing off till: 2023-02-21 18:36:51.15722936 +0000 UTC m=+20.210434431
[Sentry] 2023/02/21 18:36:32 Buffer flushed successfully.
[Sentry] 2023/02/21 18:36:35 Integration installed: ContextifyFrames
[Sentry] 2023/02/21 18:36:35 Integration installed: Environment
[Sentry] 2023/02/21 18:36:35 Integration installed: Modules
[Sentry] 2023/02/21 18:36:35 Integration installed: IgnoreErrors
[Sentry] 2023/02/21 18:36:36 Integration installed: ContextifyFrames
[Sentry] 2023/02/21 18:36:36 Integration installed: Environment
[Sentry] 2023/02/21 18:36:36 Integration installed: Modules
[Sentry] 2023/02/21 18:36:36 Integration installed: IgnoreErrors
I0221 18:36:36.591489      15 run.go:755] x.WorkerConfig: {TmpDir:t ExportPath:export Trace:ratio=0.01; jaeger=; datadog= MyAddr:alpha2:7080 ZeroAddr:[zero:5080] TLSClientConfig:<nil> TLSServerConfig:<nil> Raft:snapshot-after-entries=10000; snapshot-after-duration=30m; pending-proposals=256; idx=; group=; learner=false Badger:{Dir: ValueDir: SyncWrites:false NumVersionsToKeep:1 ReadOnly:false Logger:0xc00037a740 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:true NamespaceOffset:-1 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0} WhiteListedIPRanges:[{Lower:10.0.0.0 Upper:10.255.255.255} {Lower:172.0.0.0 Upper:172.255.255.255} {Lower:192.168.0.0 Upper:192.168.255.255}] StrictMutations:false AclEnabled:false HmacSecret:**** AbortOlderThan:5m0s ProposedGroupId:0 StartTime:2023-02-21 18:36:36.190396318 +0000 UTC m=+0.151868611 Ludicrous:enabled=false; concurrency=2000 LudicrousEnabled:false Security:whitelist=10.0.0.0/8,172.0.0.0/8,192.168.0.0/16; token= EncryptionKey:**** LogDQLRequest:0 HardSync:false Audit:false}
I0221 18:36:36.591554      15 run.go:756] worker.Config: {PostingDir:p WALDir:w MutationsMode:0 AuthToken: HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s CachePercentage:0,65,35 CacheMb:1024 Audit:<nil> ChangeDataConf:file=; kafka=; sasl_user=; sasl_password=; ca_cert=; client_cert=; client_key=; sasl-mechanism=PLAIN;}
I0221 18:36:36.594677      15 log.go:295] Found file: 1 First Index: 1
I0221 18:36:36.595466      15 storage.go:125] Init Raft Storage with snap: 150, first: 151, last: 6885
I0221 18:36:36.595968      15 server_state.go:140] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x3113340 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false NamespaceOffset:1 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0}
I0221 18:36:36.616155      15 log.go:34] All 5 tables opened in 2ms
I0221 18:36:36.617050      15 log.go:34] Discard stats nextEmptySlot: 0
I0221 18:36:36.617105      15 log.go:34] Set nextTxnTs to 230127
I0221 18:36:36.620390      15 log.go:34] Deleting empty file: p/000058.vlog
I0221 18:36:36.624162      15 groups.go:100] Current Raft Id: 0x3
I0221 18:36:36.624241      15 groups.go:116] Sending member request to Zero: id:3 addr:"alpha2:7080" 
I0221 18:36:36.624363      15 worker.go:114] Worker listening at address: [::]:7080
I0221 18:36:36.625050      15 run.go:567] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0221 18:36:36.625060      15 run.go:568] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0221 18:36:36.625066      15 run.go:595] gRPC server started.  Listening on port 9080
I0221 18:36:36.625070      15 run.go:596] HTTP server started.  Listening on port 8080
E0221 18:36:36.625087      15 groups.go:1224] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0221 18:36:36.725500      15 pool.go:164] CONN: Connecting to zero:5080
I0221 18:36:36.728163      15 groups.go:135] Connected to group zero. Assigned group: 0
I0221 18:36:36.728190      15 groups.go:137] Raft Id after connection to Zero: 0x3
I0221 18:36:36.728243      15 pool.go:164] CONN: Connecting to alpha1:7080
I0221 18:36:36.728263      15 draft.go:270] Node ID: 0x3 with GroupID: 1
I0221 18:36:36.728267      15 draft.go:279] RaftContext: id:3 group:1 addr:"alpha2:7080" 
I0221 18:36:36.728315      15 node.go:153] Setting raft.Config to: &{ID:3 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00044e700 Applied:150 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x3113340 DisableProposalForwarding:false}
I0221 18:36:36.732598      15 node.go:312] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:150 Term:5 XXX_unrecognized:[]}
I0221 18:36:36.732674      15 node.go:323] Found hardstate: {Term:30 Vote:3 Commit:6886 XXX_unrecognized:[]}
I0221 18:36:36.736962      15 node.go:328] Group 1 found 6885 entries
I0221 18:36:36.737011      15 draft.go:1828] Restarting node for group: 1
I0221 18:36:36.737031      15 node.go:190] Setting conf state to nodes:1 nodes:2 nodes:3 
2023/02/21 18:36:36 3 state.commit 6886 is out of range [150, 6885]
panic: 3 state.commit 6886 is out of range [150, 6885]
goroutine 203 [running]:
log.Panicf({0x1f5c0b6?, 0xaf3305?}, {0xc0004b1580?, 0x1c949c0?, 0x7fc2dcb3bf00?})
/opt/hostedtoolcache/go/1.18.5/x64/src/log/log.go:392 +0x67
github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x1ae5?, {0x1f5c0b6?, 0xaf9ff4?}, {0xc0004b1580?, 0xaf5a87?, 0x40?})
/home/runner/work/dgraph/dgraph/x/log.go:40 +0x32
go.etcd.io/etcd/raft.(*raft).loadState(0xc000376780, {0x1e, 0x3, 0x1ae6, {0x0, 0x0, 0x0}})
/home/runner/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20190108173120-83c051b701d3/raft/raft.go:1475 +0x19a
go.etcd.io/etcd/raft.newRaft(0xc000336f20)
/home/runner/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20190108173120-83c051b701d3/raft/raft.go:377 +0x791
go.etcd.io/etcd/raft.RestartNode(0xc000336f20)
/home/runner/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20190108173120-83c051b701d3/raft/node.go:242 +0x33
github.com/dgraph-io/dgraph/worker.(*node).InitAndStartNode(0xc000422500)
/home/runner/work/dgraph/dgraph/worker/draft.go:1854 +0x6f2
github.com/dgraph-io/dgraph/worker.StartRaftNodes(0xc00044e700, 0x1?)
/home/runner/work/dgraph/dgraph/worker/groups.go:155 +0x635
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run.func4()
/home/runner/work/dgraph/dgraph/dgraph/cmd/alpha/run.go:814 +0x34
created by github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
/home/runner/work/dgraph/dgraph/dgraph/cmd/alpha/run.go:813 +0x14b1
W0221 18:36:36.747097       1 sentry_integration.go:140] unable to read CID from file /tmp/dgraph-alpha-cid-sentry open /tmp/dgraph-alpha-cid-sentry: no such file or directory. Skip
[Sentry] 2023/02/21 18:36:36 Sending fatal event [f5a6939fde6a49c28445d4dd5a95bcb8] to o318308.ingest.sentry.io project: 1805390
[Sentry] 2023/02/21 18:36:36 Buffer flushed successfully.

MichelDiz · February 21, 2023, 10:50pm

Just downgrade to your previous version. BTW, if you are using the tag “latest” in your docker compose or something. Remove it now, it is not recommended to prod.

In general you have to export your data if you jump for example from 21.0.1 to 22.0.2 if it is a minor it won’t hurt if you upgrade, but medium or major upgrades you have to export and reimport.

mko0815 · February 22, 2023, 5:15am

sorry if I misspelled,

I had the issue without any version change of dgraph.
I started version 22.0.2 with a completely clean/empty set.

all worked fine, and then after the outage, I got the error above. The version was untouched.

amanmangal · February 22, 2023, 5:23am

@mko0815 will it be possible for you to share one of the p directories if it is not proprietary, we will be happy to take a look? You could share it on aman@dgraph.io for now.

mko0815 · February 22, 2023, 6:15am

I send you an email,
thank you.

MichelDiz · February 22, 2023, 8:04pm

This implies that there is a conflict, there is some node stating that the transaction is in a state X, but another node is with another different state of Commit from that known. This is certainly due to the reuse of the same PATH containing files from another cluster executed previously there. That is, it created a new cluster without proper cleaning.

So if this happened because of an outage. It really can be a problem for Dgraph, that may not handled well with the event.

mko0815 · February 23, 2023, 11:02am

all 3 nodes have this error.
No node was able to start after the power outage.

mko0815 · March 13, 2023, 9:13am

I now have the issue again.

I upgraded on Friday to 23-beta-1 all worked fine.
Then during the weekend, there was a power outage again, and now 2 out of 3 alphas didn’t start because of the same error.

MichelDiz · March 14, 2023, 5:22pm

Can you open an ticket ar Sign in to GitHub · GitHub?
So I can assign someone to take a look.

mko0815 · March 15, 2023, 6:26pm

created a ticket:

github.com/dgraph-io/dgraph

[BUG]: after power outage alphas don't start anymore

opened 06:26PM - 15 Mar 23 UTC

fmt-Println-MKO

kind/bug community

### What version of Dgraph are you using? happens in: v22.0.2 and 23-beta-1 #…## Tell us a little more about your go-environment? Go version : go1.18.5 jemalloc enabled : true ### Have you tried reproducing the issue with the latest release? Yes ### What is the hardware spec (RAM, CPU, OS)? Docker, 4 vCPU, 8 GB Ram ### What steps will reproduce the bug? Start a local cluster with 3 alphas and 1 zero in docker. add a schema and data. plug the power of while the system is running. sometimes, 1 alpha sometimes 2 alphas or it also happens that all 3 alphas can't start now: `2023/02/21 18:36:36 3 state.commit 6886 is out of range [150, 6885] panic: 3 state.commit 6886 is out of range [150, 6885]` ### Expected behavior and actual result. if an alpha is out of sync it should sync itself ### Additional information more details are discussed in the chat here: https://discuss.dgraph.io/t/power-outage-and-now-the-cluster-didnt-start-again/18314are

Topic		Replies	Views
Bug after powerloss still happening Dgraph	0	210	November 15, 2023
Dgraph Alpha Node unresponsive Dgraph	10	1089	September 10, 2022
[disaster recovery] Cluster unable to recover after crash during intensive writing operations Dgraph dgraph , status:accepted , area:performance	12	1263	November 17, 2020
Production instance is taking entire load for cluster Users	8	684	November 21, 2019
Disaster recovery Dgraph dgraph	1	426	February 25, 2021

Power outage and now the cluster didn't start again

Related topics