Dgraph has been running stably on GKE for the last six weeks with the following config:
```yaml
image:
  tag: "v22.0.2"
alpha:
  replicaCount: 9
  persistence:
    size: 250Gi
    storageClassName: premium-rwo # Doesn't work. Have to do patch. See setup docs.
  service:
    type: LoadBalancer
    annotations:
      cloud.google.com/load-balancer-type: "Internal"
zero:
  replicaCount: 3
  shardReplicaCount: 3
  persistence:
    size: 32Gi
    storageClassName: premium-rwo # Doesn't work. Have to do patch. See setup docs.
ratel:
  service:
    type: LoadBalancer
    annotations:
      cloud.google.com/load-balancer-type: "Internal"
```
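As an aside on the `storageClassName` issue: since the chart value isn't honored and we patch manually, a quick sanity check that the PVCs actually landed on `premium-rwo` looks like this (namespace taken from the logs below; adjust for your release):

```sh
# List each Dgraph PVC with the StorageClass it was actually bound with.
kubectl get pvc -n xyz \
  -o custom-columns='NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,CAPACITY:.status.capacity.storage'
```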
However, two days ago we saw the following panic on one of the Alphas:

```
I0301 09:40:26.686947 23 log.go:34] All 3153 tables opened in 919ms
I0301 09:40:26.691624 23 log.go:34] Discard stats nextEmptySlot: 0
I0301 09:40:26.694320 23 log.go:34] Set nextTxnTs to 24405663
I0301 09:40:26.704901 23 log.go:34] Deleting empty file: p/000230.vlog
I0301 09:40:26.705919 23 groups.go:100] Current Raft Id: 0x2
I0301 09:40:26.705940 23 groups.go:116] Sending member request to Zero: id:2 addr:"xyz-prod-dgraph-alpha-0.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080"
I0301 09:40:26.706101 23 worker.go:114] Worker listening at address: [::]:7080
I0301 09:40:26.707415 23 run.go:567] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0301 09:40:26.707433 23 run.go:568] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0301 09:40:26.707443 23 run.go:595] gRPC server started. Listening on port 9080
E0301 09:40:26.707438 23 groups.go:1224] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0301 09:40:26.707451 23 run.go:596] HTTP server started. Listening on port 8080
I0301 09:40:26.807046 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-zero-0.xyz-prod-dgraph-zero-headless.xyz.svc.cluster.local:5080
I0301 09:40:26.814126 23 groups.go:135] Connected to group zero. Assigned group: 0
I0301 09:40:26.814148 23 groups.go:137] Raft Id after connection to Zero: 0x2
I0301 09:40:26.814200 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-alpha-6.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080
I0301 09:40:26.814215 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-alpha-8.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080
I0301 09:40:26.814242 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-alpha-7.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080
I0301 09:40:26.814280 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-alpha-4.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080
I0301 09:40:26.814303 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-alpha-3.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080
I0301 09:40:26.814341 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-alpha-5.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080
I0301 09:40:26.814379 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-alpha-2.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080
I0301 09:40:26.814409 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-alpha-1.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080
I0301 09:40:26.814498 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-zero-2.xyz-prod-dgraph-zero-headless.xyz.svc.cluster.local:5080
I0301 09:40:26.814526 23 pool.go:164] CONN: Connecting to xyz-prod-dgraph-zero-1.xyz-prod-dgraph-zero-headless.xyz.svc.cluster.local:5080
I0301 09:40:26.814574 23 draft.go:270] Node ID: 0x2 with GroupID: 1
I0301 09:40:26.814583 23 draft.go:279] RaftContext: id:2 group:1 addr:"xyz-prod-dgraph-alpha-0.xyz-prod-dgraph-alpha-headless.xyz.svc.cluster.local:7080"
I0301 09:40:26.814671 23 node.go:153] Setting raft.Config to: &{ID:2 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00045aac0 Applied:22163404 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x3113340 DisableProposalForwarding:false}
I0301 09:40:26.920997 23 node.go:312] Found Snapshot.Metadata: {ConfState:{Nodes:[2 3 4] Learners:[] XXX_unrecognized:[]} Index:22163404 Term:91 XXX_unrecognized:[]}
I0301 09:40:26.921030 23 node.go:323] Found hardstate: {Term:93 Vote:3 Commit:22171269 XXX_unrecognized:[]}
I0301 09:40:27.023525 23 node.go:328] Group 1 found 15827 entries
I0301 09:40:27.023564 23 draft.go:1828] Restarting node for group: 1
I0301 09:40:27.023573 23 node.go:190] Setting conf state to nodes:2 nodes:3 nodes:4
2023/03/01 09:40:27 2 state.commit 22171269 is out of range [22163404, 22171268]
panic: 2 state.commit 22171269 is out of range [22163404, 22171268]
goroutine 3326 [running]:
log.Panicf({0x1f5c0b6?, 0xaf3305?}, {0xc00066ca40?, 0x1c949c0?, 0x7fcc51dbfa00?})
```
This happened for the first time two days ago. If we're reading the panic correctly, the Raft HardState records a commit index (22171269) that is one past the last entry actually present in the on-disk log ([22163404, 22171268]), so Raft refuses to start; in other words, the tail of the WAL appears to be missing. We solved it in the end with what feels like a very hacky solution: deleting the `w` directory from the PVC attached to the failing Alpha, after backing it up, inspecting the files, and making peace with any potential data loss (see the sketch below).
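For reference, the surgery itself looked roughly like this. This is a sketch run from a scratch pod with the failing Alpha's PVC mounted at `/pvc` (the pod/PVC wiring is omitted; the Alpha itself was crash-looping, so you can't just `kubectl exec` into it):

```sh
# Inside a scratch pod with the failing Alpha's PVC mounted at /pvc.
cd /pvc
tar czf /tmp/w-backup.tgz w   # keep a copy somewhere durable before touching anything
rm -rf w                      # in our case the Alpha recreated it on the next start and caught up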
A quick snapshot of the `w` folder for the failing Alpha:

```
/pvc/w # ls -l
total 137664
-rw-r--r-- 1 *** *** 268435456 Feb 28 19:35 00930.wal
-rw-r--r-- 1 *** *** 1048576 Mar 1 09:45 wal.meta
```
A few questions:

a) Could someone explain more of the theory of what's going on here?
b) What kind of data loss would one expect from deleting `w`?
c) Is there any way to read the WAL files? When I inspect a WAL file I can see what objects it references, but the data itself is gibberish.
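For (c), in case it helps frame the question: I believe the `dgraph` binary's `debug` subcommand can also dump a Raft WAL directory (flag name from memory, so please verify against `dgraph debug --help` on v22), and my understanding is that entry payloads are protobuf-encoded proposals rather than plain text, which would explain the gibberish:

```sh
# Work on a copy of the WAL, never the live PVC.
# The --wal flag is from memory -- check `dgraph debug --help` for your version.
dgraph debug --wal ./w-backup
```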