Dgraph fails to start on restarts with Kind (Kubernetes)

Report a Dgraph Bug

What version of Dgraph are you using?

v20.07.1

What is the hardware spec (RAM, OS)?

Ubuntu 20.04, 32 GB RAM

Steps to reproduce the issue (command/config used to run Dgraph).

  1. Create a Kubernetes cluster locally using kind.
  2. Deploy Dgraph (1 Zero, 1 Alpha, 1 Ratel) with the Helm chart, using kind's built-in standard storage class, which uses Rancher's local-path provisioner (https://github.com/rancher/local-path-provisioner) with ReadWriteOnce:
helm upgrade --install --namespace <namespace> dgraph -f <path>/values.yaml  --set zero.replicaCount=1 --set alpha.replicaCount=1 --set alpha.persistence.size=200Mi --set zero.persistence.size=200Mi --set alpha.persistence.accessModes={ReadWriteOnce} --set zero.persistence.accessModes={ReadWriteOnce} --set zero.persistence.storageClass=standard --set alpha.persistence.storageClass=standard ./charts/dgraph
  3. Verify that everything works.
  4. Restart the cluster by restarting Docker: sudo service docker restart
  5. Dgraph fails to start with the errors below (the pods keep running, though):

Alpha:

E1023 15:47:59.946666      19 groups.go:1137] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
E1023 15:48:00.946904      19 groups.go:1137] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I1023 15:48:01.144411      19 admin.go:593] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
E1023 15:48:01.947163      19 groups.go:1137] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
(the SubscribeForUpdates error repeats every second, and the admin.go "server is not ready" message every five seconds)

Zero:

I1023 15:35:19.507573      18 run.go:127] Setting up grpc listener at: 0.0.0.0:5080
I1023 15:35:19.508051      18 run.go:127] Setting up http listener at: 0.0.0.0:6080
I1023 15:35:19.508353      18 run.go:298] Opening zero BadgerDB with options: {Dir:zw ValueDir:zw SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0xc00007bfa0 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:false BlockCacheSize:0 IndexCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:15 LevelOneSize:268435456 ValueLogFileSize:67108864 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:true managedTxns:false maxBatchCount:0 maxBatchSize:0}
badger 2020/10/23 15:35:19 INFO: All 1 tables opened in 74ms
badger 2020/10/23 15:35:20 INFO: Replaying file id: 0 at offset: 21571
badger 2020/10/23 15:35:20 INFO: Replay took: 12.285µs
I1023 15:35:20.190879      18 node.go:149] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc0003c4440 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2be83e8 DisableProposalForwarding:false}
I1023 15:35:20.191281      18 node.go:318] Found hardstate: {Term:2 Vote:1 Commit:132 XXX_unrecognized:[]}
I1023 15:35:20.191555      18 node.go:327] Group 0 found 132 entries
I1023 15:35:20.191575      18 raft.go:448] Restarting node for dgraphzero
I1023 15:35:20.191835      18 log.go:34] 1 became follower at term 2
I1023 15:35:20.192309      18 log.go:34] newRaft 1 [peers: [], term: 2, commit: 132, applied: 0, lastindex: 132, lastterm: 2]
I1023 15:35:20.192522      18 run.go:374] Running Dgraph Zero...
I1023 15:35:20.505187      18 node.go:186] Setting conf state to nodes:1 
I1023 15:35:20.505312      18 raft.go:708] Done applying conf change at 0x1
I1023 15:35:20.505627      18 pool.go:160] CONNECTING to dgraph-dgraph-alpha-0.dgraph-dgraph-alpha-headless.db.svc.cluster.local:7080
W1023 15:35:20.505659      18 pool.go:191] Shutting down extra connection to dgraph-dgraph-alpha-0.dgraph-dgraph-alpha-headless.db.svc.cluster.local:7080
W1023 15:35:20.509770      18 raft.go:739] Raft.Ready took too long to process: Timer Total: 316ms. Breakdown: [{sync 225ms} {disk 86ms} {proposals 5ms} {advance 0s}]. Num entries: 0. MustSync: true
I1023 15:35:21.193543      18 log.go:34] 1 no leader at term 2; dropping index reading msg
W1023 15:35:23.193596      18 node.go:675] [0x1] Read index context timed out
I1023 15:35:23.193654      18 log.go:34] 1 no leader at term 2; dropping index reading msg
I1023 15:35:23.893708      18 log.go:34] 1 is starting a new election at term 2
I1023 15:35:23.893752      18 log.go:34] 1 became pre-candidate at term 2
I1023 15:35:23.893761      18 log.go:34] 1 received MsgPreVoteResp from 1 at term 2
I1023 15:35:23.893978      18 log.go:34] 1 became candidate at term 3
I1023 15:35:23.893997      18 log.go:34] 1 received MsgVoteResp from 1 at term 3
I1023 15:35:23.894327      18 log.go:34] 1 became leader at term 3
I1023 15:35:23.894358      18 log.go:34] raft.node: 1 elected leader 1 at term 3
I1023 15:35:23.894424      18 raft.go:673] I've become the leader, updating leases.
I1023 15:35:23.894444      18 assign.go:42] Updated Lease id: 10001. Txn Ts: 10001
W1023 15:35:25.193693      18 node.go:675] [0x1] Read index context timed out
W1023 15:35:25.864210      18 pool.go:254] Connection lost with dgraph-dgraph-alpha-0.dgraph-dgraph-alpha-headless.db.svc.cluster.local:7080. Error: rpc error: code = Canceled desc = context canceled
W1023 15:35:25.865755      18 pool.go:254] Connection lost with dgraph-dgraph-alpha-0.dgraph-dgraph-alpha-headless.db.svc.cluster.local:7080. Error: rpc error: code = Unknown desc = No node has been set up yet

More info:

Name:               dgraph-dgraph-alpha
Namespace:          db
CreationTimestamp:  Fri, 23 Oct 2020 16:07:34 +0000
Selector:           app=dgraph,chart=dgraph-0.0.11,component=alpha,release=dgraph
Labels:             app=dgraph
                    app.kubernetes.io/managed-by=Helm
                    chart=dgraph-0.0.11
                    component=alpha
                    heritage=Helm
                    release=dgraph
Annotations:        meta.helm.sh/release-name: dgraph
                    meta.helm.sh/release-namespace: db
Replicas:           1 desired | 0 total
Update Strategy:    RollingUpdate
Pods Status:        0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       app=dgraph
                chart=dgraph-0.0.11
                component=alpha
                release=dgraph
  Annotations:  prometheus.io/path: /debug/prometheus_metrics
                prometheus.io/port: 8080
                prometheus.io/scrape: true
  Containers:
   dgraph-dgraph-alpha:
    Image:       docker.io/dgraph/dgraph:v20.07.1
    Ports:       7080/TCP, 8080/TCP, 9080/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      bash
      -c
      set -ex
      dgraph alpha --my=$(hostname -f):7080 --zero dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080
      
    Requests:
      memory:  100Mi
    Environment:
      POD_NAMESPACE:         (v1:metadata.namespace)
      DGRAPH_ALPHA_CONFIG:  /dgraph/config/config.toml
    Mounts:
      /dgraph from datadir (rw)
      /dgraph/config from config-volume (rw)
  Volumes:
   datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir
    ReadOnly:   false
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dgraph-dgraph-alpha-config
    Optional:  false
Volume Claims:
  Name:          datadir
  StorageClass:  standard
  Labels:        <none>
  Annotations:   volume.alpha.kubernetes.io/storage-class=anything
  Capacity:      200Mi
  Access Modes:  [ReadWriteOnce]
Events:          <none>

It works again only after I destroy the entire cluster and recreate it.

Just to validate that a normal pod restart works, I ran kubectl -n db rollout restart statefulset dgraph-dgraph-alpha and everything was fine.

Expected behaviour and actual result.

Dgraph should work even after restarts. This is important because I am running kind within a pre-emptible VM in GCP, which resets every 24 hours. Since I am not sure whether this is an issue with Dgraph or Kind, I have also filed an issue in the Kind repo: https://github.com/kubernetes-sigs/kind/issues/1909

This occurs every day and the error is consistently reproducible. Is there any other info you would like me to provide?

Looking at your situation (a preemptible GCE VM running a Dgraph cluster in KinD), here are some thoughts:

  • Restarting Docker simulates a destructive event roughly equivalent to restarting all of a GKE cluster's VM instances (GCE) at once. I haven't tested this scenario yet.
  • The big question is whether the data is saved to a volume that persists across such destructive events.
    • In the preemptible scenario, where Google reclaims the GCE resource, the data would be lost unless the whole chain holds: a Docker volume is allocated on an external disk that GCE mounts at startup, Docker uses that same volume when starting KinD, and the Dgraph pods re-attach to it (three layers: PVC/PV -> Docker volume -> GCE external disk).
    • Why is the persistent volume so important? Both Zero and the Alphas maintain cluster and membership state on disk, so if that data is lost the state cannot be recovered and the cluster ends up in a bad state. The graph data itself is also stored on disk by the Alphas.
  • Recently, in my testing, I came across an issue with dgraph v20.07.1 as used by the helm chart, where a Dgraph alpha may not terminate and gets stuck. When that happens, I had to use kubectl delete --force to remove it. I don't think you are running into this, since your rollout restart worked fine.
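To make concrete which on-disk state has to survive a restart, here is a small sketch of Dgraph's data layout (directory names taken from Dgraph's defaults, visible in the Zero log above where it opens BadgerDB with Dir:zw; the /tmp path is a stand-in for the PVC mount, which is normally /dgraph):

```shell
# Sketch only: simulate Dgraph's data layout under a temp path, not a live pod.
# DGRAPH_DATA stands in for the PVC mount point (normally /dgraph).
DGRAPH_DATA="${DGRAPH_DATA:-/tmp/dgraph-layout-demo}"
mkdir -p "$DGRAPH_DATA/zw" "$DGRAPH_DATA/p" "$DGRAPH_DATA/w"

# zw: Zero's Raft state and leases
# p:  Alpha's posting lists (the graph data)
# w:  Alpha's Raft write-ahead log
# If any of these are lost across a restart, the cluster membership state
# cannot be rediscovered.
for d in zw p w; do
  [ -d "$DGRAPH_DATA/$d" ] && echo "$d: present"
done
```

If these directories are intact after the restart (as the PVC listing later in this thread suggests), the problem is not raw data loss but something in how the cluster re-forms.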

I haven't tested KinD or the Rancher local-path-provisioner yet, so this is something I can try out to see what I discover.

@joaquin Thanks again for your quick reply.

I have my storage SSDs (both the boot disk and the attached disk) set to persist, so the data stays across destructive events, as you rightly said.

I came across something similar where I was not able to delete the PVC provisioned by Dgraph and it got stuck. Force deleting didn't work either; only after removing the finalizers from the PVCs was I able to delete them. Anyway, I don't think that is related to this.

If there is anything you would like me to try, do let me know. Thanks.

@joaquin Just got a reply from Ben from Kind here: https://github.com/kubernetes-sigs/kind/issues/1909#issuecomment-717611786. After seeing that, I also did a test on my end to dig into this further. From the looks of it, the data is still available in the PVC even after Docker/VM restarts, but something else is causing the issue.

[screenshot: directory listing of the PVC mounted in the Dgraph pod]

If you notice the dates, those are the folders from the PVC mounted to the Dgraph pod, created one day ago (before the pre-emptible VM restart). So the data does exist in the PVC, but Dgraph still fails.

My guess is that Dgraph might be maintaining some kind of state or info in memory which gets lost across restarts (though it's strange that this doesn't happen on a rollout restart, maybe because not all the nodes go down at once, as you said).

The disk storage is necessary to maintain state (under the hood, the Raft protocol), similar to how other clustered systems such as ZooKeeper, etcd, and Consul maintain their state and elect leaders. So if the data is lost, the state of the cluster is lost, and you effectively start from scratch.

One thing to look at is whether the same claim (PVC) is still bound to the correct volume (PV), or whether new claims are being made. The reclaim policy for the storage should be set to Retain rather than Delete, so that existing storage can be picked up again after an outage event. Otherwise, if the pods are cycled out, they may not be able to grab the same volume.
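For illustration, a StorageClass with a Retain reclaim policy might look like the sketch below (assuming the Rancher local-path provisioner from the setup above; the class name is hypothetical):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path-retain          # hypothetical name
provisioner: rancher.io/local-path
reclaimPolicy: Retain              # default is Delete; Retain keeps the PV after PVC deletion
volumeBindingMode: WaitForFirstConsumer
```

Note that reclaimPolicy on a StorageClass only applies to newly provisioned PVs; existing PVs would need their persistentVolumeReclaimPolicy patched directly.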

@joaquin Hi. While the reclaim policy of my PVs is set to Delete in my case, I am pretty sure that is not the problem, because even after restarts the same PV is used by the PVC and the PVC is still bound to the pod properly. Still, to validate your assumption, I gave it a test.

With reclaim policy set to Delete:

[screenshots: PV and PVC, Alpha pod, storage class]

With reclaim policy set to Retain:

But I still get the same errors as mentioned.

Just had a look at some related issues, and the GKE docs (https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms#using_node_taints) say: "Do not use Preemptible VMs with stateful Pods because they could violate the at-most-one semantics inherent to StatefulSets and could lead to data loss."

Wondering what to do.

The problem here is that it's not just about the pre-emptible VM. Even if I provision a normal VM with a Kind cluster in it and shut the VM down occasionally to save cost, kind will restart when I start it again, leading to the same state, since this happens whenever Docker (and thus the cluster) restarts.

Actually, I am fine with destroying the cluster and creating it afresh every time, since that doesn't take much time, but I am not sure how to automatically restore the data, since the PVCs get destroyed as well and Dgraph export also writes its data to a PVC.

Or, if there is a way to gracefully run some sort of shutdown script for Dgraph, I can trigger that before the pre-emptible VM shuts down (it won't help on a Docker restart, though).
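For the graceful-shutdown part, one option might be a preStop lifecycle hook on the Alpha container. A sketch, assuming curl is available in the dgraph image and using Alpha's HTTP admin endpoint on port 8080; treat the exact endpoint and grace period as assumptions to verify for your version:

```yaml
# Sketch: add to the Alpha container spec in the StatefulSet
lifecycle:
  preStop:
    exec:
      command: ["curl", "-s", "localhost:8080/admin/shutdown"]
```

Combined with a generous terminationGracePeriodSeconds on the pod spec, this would give Alpha time to flush its state before the kubelet sends SIGKILL.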

Looks like the error I get is coming from https://github.com/dgraph-io/dgraph/blob/a63af32ce3d0422da613735fc0f7b4c4068aa405/worker/groups.go#L1139

Hey Vignesh,

I see that the two issues are:

  • Data Persistence
    • For this, could you try adding extra mounts from the host onto the Kind worker container? That way you could use a hostPath on that path in your Dgraph cluster. The config would look like this: https://gist.github.com/slotlocker2/a7b2fb6cddda27738a9140476cda076f (you'd need to create the directories locally, say mkdir -p ~/dgraph-host-mount/alpha ~/dgraph-host-mount/zero, and reflect that path in config.yaml). The data survives a Docker restart.
  • Ensure Dgraph shuts down cleanly when the GKE preemption happens
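The extra-mounts idea would look roughly like this in a kind cluster config (a sketch based on kind's extraMounts feature, not the gist's exact contents; the paths mirror the mkdir example above, and <user> is a placeholder for your home directory since hostPath must be absolute):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /home/<user>/dgraph-host-mount   # created on the host with mkdir -p
        containerPath: /dgraph-host-mount          # referenced by hostPath volumes in the pod spec
```

Because the mount points at a real host directory, the data under it survives kind node containers being recreated on a Docker restart.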

If you want to run a "multi-node" KIND cluster or a 6-node Dgraph cluster (for which you'd need a volumeClaimTemplate), you might want to check out a dynamic hostPath provisioner like https://github.com/rimusz/hostpath-provisioner (haven't tested it).

Please check if this helps. :slight_smile:

@slotlocker2 Thanks a lot for the tips. I will try out everything you mentioned and get back to you tomorrow with the results. I will have to modify the existing Dgraph Helm chart to make the relevant changes from your YAML as well. I will also try hostpath-provisioner (the last commit was a year ago, though, so I'm not sure it will still work).

Regarding the preemptible killer, is it really the only way to go about safe termination?

Hi Vignesh,

(just realized the Kind config is named kind-extra-config as opposed to say a config.yaml in the gist above)

Regarding the host path provisioner, you could use the one from Rancher as long as you point it at the path mounted onto the kind worker container from the underlying GCE persistent disk. You might not even need the hostPath changes in Helm if you can get the local provisioner working in that configuration.

Dgraph should terminate safely unless the underlying node shuts down abruptly, SIGKILLing the process and not allowing it to do its cleanup. A similar project for AWS does just that: it sends SIGTERM to the pods so they can exit cleanly when it gets the 2-minute spot shutdown notification (the grace period would ideally be under 120 seconds for this to work on spot).

@slotlocker2 @joaquin Thanks for your replies. I tried @slotlocker2’s suggestion today and here are the results.

  1. Using hostPath without any provisioner (same gist as shared) and restarting - Worked well
  2. Mounting /var/local-path-provisioner from the Kind node onto the host using extraMounts and restarting the cluster - Did not work
  3. Replacing the same gist shared by @slotlocker2 to use the standard storage class instead of using hostpath and restarting - Worked well
  4. Using the Dgraph helm chart with standard storage class and restarting - Did not work

I will go ahead with what is working for now (the helm chart for prod and the YAML for dev), but to actually help you reproduce the problem I am facing, I have added a gist sharing the config:

THIS DOES NOT WORK ON RESTARTS: https://gist.github.com/tvvignesh/d7114658800add046460b38c74ae2f5f#file-dgraph-buggy-yml

THIS WORKS ON RESTARTS: https://gist.github.com/tvvignesh/d7114658800add046460b38c74ae2f5f#file-dgraph-working-yml

So, it looks like it's not an issue with Kind or the local-path provisioner, but something to do with the helm chart.

The significant difference I see is the command being used to start zero

Helm chart has:

- bash
- "-c"
- |
  set -ex
  [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
  ordinal=${BASH_REMATCH[1]}
  idx=$(($ordinal + 1))
  if [[ $ordinal -eq 0 ]]; then
    exec dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 5
  else
    exec dgraph zero --my=$(hostname -f):5080 --peer dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 5
  fi
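The chart's startup logic derives Zero's --idx from the StatefulSet ordinal at the end of the pod hostname. A standalone sketch of that extraction (the hostname is a made-up example; this uses a POSIX parameter expansion equivalent to the chart's BASH_REMATCH regex capture):

```shell
# Hypothetical pod hostname following the StatefulSet <name>-<ordinal> pattern
host="dgraph-dgraph-zero-2"

# Equivalent of the chart's regex capture: keep everything after the last '-'
ordinal=${host##*-}
idx=$((ordinal + 1))    # the chart passes ordinal+1 as --idx

echo "ordinal=$ordinal idx=$idx"    # prints: ordinal=2 idx=3
```

Since this derivation only depends on the hostname and not on persisted state, it should produce the same --idx before and after a restart, so the idx/peer flags themselves are a less likely culprit than timing or ordering differences.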

and the yaml has:

- bash
- "-c"
- |
  set -ex
  dgraph zero --my=$(hostname -f):5080

There is also the fact that the helm chart splits the deployment into multiple pods, while this YAML uses multiple containers within one pod.

Just to see the difference in the pods before and after a restart with the helm chart, I did a diff of the env, and it looks like the only thing that changes across restarts is the PPID:

Zero before and after: https://www.diffchecker.com/4XLv6ADk
Alpha before and after: https://www.diffchecker.com/AblTFiz1

Everything else remains the same, as you can see.

UPDATE: GCE just killed my VM, and this time the Dgraph cluster restarted without any issues. So we might have to see how to fix the helm chart to match this: https://gist.github.com/tvvignesh/d7114658800add046460b38c74ae2f5f#file-dgraph-working-yml

Thanks again.
