Dgraph fails to start on restarts with Kind (Kubernetes)

Report a Dgraph Bug

What version of Dgraph are you using?

v20.07.1

What is the hardware spec (RAM, OS)?

Ubuntu 20.04, 32 GB RAM

Steps to reproduce the issue (command/config used to run Dgraph).

  1. Create a Kubernetes cluster locally using kind.
  2. Deploy Dgraph (1 Zero, 1 Alpha, 1 Ratel) with the Helm chart, using kind's built-in standard storage class, which uses Rancher's local-path provisioner (https://github.com/rancher/local-path-provisioner) with ReadWriteOnce:
helm upgrade --install --namespace <namespace> dgraph -f <path>/values.yaml  --set zero.replicaCount=1 --set alpha.replicaCount=1 --set alpha.persistence.size=200Mi --set zero.persistence.size=200Mi --set alpha.persistence.accessModes={ReadWriteOnce} --set zero.persistence.accessModes={ReadWriteOnce} --set zero.persistence.storageClass=standard --set alpha.persistence.storageClass=standard ./charts/dgraph
  3. Verify that everything works.
  4. Restart the cluster by restarting Docker: sudo service docker restart
  5. Dgraph fails to start with the errors below (the pods keep running, though):

Alpha:

E1023 15:47:59.946666      19 groups.go:1137] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
E1023 15:48:00.946904      19 groups.go:1137] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I1023 15:48:01.144411      19 admin.go:593] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.
E1023 15:48:01.947163      19 groups.go:1137] Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
(the SubscribeForUpdates error repeats every second, and the admin.go "server is not ready" message every five seconds)

Zero:

I1023 15:35:19.507573      18 run.go:127] Setting up grpc listener at: 0.0.0.0:5080
I1023 15:35:19.508051      18 run.go:127] Setting up http listener at: 0.0.0.0:6080
I1023 15:35:19.508353      18 run.go:298] Opening zero BadgerDB with options: {Dir:zw ValueDir:zw SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0xc00007bfa0 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:false BlockCacheSize:0 IndexCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:15 LevelOneSize:268435456 ValueLogFileSize:67108864 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:true managedTxns:false maxBatchCount:0 maxBatchSize:0}
badger 2020/10/23 15:35:19 INFO: All 1 tables opened in 74ms
badger 2020/10/23 15:35:20 INFO: Replaying file id: 0 at offset: 21571
badger 2020/10/23 15:35:20 INFO: Replay took: 12.285µs
I1023 15:35:20.190879      18 node.go:149] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc0003c4440 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2be83e8 DisableProposalForwarding:false}
I1023 15:35:20.191281      18 node.go:318] Found hardstate: {Term:2 Vote:1 Commit:132 XXX_unrecognized:[]}
I1023 15:35:20.191555      18 node.go:327] Group 0 found 132 entries
I1023 15:35:20.191575      18 raft.go:448] Restarting node for dgraphzero
I1023 15:35:20.191835      18 log.go:34] 1 became follower at term 2
I1023 15:35:20.192309      18 log.go:34] newRaft 1 [peers: [], term: 2, commit: 132, applied: 0, lastindex: 132, lastterm: 2]
I1023 15:35:20.192522      18 run.go:374] Running Dgraph Zero...
I1023 15:35:20.505187      18 node.go:186] Setting conf state to nodes:1 
I1023 15:35:20.505312      18 raft.go:708] Done applying conf change at 0x1
I1023 15:35:20.505627      18 pool.go:160] CONNECTING to dgraph-dgraph-alpha-0.dgraph-dgraph-alpha-headless.db.svc.cluster.local:7080
W1023 15:35:20.505659      18 pool.go:191] Shutting down extra connection to dgraph-dgraph-alpha-0.dgraph-dgraph-alpha-headless.db.svc.cluster.local:7080
W1023 15:35:20.509770      18 raft.go:739] Raft.Ready took too long to process: Timer Total: 316ms. Breakdown: [{sync 225ms} {disk 86ms} {proposals 5ms} {advance 0s}]. Num entries: 0. MustSync: true
I1023 15:35:21.193543      18 log.go:34] 1 no leader at term 2; dropping index reading msg
W1023 15:35:23.193596      18 node.go:675] [0x1] Read index context timed out
I1023 15:35:23.193654      18 log.go:34] 1 no leader at term 2; dropping index reading msg
I1023 15:35:23.893708      18 log.go:34] 1 is starting a new election at term 2
I1023 15:35:23.893752      18 log.go:34] 1 became pre-candidate at term 2
I1023 15:35:23.893761      18 log.go:34] 1 received MsgPreVoteResp from 1 at term 2
I1023 15:35:23.893978      18 log.go:34] 1 became candidate at term 3
I1023 15:35:23.893997      18 log.go:34] 1 received MsgVoteResp from 1 at term 3
I1023 15:35:23.894327      18 log.go:34] 1 became leader at term 3
I1023 15:35:23.894358      18 log.go:34] raft.node: 1 elected leader 1 at term 3
I1023 15:35:23.894424      18 raft.go:673] I've become the leader, updating leases.
I1023 15:35:23.894444      18 assign.go:42] Updated Lease id: 10001. Txn Ts: 10001
W1023 15:35:25.193693      18 node.go:675] [0x1] Read index context timed out
W1023 15:35:25.864210      18 pool.go:254] Connection lost with dgraph-dgraph-alpha-0.dgraph-dgraph-alpha-headless.db.svc.cluster.local:7080. Error: rpc error: code = Canceled desc = context canceled
W1023 15:35:25.865755      18 pool.go:254] Connection lost with dgraph-dgraph-alpha-0.dgraph-dgraph-alpha-headless.db.svc.cluster.local:7080. Error: rpc error: code = Unknown desc = No node has been set up yet

More info:

Name:               dgraph-dgraph-alpha
Namespace:          db
CreationTimestamp:  Fri, 23 Oct 2020 16:07:34 +0000
Selector:           app=dgraph,chart=dgraph-0.0.11,component=alpha,release=dgraph
Labels:             app=dgraph
                    app.kubernetes.io/managed-by=Helm
                    chart=dgraph-0.0.11
                    component=alpha
                    heritage=Helm
                    release=dgraph
Annotations:        meta.helm.sh/release-name: dgraph
                    meta.helm.sh/release-namespace: db
Replicas:           1 desired | 0 total
Update Strategy:    RollingUpdate
Pods Status:        0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       app=dgraph
                chart=dgraph-0.0.11
                component=alpha
                release=dgraph
  Annotations:  prometheus.io/path: /debug/prometheus_metrics
                prometheus.io/port: 8080
                prometheus.io/scrape: true
  Containers:
   dgraph-dgraph-alpha:
    Image:       docker.io/dgraph/dgraph:v20.07.1
    Ports:       7080/TCP, 8080/TCP, 9080/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      bash
      -c
      set -ex
      dgraph alpha --my=$(hostname -f):7080 --zero dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080
      
    Requests:
      memory:  100Mi
    Environment:
      POD_NAMESPACE:         (v1:metadata.namespace)
      DGRAPH_ALPHA_CONFIG:  /dgraph/config/config.toml
    Mounts:
      /dgraph from datadir (rw)
      /dgraph/config from config-volume (rw)
  Volumes:
   datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir
    ReadOnly:   false
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      dgraph-dgraph-alpha-config
    Optional:  false
Volume Claims:
  Name:          datadir
  StorageClass:  standard
  Labels:        <none>
  Annotations:   volume.alpha.kubernetes.io/storage-class=anything
  Capacity:      200Mi
  Access Modes:  [ReadWriteOnce]
Events:          <none>

It works again only after I destroy the entire cluster and recreate it.

Just to validate that a normal pod restart works, I ran kubectl -n db rollout restart statefulset dgraph-dgraph-alpha and everything was fine.

Expected behaviour and actual result.

Dgraph should work even after restarts. This is important because I am running kind within a pre-emptible VM in GCP, which resets every 24 hours. Since I am not sure whether this is an issue with Dgraph or Kind, I have also filed an issue in the Kind repo: https://github.com/kubernetes-sigs/kind/issues/1909

This occurs every day and the error is consistently reproducible. Is there any other info you would like me to provide?

Looking at your situation (a preemptible GCE VM running a Dgraph cluster in KinD), here are some thoughts:

  • Restarting Docker simulates a destructive event roughly equivalent to restarting all of a GKE cluster's VM instances (GCE) at once. I haven't tested this scenario yet.
  • The big question is whether the data is saved to a volume that persists across such destructive events.
    • In the preemptible scenario, where Google reclaims the GCE resource, the data would be lost unless the whole chain holds: a Docker volume is allocated on an external disk that GCE mounts at startup, Docker uses that same volume when starting KinD, and the Dgraph pods re-attach to it (three layers: PVC/PV -> Docker volume -> GCE external disk).
    • Why is the persistent volume so important? Both Zero and the Alphas maintain cluster and membership state on disk, so if that data is lost the state cannot be recovered and the cluster ends up in a bad state. The graph data itself is also stored on disk by the Alphas.
  • Recently, in my testing, I came across an issue with dgraph v20.07.1 as used by the helm chart, where a Dgraph alpha may not terminate and gets stuck. When that happens, I had to use kubectl delete --force to remove it. I don't think you are running into this, since your rollout restart worked fine.
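To make concrete which on-disk state has to survive a restart, here is a small sketch of Dgraph's data layout (directory names taken from Dgraph's defaults, visible in the Zero log above where it opens BadgerDB with Dir:zw; the /tmp path is a stand-in for the PVC mount, which is normally /dgraph):

```shell
# Sketch only: simulate Dgraph's data layout under a temp path, not a live pod.
# DGRAPH_DATA stands in for the PVC mount point (normally /dgraph).
DGRAPH_DATA="${DGRAPH_DATA:-/tmp/dgraph-layout-demo}"
mkdir -p "$DGRAPH_DATA/zw" "$DGRAPH_DATA/p" "$DGRAPH_DATA/w"

# zw: Zero's Raft state and leases
# p:  Alpha's posting lists (the graph data)
# w:  Alpha's Raft write-ahead log
# If any of these are lost across a restart, the cluster membership state
# cannot be rediscovered.
for d in zw p w; do
  [ -d "$DGRAPH_DATA/$d" ] && echo "$d: present"
done
```

If these directories are intact after the restart (as the PVC listing later in this thread suggests), the problem is not raw data loss but something in how the cluster re-forms.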

I haven't tested KinD or the Rancher local-path-provisioner yet, so this is something I can try out to see what I discover.

@joaquin Thanks again for your quick reply.

I have my storage SSDs (both the boot disk and the attached disk) set to persist, so the data stays across destructive events, as you rightly said.

I came across something similar where I was not able to delete the PVC provisioned by Dgraph and it got stuck. Force deleting didn't work either; only after removing the finalizers from the PVCs was I able to delete them. Anyway, I don't think that is related to this.

If there is anything you would like me to try, do let me know. Thanks.

@joaquin Just got a reply from Ben from Kind here: https://github.com/kubernetes-sigs/kind/issues/1909#issuecomment-717611786. After seeing that, I also did a test on my end to dig into this further. From the looks of it, the data is still available in the PVC even after Docker/VM restarts, but something else is causing the issue.

[screenshot: directory listing of the PVC mounted in the Dgraph pod]

If you notice the dates, those are the folders from the PVC mounted to the Dgraph pod, created one day ago (before the pre-emptible VM restart). So the data does exist in the PVC, but Dgraph still fails.

My guess is that Dgraph might be maintaining some kind of state or info in memory which gets lost across restarts (though it's strange that this doesn't happen on a rollout restart, maybe because not all the nodes go down at once, as you said).

The disk storage is necessary to maintain state (under the hood, the Raft protocol), similar to how other clustered systems such as ZooKeeper, etcd, and Consul maintain their state and elect leaders. So if the data is lost, the state of the cluster is lost, and you effectively start from scratch.

One thing to look at is whether the same claim (PVC) is still bound to the correct volume (PV), or whether new claims are being made. The reclaim policy for the storage should be set to Retain rather than Delete, so that existing storage can be picked up again after an outage event. Otherwise, if the pods are cycled out, they may not be able to grab the same volume.
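For illustration, a StorageClass with a Retain reclaim policy might look like the sketch below (assuming the Rancher local-path provisioner from the setup above; the class name is hypothetical):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path-retain          # hypothetical name
provisioner: rancher.io/local-path
reclaimPolicy: Retain              # default is Delete; Retain keeps the PV after PVC deletion
volumeBindingMode: WaitForFirstConsumer
```

Note that reclaimPolicy on a StorageClass only applies to newly provisioned PVs; existing PVs would need their persistentVolumeReclaimPolicy patched directly.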

@joaquin Hi. While the reclaim policy of my PVs is set to Delete in my case, I am pretty sure that is not the problem, because even after restarts the same PV is used by the PVC and the PVC is still bound to the pod properly. Still, to validate your assumption, I gave it a test.

With reclaim policy set to Delete:

[screenshots: PV and PVC, Alpha pod, storage class]

With reclaim policy set to Retain:

But I still get the same errors as mentioned.

Just had a look at some related issues, and the GKE docs (https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms#using_node_taints) say: "Do not use Preemptible VMs with stateful Pods because they could violate the at-most-one semantics inherent to StatefulSets and could lead to data loss."

Wondering what to do.

The problem here is that it's not just about the pre-emptible VM. Even if I provision a normal VM with a Kind cluster in it and shut the VM down occasionally to save cost, kind will restart when I start it again, leading to the same state, since this happens whenever Docker (and thus the cluster) restarts.

Actually, I am fine with destroying the cluster and creating it afresh every time, since that doesn't take much time, but I am not sure how to automatically restore the data, since the PVCs get destroyed as well and Dgraph export also writes its data to a PVC.

Or, if there is a way to gracefully run some sort of shutdown script for Dgraph, I can trigger that before the pre-emptible VM shuts down (it won't help on a Docker restart, though).
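For the graceful-shutdown part, one option might be a preStop lifecycle hook on the Alpha container. A sketch, assuming curl is available in the dgraph image and using Alpha's HTTP admin endpoint on port 8080; treat the exact endpoint and grace period as assumptions to verify for your version:

```yaml
# Sketch: add to the Alpha container spec in the StatefulSet
lifecycle:
  preStop:
    exec:
      command: ["curl", "-s", "localhost:8080/admin/shutdown"]
```

Combined with a generous terminationGracePeriodSeconds on the pod spec, this would give Alpha time to flush its state before the kubelet sends SIGKILL.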

Looks like the error I get is coming from https://github.com/dgraph-io/dgraph/blob/a63af32ce3d0422da613735fc0f7b4c4068aa405/worker/groups.go#L1139

Hey Vignesh,

I see that the two issues are:

  • Data Persistence
    • For this, could you try adding extra mounts from the host onto the Kind worker container? That way you could use a hostPath on that path in your Dgraph cluster. The config would look like this: https://gist.github.com/slotlocker2/a7b2fb6cddda27738a9140476cda076f (you'd need to create the directories locally, say mkdir -p ~/dgraph-host-mount/alpha ~/dgraph-host-mount/zero, and reflect that path in config.yaml). The data survives a Docker restart.
  • Ensure Dgraph shuts down cleanly when the GKE preemption happens
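The extra-mounts idea would look roughly like this in a kind cluster config (a sketch based on kind's extraMounts feature, not the gist's exact contents; the paths mirror the mkdir example above, and <user> is a placeholder for your home directory since hostPath must be absolute):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /home/<user>/dgraph-host-mount   # created on the host with mkdir -p
        containerPath: /dgraph-host-mount          # referenced by hostPath volumes in the pod spec
```

Because the mount points at a real host directory, the data under it survives kind node containers being recreated on a Docker restart.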

If you want to run a "multi-node" KIND cluster or a 6-node Dgraph cluster (for which you'd need a volumeClaimTemplate), you might want to check out a dynamic hostPath provisioner like https://github.com/rimusz/hostpath-provisioner (haven't tested it).

Please check if this helps. :slight_smile:

@slotlocker2 Thanks a lot for the tips. I will try out everything you mentioned and get back to you tomorrow with the results. I will have to modify the existing Dgraph Helm chart to make the relevant changes from your YAML as well. I will also try hostpath-provisioner (the last commit was a year ago, though, so I'm not sure it will still work).

Regarding the preemptible killer, is it really the only way to go about safe termination?

Hi Vignesh,

(just realized the Kind config is named kind-extra-config as opposed to say a config.yaml in the gist above)

Regarding the host path provisioner, you could use the one from Rancher as long as you point it at the path mounted onto the kind worker container from the underlying GCE persistent disk. You might not even need the hostPath changes in Helm if you can get the local provisioner working in that configuration.

Dgraph should terminate safely unless the underlying node shuts down abruptly, SIGKILLing the process and not allowing it to do its cleanup. A similar project for AWS does just that: it sends SIGTERM to the pods so they can exit cleanly when it gets the 2-minute spot shutdown notification (the grace period would ideally be under 120 seconds for this to work on spot).

@slotlocker2 @joaquin Thanks for your replies. I tried @slotlocker2’s suggestion today and here are the results.

  1. Using hostPath without any provisioner (same gist as shared) and restarting - Worked well
  2. Mounting /var/local-path-provisioner from the Kind node onto the host using extraMounts and restarting the cluster - Did not work
  3. Replacing the same gist shared by @slotlocker2 to use the standard storage class instead of using hostpath and restarting - Worked well
  4. Using the Dgraph helm chart with standard storage class and restarting - Did not work

I will go ahead with what is working for now (the helm chart for prod and the YAML for dev), but to actually help you reproduce the problem I am facing, I have added a gist sharing the config:

THIS DOES NOT WORK ON RESTARTS: https://gist.github.com/tvvignesh/d7114658800add046460b38c74ae2f5f#file-dgraph-buggy-yml

THIS WORKS ON RESTARTS: https://gist.github.com/tvvignesh/d7114658800add046460b38c74ae2f5f#file-dgraph-working-yml

So, it looks like it's not an issue with Kind or the local-path provisioner, but something to do with the helm chart.

The significant difference I see is the command being used to start zero

Helm chart has:

- bash
- "-c"
- |
  set -ex
  [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
  ordinal=${BASH_REMATCH[1]}
  idx=$(($ordinal + 1))
  if [[ $ordinal -eq 0 ]]; then
    exec dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 5
  else
    exec dgraph zero --my=$(hostname -f):5080 --peer dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 5
  fi
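The chart's startup logic derives Zero's --idx from the StatefulSet ordinal at the end of the pod hostname. A standalone sketch of that extraction (the hostname is a made-up example; this uses a POSIX parameter expansion equivalent to the chart's BASH_REMATCH regex capture):

```shell
# Hypothetical pod hostname following the StatefulSet <name>-<ordinal> pattern
host="dgraph-dgraph-zero-2"

# Equivalent of the chart's regex capture: keep everything after the last '-'
ordinal=${host##*-}
idx=$((ordinal + 1))    # the chart passes ordinal+1 as --idx

echo "ordinal=$ordinal idx=$idx"    # prints: ordinal=2 idx=3
```

Since this derivation only depends on the hostname and not on persisted state, it should produce the same --idx before and after a restart, so the idx/peer flags themselves are a less likely culprit than timing or ordering differences.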

and the yaml has:

- bash
- "-c"
- |
  set -ex
  dgraph zero --my=$(hostname -f):5080

There is also the fact that the helm chart splits the deployment into multiple pods, while this YAML uses multiple containers within one pod.

Just to see the difference in the pods before and after a restart with the helm chart, I did a diff of the env, and it looks like the only thing that changes across restarts is the PPID:

Zero before and after: https://www.diffchecker.com/4XLv6ADk
Alpha before and after: https://www.diffchecker.com/AblTFiz1

Everything else remains the same, as you can see.

UPDATE: GCE just killed my VM, and this time the Dgraph cluster restarted without any issues. So we might have to see how to fix the helm chart to match this: https://gist.github.com/tvvignesh/d7114658800add046460b38c74ae2f5f#file-dgraph-working-yml

Thanks again.
