Kubernetes cluster: Zero sometimes reports "Unhealthy connection"


(Valdanito) #1

Hi,
I run 3 Dgraph Alphas and 3 Dgraph Zeros in Kubernetes, and the cluster generally runs fine.
But the system does not feel very stable; for example, Zero sometimes loses its connection:

zero-0 logs

 W0428 02:13:56.930532      10 node.go:387] Unable to send message to peer: 0x2. Error: Unhealthy connection
W0428 02:14:45.663902      10 raft.go:697] Raft.Ready took too long to process: 103ms. Most likely due to slow disk: 103ms. Num entries: 1. MustSync: true
I0428 02:15:20.591981      10 oracle.go:106] Purged below ts:200008, len(o.commits):0, len(o.rowCommit):1000
I0428 02:52:55.635546      10 oracle.go:106] Purged below ts:200056, len(o.commits):0, len(o.rowCommit):10363
I0428 02:55:29.643552      10 oracle.go:106] Purged below ts:200070, len(o.commits):0, len(o.rowCommit):41653
I0428 03:03:00.669373      10 oracle.go:106] Purged below ts:200078, len(o.commits):0, len(o.rowCommit):103605
I0428 03:10:20.693634      10 oracle.go:106] Purged below ts:200093, len(o.commits):0, len(o.rowCommit):1217
W0428 03:15:06.487733      10 raft.go:697] Raft.Ready took too long to process: 132ms. Most likely due to slow disk: 68ms. Num entries: 0. MustSync: false
W0428 03:32:20.866029      10 raft.go:697] Raft.Ready took too long to process: 207ms. Most likely due to slow disk: 207ms. Num entries: 1. MustSync: true
W0428 03:32:21.270944      10 raft.go:697] Raft.Ready took too long to process: 404ms. Most likely due to slow disk: 404ms. Num entries: 0. MustSync: false 

zero-1 logs

 I0428 02:13:54.580867      10 node.go:85] 2 [logterm: 2, index: 5958] sent MsgPreVote request to 1 at term 2
I0428 02:13:54.580985      10 node.go:85] 2 [logterm: 2, index: 5958] sent MsgPreVote request to 3 at term 2
W0428 02:13:56.343446      10 node.go:636] [0x2] Read index context timed out
I0428 02:13:56.938314      10 node.go:85] 2 [term: 2] received a MsgHeartbeat message with higher term from 1 [term: 3]
I0428 02:13:56.938506      10 node.go:85] 2 became follower at term 3
I0428 02:13:56.938551      10 node.go:85] raft.node: 2 elected leader 1 at term 3
W0428 02:13:58.343708      10 node.go:636] [0x2] Read index context timed out
W0428 02:14:42.607286      10 raft.go:697] Raft.Ready took too long to process: 136ms. Most likely due to slow disk: 136ms. Num entries: 1. MustSync: true
W0428 02:14:45.648886      10 raft.go:697] Raft.Ready took too long to process: 102ms. Most likely due to slow disk: 102ms. Num entries: 1. MustSync: true
I0428 02:15:20.592201      10 oracle.go:106] Purged below ts:200008, len(o.commits):0, len(o.rowCommit):0
I0428 02:52:55.607153      10 oracle.go:106] Purged below ts:200056, len(o.commits):0, len(o.rowCommit):0
I0428 02:55:29.618570      10 oracle.go:106] Purged below ts:200070, len(o.commits):0, len(o.rowCommit):0
I0428 03:03:00.647417      10 oracle.go:106] Purged below ts:200078, len(o.commits):0, len(o.rowCommit):0
I0428 03:10:20.661362      10 oracle.go:106] Purged below ts:200093, len(o.commits):0, len(o.rowCommit):0
W0428 03:15:06.514159      10 raft.go:697] Raft.Ready took too long to process: 134ms. Most likely due to slow disk: 86ms. Num entries: 0. MustSync: false
W0428 03:32:20.925135      10 raft.go:697] Raft.Ready took too long to process: 233ms. Most likely due to slow disk: 233ms. Num entries: 1. MustSync: true
W0428 03:32:21.303296      10 raft.go:697] Raft.Ready took too long to process: 378ms. Most likely due to slow disk: 378ms. Num entries: 0. MustSync: false 

zero-2 logs

 I0428 02:13:42.445888      10 oracle.go:106] Purged below ts:195444, len(o.commits):75, len(o.rowCommit):0
I0428 02:13:42.445958      10 oracle.go:106] Purged below ts:195640, len(o.commits):0, len(o.rowCommit):0
W0428 02:13:42.471062      10 raft.go:697] Raft.Ready took too long to process: 397ms. Most likely due to slow disk: 112ms. Num entries: 0. MustSync: true
I0428 02:13:43.069597      10 node.go:85] 3 no leader at term 2; dropping index reading msg
W0428 02:13:45.069733      10 node.go:636] [0x3] Read index context timed out
I0428 02:13:45.069827      10 node.go:85] 3 no leader at term 2; dropping index reading msg
I0428 02:13:45.109618      10 node.go:85] 3 is starting a new election at term 2
I0428 02:13:45.109659      10 node.go:85] 3 became pre-candidate at term 2
I0428 02:13:45.109674      10 node.go:85] 3 received MsgPreVoteResp from 3 at term 2
I0428 02:13:45.109817      10 node.go:85] 3 [logterm: 2, index: 5958] sent MsgPreVote request to 1 at term 2
I0428 02:13:45.110000      10 node.go:85] 3 [logterm: 2, index: 5958] sent MsgPreVote request to 2 at term 2 
 W0428 02:13:54.656990      10 node.go:387] Unable to send message to peer: 0x2. Error: Unhealthy connection
W0428 02:14:45.759985      10 raft.go:697] Raft.Ready took too long to process: 137ms. Most likely due to slow disk: 137ms. Num entries: 1. MustSync: true
I0428 02:15:20.668753      10 oracle.go:106] Purged below ts:200008, len(o.commits):0, len(o.rowCommit):0
I0428 02:52:55.662603      10 oracle.go:106] Purged below ts:200056, len(o.commits):0, len(o.rowCommit):0
I0428 02:55:29.655839      10 oracle.go:106] Purged below ts:200070, len(o.commits):0, len(o.rowCommit):0
I0428 03:03:00.664557      10 oracle.go:106] Purged below ts:200078, len(o.commits):0, len(o.rowCommit):0
I0428 03:10:20.641838      10 oracle.go:106] Purged below ts:200093, len(o.commits):0, len(o.rowCommit):0
W0428 03:15:06.504088      10 raft.go:697] Raft.Ready took too long to process: 101ms. Most likely due to slow disk: 52ms. Num entries: 0. MustSync: false
W0428 03:32:20.846868      10 raft.go:697] Raft.Ready took too long to process: 207ms. Most likely due to slow disk: 207ms. Num entries: 1. MustSync: true
W0428 03:32:21.251500      10 raft.go:697] Raft.Ready took too long to process: 403ms. Most likely due to slow disk: 403ms. Num entries: 0. MustSync: false
W0428 03:32:29.699624      10 raft.go:697] Raft.Ready took too long to process: 117ms. Most likely due to slow disk: 117ms. Num entries: 0. MustSync: false 

Please help me analyze what caused this problem.


(Michel Conrado) #2

Okay,

Please share some details (e.g. are you using SSDs?), your YAML, the commands you ran, the Dgraph version, and so on. From these logs alone I can't tell anything. What steps did you take before the connection became unhealthy?
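
For example (just a sketch, adjust pod names and namespace to your setup), something like this would already tell a lot:

kubectl get pvc,pv                              # which storage class backs the Zero/Alpha volumes
kubectl describe pod dgraph-zero-0              # events, restarts, node placement
kubectl exec dgraph-zero-0 -- dgraph version    # exact Dgraph version running in the pod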


(Valdanito) #3

Sorry for the late reply; I was away for the holidays.
It is not running on SSDs. The Dgraph version is v1.0.14.
Here is the YAML:

apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero-public
  labels:
    app: dgraph-zero
spec:
  type: LoadBalancer
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  - port: 6080
    targetPort: 6080
    name: zero-http
  selector:
    app: dgraph-zero
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha-public
  labels:
    app: dgraph-alpha
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
    name: alpha-http
  - port: 9080
    targetPort: 9080
    name: alpha-grpc
  selector:
    app: dgraph-alpha
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-ratel-public
  labels:
    app: dgraph-ratel
spec:
  type: LoadBalancer
  ports:
  - port: 8000
    targetPort: 8000
    name: ratel-http
  selector:
    app: dgraph-ratel
---
# This is a headless service which is necessary for discovery for a dgraph-zero StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero
  labels:
    app: dgraph-zero
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  clusterIP: None
  selector:
    app: dgraph-zero
---
# This is a headless service which is necessary for discovery for a dgraph-alpha StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha
  labels:
    app: dgraph-alpha
spec:
  ports:
  - port: 7080
    targetPort: 7080
    name: alpha-grpc-int
  clusterIP: None
  selector:
    app: dgraph-alpha
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph-zero
spec:
  serviceName: "dgraph-zero"
  replicas: 3
  selector:
    matchLabels:
      app: dgraph-zero
  template:
    metadata:
      labels:
        app: dgraph-zero
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/path: '/debug/prometheus_metrics'
        prometheus.io/port: '6080'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-zero
              topologyKey: kubernetes.io/hostname
      containers:
      - name: zero
        image: dgraph/dgraph:v1.0.14
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5080
          name: zero-grpc
        - containerPort: 6080
          name: zero-http
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        env:
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - bash
          - "-c"
          - |
            set -ex
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            idx=$(($ordinal + 1))
            if [[ $ordinal -eq 0 ]]; then
              dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 3
            else
              dgraph zero --my=$(hostname -f):5080 --peer dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 3
            fi
      terminationGracePeriodSeconds: 30
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 5Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph-alpha
spec:
  serviceName: "dgraph-alpha"
  replicas: 3
  selector:
    matchLabels:
      app: dgraph-alpha
  template:
    metadata:
      labels:
        app: dgraph-alpha
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/path: '/debug/prometheus_metrics'
        prometheus.io/port: '8080'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-alpha
              topologyKey: kubernetes.io/hostname
      initContainers:
        - name: init-alpha
          image: dgraph/dgraph:master
          command:
            - bash
            - "-c"
            - |
              echo "Write to /dgraph/doneinit when ready."
              until [ -f /dgraph/doneinit ]; do sleep 2; done
          volumeMounts:
            - name: datadir
              mountPath: /dgraph
      containers:
      - name: alpha
        image: dgraph/dgraph:v1.0.14
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: "20"
            memory: 64Gi
          requests:
            cpu: "10"
            memory: 32Gi
        ports:
        - containerPort: 7080
          name: alpha-grpc-int
        - containerPort: 8080
          name: alpha-http
        - containerPort: 9080
          name: alpha-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        env:
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph alpha --my=$(hostname -f):7080 --lru_mb 21840 --query_edge_limit 3000000  --zero ${DGRAPH_ZERO_PUBLIC_PORT_5080_TCP_ADDR}:5080
      terminationGracePeriodSeconds: 30
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dgraph-ratel
  labels:
    app: dgraph-ratel
spec:
  selector:
    matchLabels:
      app: dgraph-ratel
  template:
    metadata:
      labels:
        app: dgraph-ratel
    spec:
      containers:
      - name: ratel
        image: dgraph/dgraph:v1.0.14
        ports:
        - containerPort: 8000
        command:
          - dgraph-ratel

Commands I ran:

kubectl create -f dgraph-ha.yaml
#Bulk Loader
kubectl cp /home/user/rdfData/ dgraph-zero-0:/dgraph/
kubectl exec -it dgraph-zero-0 sh
dgraph bulk -r rdfDoc/ -s allSchema.schema --map_shards=1 --reduce_shards=1 --zero=localhost:5080
kubectl cp dgraph-zero-0:/dgraph/out/ /home/user/bulkData/
# Initializing the Alphas
kubectl cp /home/user/bulkData/0/p/ dgraph-alpha-0:/dgraph/ -c init-alpha
kubectl exec dgraph-alpha-0 -c init-alpha touch /dgraph/doneinit

The K8s cluster runs normally at first, but after a while the error occurs again (Zero unhealthy connection).
Restarting the Zero pod fixes it.
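
A minimal way to do that restart, assuming dgraph-zero-0 is the affected pod, is to delete it and let the StatefulSet recreate it with the same PVC:

kubectl delete pod dgraph-zero-0
kubectl get pods -w      # wait until dgraph-zero-0 is Running again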


(Michel Conrado) #4

Why are you using master here but dgraph/dgraph:v1.0.14 everywhere else? They're not compatible at all.
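
A quick way to double-check which images each workload actually references (just a sketch, assuming the manifests above):

kubectl get statefulset dgraph-alpha -o jsonpath='{.spec.template.spec.initContainers[*].image} {.spec.template.spec.containers[*].image}{"\n"}'
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'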


(Valdanito) #5

I previously used the master version of Dgraph for testing and forgot to update this configuration when upgrading.
Thank you for pointing out the problem. I will upgrade the pods and test whether the problem still occurs.
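
A minimal way to roll that out, assuming the init-alpha image is first changed to dgraph/dgraph:v1.0.14 in dgraph-ha.yaml:

kubectl apply -f dgraph-ha.yaml                     # push the updated pod template
kubectl rollout status statefulset/dgraph-alpha     # RollingUpdate recreates the pods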


(Michel Conrado) #6

btw, master isn’t updated often.


(Valdanito) #7

I have upgraded the initContainers to v1.0.14, but Zero still reports an unhealthy connection when I try to import data with the live loader.

dgraph live -r rdf.rdf -d 192.168.31.249:9080 -z 192.168.31.248:5080
Dgraph version   : v1.0.14
Commit SHA-1     : 26cb2f94
Commit timestamp : 2019-04-12 13:21:56 -0700
Branch           : HEAD
Go version       : go1.11.5

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit https://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2018 Dgraph Labs, Inc.
Creating temp client directory at /tmp/x308833140
badger 2019/05/06 10:17:49 INFO: All 0 tables opened in 0s
2019/05/06 10:17:59 Unable to connect to zero, Is it running at 192.168.31.248:5080? error: context deadline exceeded
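
A quick way to check whether Zero is reachable at all from the machine running the loader (assuming the Zero HTTP port 6080 is exposed on the same address as 5080):

curl http://192.168.31.248:6080/state    # Zero returns cluster state when it is healthy
nc -zv 192.168.31.248 5080               # is the gRPC port open at all?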

I guess it is a problem with the k8s NFS (Network File System) volumes, because the Zero log keeps warning:

Raft.Ready took too long to process: 101ms. Most likely due to slow disk: 52ms.
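
To confirm the slow-disk theory, a crude write-latency test on the Zero volume (just a sketch, assuming GNU dd with oflag=dsync is available in the image) would be:

kubectl exec -it dgraph-zero-0 -- bash -c 'dd if=/dev/zero of=/dgraph/dd-test bs=4k count=1000 oflag=dsync; rm /dgraph/dd-test'

This syncs every write, similar to what the Raft WAL does, so it gives a rough idea of how much slower the NFS-backed volume is compared to a local SSD.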

(Michel Conrado) #8

Try this: How to: Live Load distributed with Kubernetes, Docker or Binary,
or try my script for k8s, https://github.com/MichelDiz/Dgraph-Bulk-Script, which in this case uses bulk instead of live load.
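
If you do want to keep using the live loader, another option is to run it from inside the cluster so it talks to the pods directly instead of going through the LoadBalancer, roughly (assuming the default namespace and that the RDF file has been copied into the Alpha pod first):

kubectl cp /home/user/rdfData/rdf.rdf dgraph-alpha-0:/dgraph/
kubectl exec -it dgraph-alpha-0 -- dgraph live -r /dgraph/rdf.rdf -d localhost:9080 -z dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080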


(system) closed #9

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.