Kubernetes cluster: Zero sometimes reports an unhealthy connection

Hi,
I run 3 Dgraph Alphas and 3 Dgraph Zeros in Kubernetes, and the cluster generally runs fine.
But the system does not feel very stable; for example, sometimes a Zero cannot connect to its peers.

zero-0 logs

 W0428 02:13:56.930532      10 node.go:387] Unable to send message to peer: 0x2. Error: Unhealthy connection
W0428 02:14:45.663902      10 raft.go:697] Raft.Ready took too long to process: 103ms. Most likely due to slow disk: 103ms. Num entries: 1. MustSync: true
I0428 02:15:20.591981      10 oracle.go:106] Purged below ts:200008, len(o.commits):0, len(o.rowCommit):1000
I0428 02:52:55.635546      10 oracle.go:106] Purged below ts:200056, len(o.commits):0, len(o.rowCommit):10363
I0428 02:55:29.643552      10 oracle.go:106] Purged below ts:200070, len(o.commits):0, len(o.rowCommit):41653
I0428 03:03:00.669373      10 oracle.go:106] Purged below ts:200078, len(o.commits):0, len(o.rowCommit):103605
I0428 03:10:20.693634      10 oracle.go:106] Purged below ts:200093, len(o.commits):0, len(o.rowCommit):1217
W0428 03:15:06.487733      10 raft.go:697] Raft.Ready took too long to process: 132ms. Most likely due to slow disk: 68ms. Num entries: 0. MustSync: false
W0428 03:32:20.866029      10 raft.go:697] Raft.Ready took too long to process: 207ms. Most likely due to slow disk: 207ms. Num entries: 1. MustSync: true
W0428 03:32:21.270944      10 raft.go:697] Raft.Ready took too long to process: 404ms. Most likely due to slow disk: 404ms. Num entries: 0. MustSync: false 

zero-1 logs

 I0428 02:13:54.580867      10 node.go:85] 2 [logterm: 2, index: 5958] sent MsgPreVote request to 1 at term 2
I0428 02:13:54.580985      10 node.go:85] 2 [logterm: 2, index: 5958] sent MsgPreVote request to 3 at term 2
W0428 02:13:56.343446      10 node.go:636] [0x2] Read index context timed out
I0428 02:13:56.938314      10 node.go:85] 2 [term: 2] received a MsgHeartbeat message with higher term from 1 [term: 3]
I0428 02:13:56.938506      10 node.go:85] 2 became follower at term 3
I0428 02:13:56.938551      10 node.go:85] raft.node: 2 elected leader 1 at term 3
W0428 02:13:58.343708      10 node.go:636] [0x2] Read index context timed out
W0428 02:14:42.607286      10 raft.go:697] Raft.Ready took too long to process: 136ms. Most likely due to slow disk: 136ms. Num entries: 1. MustSync: true
W0428 02:14:45.648886      10 raft.go:697] Raft.Ready took too long to process: 102ms. Most likely due to slow disk: 102ms. Num entries: 1. MustSync: true
I0428 02:15:20.592201      10 oracle.go:106] Purged below ts:200008, len(o.commits):0, len(o.rowCommit):0
I0428 02:52:55.607153      10 oracle.go:106] Purged below ts:200056, len(o.commits):0, len(o.rowCommit):0
I0428 02:55:29.618570      10 oracle.go:106] Purged below ts:200070, len(o.commits):0, len(o.rowCommit):0
I0428 03:03:00.647417      10 oracle.go:106] Purged below ts:200078, len(o.commits):0, len(o.rowCommit):0
I0428 03:10:20.661362      10 oracle.go:106] Purged below ts:200093, len(o.commits):0, len(o.rowCommit):0
W0428 03:15:06.514159      10 raft.go:697] Raft.Ready took too long to process: 134ms. Most likely due to slow disk: 86ms. Num entries: 0. MustSync: false
W0428 03:32:20.925135      10 raft.go:697] Raft.Ready took too long to process: 233ms. Most likely due to slow disk: 233ms. Num entries: 1. MustSync: true
W0428 03:32:21.303296      10 raft.go:697] Raft.Ready took too long to process: 378ms. Most likely due to slow disk: 378ms. Num entries: 0. MustSync: false 

zero-2 logs

 I0428 02:13:42.445888      10 oracle.go:106] Purged below ts:195444, len(o.commits):75, len(o.rowCommit):0
I0428 02:13:42.445958      10 oracle.go:106] Purged below ts:195640, len(o.commits):0, len(o.rowCommit):0
W0428 02:13:42.471062      10 raft.go:697] Raft.Ready took too long to process: 397ms. Most likely due to slow disk: 112ms. Num entries: 0. MustSync: true
I0428 02:13:43.069597      10 node.go:85] 3 no leader at term 2; dropping index reading msg
W0428 02:13:45.069733      10 node.go:636] [0x3] Read index context timed out
I0428 02:13:45.069827      10 node.go:85] 3 no leader at term 2; dropping index reading msg
I0428 02:13:45.109618      10 node.go:85] 3 is starting a new election at term 2
I0428 02:13:45.109659      10 node.go:85] 3 became pre-candidate at term 2
I0428 02:13:45.109674      10 node.go:85] 3 received MsgPreVoteResp from 3 at term 2
I0428 02:13:45.109817      10 node.go:85] 3 [logterm: 2, index: 5958] sent MsgPreVote request to 1 at term 2
I0428 02:13:45.110000      10 node.go:85] 3 [logterm: 2, index: 5958] sent MsgPreVote request to 2 at term 2 
 W0428 02:13:54.656990      10 node.go:387] Unable to send message to peer: 0x2. Error: Unhealthy connection
W0428 02:14:45.759985      10 raft.go:697] Raft.Ready took too long to process: 137ms. Most likely due to slow disk: 137ms. Num entries: 1. MustSync: true
I0428 02:15:20.668753      10 oracle.go:106] Purged below ts:200008, len(o.commits):0, len(o.rowCommit):0
I0428 02:52:55.662603      10 oracle.go:106] Purged below ts:200056, len(o.commits):0, len(o.rowCommit):0
I0428 02:55:29.655839      10 oracle.go:106] Purged below ts:200070, len(o.commits):0, len(o.rowCommit):0
I0428 03:03:00.664557      10 oracle.go:106] Purged below ts:200078, len(o.commits):0, len(o.rowCommit):0
I0428 03:10:20.641838      10 oracle.go:106] Purged below ts:200093, len(o.commits):0, len(o.rowCommit):0
W0428 03:15:06.504088      10 raft.go:697] Raft.Ready took too long to process: 101ms. Most likely due to slow disk: 52ms. Num entries: 0. MustSync: false
W0428 03:32:20.846868      10 raft.go:697] Raft.Ready took too long to process: 207ms. Most likely due to slow disk: 207ms. Num entries: 1. MustSync: true
W0428 03:32:21.251500      10 raft.go:697] Raft.Ready took too long to process: 403ms. Most likely due to slow disk: 403ms. Num entries: 0. MustSync: false
W0428 03:32:29.699624      10 raft.go:697] Raft.Ready took too long to process: 117ms. Most likely due to slow disk: 117ms. Num entries: 0. MustSync: false 

Please help me analyze what caused this problem.

Okay,

Share some stats (e.g. are you using SSDs?), your yaml, the commands you ran, the Dgraph version, and so on. From these logs alone I can't tell anything. What steps did you take before the connection became unhealthy?

Sorry for replying only now; I was away for the holidays.
It is not running on SSDs. The Dgraph version is v1.0.14.
yaml:

apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero-public
  labels:
    app: dgraph-zero
spec:
  type: LoadBalancer
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  - port: 6080
    targetPort: 6080
    name: zero-http
  selector:
    app: dgraph-zero
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha-public
  labels:
    app: dgraph-alpha
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
    name: alpha-http
  - port: 9080
    targetPort: 9080
    name: alpha-grpc
  selector:
    app: dgraph-alpha
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-ratel-public
  labels:
    app: dgraph-ratel
spec:
  type: LoadBalancer
  ports:
  - port: 8000
    targetPort: 8000
    name: ratel-http
  selector:
    app: dgraph-ratel
---
# This is a headless service which is necessary for discovery for a dgraph-zero StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero
  labels:
    app: dgraph-zero
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: zero-grpc
  clusterIP: None
  selector:
    app: dgraph-zero
---
# This is a headless service which is necessary for discovery for a dgraph-alpha StatefulSet.
# https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#creating-a-statefulset
apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha
  labels:
    app: dgraph-alpha
spec:
  ports:
  - port: 7080
    targetPort: 7080
    name: alpha-grpc-int
  clusterIP: None
  selector:
    app: dgraph-alpha
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph-zero
spec:
  serviceName: "dgraph-zero"
  replicas: 3
  selector:
    matchLabels:
      app: dgraph-zero
  template:
    metadata:
      labels:
        app: dgraph-zero
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/path: '/debug/prometheus_metrics'
        prometheus.io/port: '6080'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-zero
              topologyKey: kubernetes.io/hostname
      containers:
      - name: zero
        image: dgraph/dgraph:v1.0.14
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5080
          name: zero-grpc
        - containerPort: 6080
          name: zero-http
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        env:
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - bash
          - "-c"
          - |
            set -ex
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            idx=$(($ordinal + 1))
            if [[ $ordinal -eq 0 ]]; then
              dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 3
            else
              dgraph zero --my=$(hostname -f):5080 --peer dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 3
            fi
      terminationGracePeriodSeconds: 30
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 5Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph-alpha
spec:
  serviceName: "dgraph-alpha"
  replicas: 3
  selector:
    matchLabels:
      app: dgraph-alpha
  template:
    metadata:
      labels:
        app: dgraph-alpha
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/path: '/debug/prometheus_metrics'
        prometheus.io/port: '8080'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-alpha
              topologyKey: kubernetes.io/hostname
      initContainers:
        - name: init-alpha
          image: dgraph/dgraph:master
          command:
            - bash
            - "-c"
            - |
              echo "Write to /dgraph/doneinit when ready."
              until [ -f /dgraph/doneinit ]; do sleep 2; done
          volumeMounts:
            - name: datadir
              mountPath: /dgraph
      containers:
      - name: alpha
        image: dgraph/dgraph:v1.0.14
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: "20"
            memory: 64Gi
          requests:
            cpu: "10"
            memory: 32Gi
        ports:
        - containerPort: 7080
          name: alpha-grpc-int
        - containerPort: 8080
          name: alpha-http
        - containerPort: 9080
          name: alpha-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        env:
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph alpha --my=$(hostname -f):7080 --lru_mb 21840 --query_edge_limit 3000000  --zero ${DGRAPH_ZERO_PUBLIC_PORT_5080_TCP_ADDR}:5080
      terminationGracePeriodSeconds: 30
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dgraph-ratel
  labels:
    app: dgraph-ratel
spec:
  selector:
    matchLabels:
      app: dgraph-ratel
  template:
    metadata:
      labels:
        app: dgraph-ratel
    spec:
      containers:
      - name: ratel
        image: dgraph/dgraph:v1.0.14
        ports:
        - containerPort: 8000
        command:
          - dgraph-ratel

Commands made:

kubectl create -f dgraph-ha.yaml
#Bulk Loader
kubectl cp /home/user/rdfData/ dgraph-zero-0:/dgraph/
kubectl exec -it dgraph-zero-0 sh
dgraph bulk -r rdfDoc/ -s allSchema.schema --map_shards=1 --reduce_shards=1 --zero=localhost:5080
kubectl cp dgraph-zero-0:/dgraph/out/ /home/user/bulkData/
# Initializing the Alphas
kubectl cp /home/user/bulkData/0/p/ dgraph-alpha-0:/dgraph/ -c init-alpha
kubectl exec dgraph-alpha-0 -c init-alpha touch /dgraph/doneinit
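
As a side note on the initialization step: with --replicas 3, every Alpha replica in the group is normally seeded with the same p directory before it is released from its init container, not only dgraph-alpha-0. A minimal sketch of that (the loop simply covers the three pods from the StatefulSet above):

# Hypothetical loop: ship the shard-0 bulk output to every Alpha replica,
# then release each one from its init container.
for i in 0 1 2; do
  kubectl cp /home/user/bulkData/0/p/ dgraph-alpha-$i:/dgraph/ -c init-alpha
  kubectl exec dgraph-alpha-$i -c init-alpha -- touch /dgraph/doneinit
done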

The K8s cluster runs normally at first, but after a while the error occurs (Zero reports an unhealthy connection).
Restarting the Zero pod fixes it.
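
When it happens again, it may help to capture each Zero's view of the cluster before restarting anything. A minimal check, assuming kubectl and curl are available on your workstation (Zero serves its membership state on the HTTP port, 6080):

# Forward Zero's HTTP port and dump its membership state (leader, members).
kubectl port-forward pod/dgraph-zero-0 6080:6080 &
sleep 2
curl -s localhost:6080/state
kill %1
# Repeat for dgraph-zero-1 and dgraph-zero-2 and compare which node each one reports as leader.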

Why are you using master here and dgraph/dgraph:v1.0.14 in the others? They're not compatible at all.

I previously used the master version of Dgraph for testing and forgot to change this setting when upgrading the version.
Thank you for pointing out the problem. I will upgrade the pods and test whether the problem still occurs.

btw, master isn’t updated often.

I have upgraded the initContainers to v1.0.14, but Zero still reports an unhealthy connection when I try to import data with the Live Loader.

dgraph live -r rdf.rdf -d 192.168.31.249:9080 -z 192.168.31.248:5080
Dgraph version   : v1.0.14
Commit SHA-1     : 26cb2f94
Commit timestamp : 2019-04-12 13:21:56 -0700
Branch           : HEAD
Go version       : go1.11.5

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2018 Dgraph Labs, Inc.
Creating temp client directory at /tmp/x308833140
badger 2019/05/06 10:17:49 INFO: All 0 tables opened in 0s
2019/05/06 10:17:59 Unable to connect to zero, Is it running at 192.168.31.248:5080? error: context deadline exceeded
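
Before digging further it may be worth ruling out plain reachability from the loader machine to those LoadBalancer IPs. One alternative, assuming the loader host has kubectl access to the cluster, is to tunnel the public services locally and point the loader at them:

# Tunnel the public services defined in the yaml above and run the loader against localhost.
kubectl port-forward svc/dgraph-alpha-public 9080:9080 &
kubectl port-forward svc/dgraph-zero-public 5080:5080 &
sleep 2
dgraph live -r rdf.rdf -d localhost:9080 -z localhost:5080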

I guess it is a problem with the k8s NFS (Network File System) storage, because the Zero log reports:

Raft.Ready took too long to process: 101ms. Most likely due to slow disk: 52ms.
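
If slow storage is the suspect, one rough way to confirm it is to measure synchronous write latency directly on the Zero volume. A sketch, assuming GNU dd is present in the dgraph image (the test file name is arbitrary):

# Write ~4 MB in 4 KB synced chunks on the /dgraph volume and note the reported speed;
# Raft fsyncs every entry, so high per-write latency here matches the "slow disk" warnings.
kubectl exec dgraph-zero-0 -- dd if=/dev/zero of=/dgraph/ddtest bs=4k count=1000 oflag=dsync
kubectl exec dgraph-zero-0 -- rm /dgraph/ddtest
# The durable fix is usually an SSD-backed storageClassName on the volumeClaimTemplates
# instead of NFS.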

Try this: How to: Live Load distributed with Kubernetes, Docker or Binary,
or try my script for k8s: GitHub - MichelDiz/Dgraph-Bulk-Script: Just a simple Sh to use Dgraph's Bulk Loader (in that case using the bulk loader instead of the live loader).
