DGraph deployment via helm not working anymore


(aurel) #1

task:

I am trying to deploy Dgraph (one Zero and one Alpha) to Kubernetes (Google Cloud) via a Helm chart.

problem:
It used to work, but now it no longer does, and I do not see what is different. The specific error is best described in the logs below; essentially it looks like a gRPC / connection problem. It first appeared after I set the gcloud cluster size (number of nodes) to 0 and, some days later, back to 4, but I find it hard to believe that this is the cause. I am not very familiar with these kinds of problems, and the person who set the whole thing up is no longer available. I'm posting here because I think it might be a Dgraph problem, but I am not certain.

What I have tried to solve the problem:
I deleted the release via helm (helm delete --purge dgraph) and recreated it (helm install --wait --name dgraph ./charts/dgraph/). I also tried setting the gcloud cluster size to 0 and back to 4. Neither made a difference. I went over the configuration and it seems fine to me; I compared it to compose files I found in various places, including the dgraph repo.

Below you find the logs and the chart specification.

Any help is really appreciated!

Thanks!

Aurel

zero log:
I1204 21:27:51.539624       1 run.go:90] Setting up grpc listener at: 0.0.0.0:5080
I1204 21:27:51.539833       1 run.go:90] Setting up http listener at: 0.0.0.0:6080
badger2018/12/04 21:27:51 INFO: Replaying file id: 0 at offset: 1544608
badger2018/12/04 21:27:51 INFO: Replay took: 15.256µs
I1204 21:27:51.888823       1 node.go:152] Setting raft.Config to: &{ID:1 peers:[] ElectionTick:100 HeartbeatTick:1 Storage:0xc00015de10 Applied:0 MaxSizePerMsg:1048576 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1d112c0}
I1204 21:27:51.892352       1 node.go:282] Found hardstate: {Term:27 Vote:1 Commit:6525 XXX_unrecognized:[]}
I1204 21:27:51.897997       1 node.go:291] Group 0 found 6526 entries
I1204 21:27:51.898218       1 raft.go:371] Restarting node for dgraphzero
I1204 21:27:51.898497       1 node.go:84] 1 became follower at term 27
I1204 21:27:51.898744       1 node.go:84] newRaft 1 [peers: [], term: 27, commit: 6525, applied: 0, lastindex: 6525, lastterm: 27]
I1204 21:27:51.902606       1 run.go:229] Running Dgraph Zero...
I1204 21:27:51.919236       1 node.go:174] Setting conf state to nodes:1
I1204 21:27:51.919599       1 raft.go:547] Done applying conf change at 1
E1204 21:27:51.921113       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.11.6:7080: connect: connection refused"
I1204 21:27:51.921902       1 pool.go:118] CONNECTED to dgraph-0.dgraph.default.svc.cluster.local:7080
E1204 21:27:51.921301       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.11.6:7080: connect: connection refused"
I1204 21:27:51.923212       1 raft.go:272] Removing tablet for attr: [value_date], gid: [1]
E1204 21:27:51.923984       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924075       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924149       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924210       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924265       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924308       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924366       1 raft.go:552] While applying proposal: Invalid address
...
E1204 21:27:52.207869       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:52.207873       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:52.205514       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.11.6:9080: connect: connection refused"
I1204 21:27:52.207897       1 pool.go:118] CONNECTED to dgraph-0.dgraph.default.svc.cluster.local:9080
E1204 21:27:52.205566       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.11.6:9080: connect: connection refused"
I1204 21:27:52.380095       1 zero.go:375] Got connection request: id:6062 addr:"dgraph-0.dgraph.default.svc.cluster.local:7080"
I1204 21:27:52.380886       1 zero.go:484] Connected: id:6062 addr:"dgraph-0.dgraph.default.svc.cluster.local:7080"
I1204 21:27:52.392898       1 node.go:84] 1 no leader at term 27; dropping index reading msg
I1204 21:27:54.480961       1 node.go:84] 1 is starting a new election at term 27
I1204 21:27:54.481005       1 node.go:84] 1 became pre-candidate at term 27
I1204 21:27:54.481017       1 node.go:84] 1 received MsgPreVoteResp from 1 at term 27
I1204 21:27:54.481102       1 node.go:84] 1 became candidate at term 28
I1204 21:27:54.481112       1 node.go:84] 1 received MsgVoteResp from 1 at term 28
I1204 21:27:54.481218       1 node.go:84] 1 became leader at term 28
I1204 21:27:54.481232       1 node.go:84] raft.node: 1 elected leader 1 at term 28
E1204 21:27:54.483865       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:54.483928       1 zero.go:549] Error while applying proposal in update stream Invalid address
E1204 21:27:54.716975       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:54.717231       1 zero.go:549] Error while applying proposal in update stream Invalid address
W1204 21:27:55.393083       1 node.go:551] [1] Read index context timed out
E1204 21:28:02.208789       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unimplemented desc = unknown service pb.Raft
E1204 21:28:02.209086       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unimplemented desc = unknown service pb.Raft
E1204 21:28:21.892166       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:28:51.893023       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:29:21.892887       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:29:51.892775       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:30:21.892814       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:30:51.892810       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:31:21.892858       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:31:51.892803       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:32:21.892885       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:32:51.892669       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:32:52.417618       1 raft.go:552] While applying proposal: Invalid address
E1204 21:32:52.417962       1 zero.go:549] Error while applying proposal in update stream Invalid address
E1204 21:33:21.892766       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:33:51.892865       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:34:21.892804       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:34:51.892788       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:35:21.892866       1 oracle.go:425] No healthy connection found to leader of group 2
I1204 21:35:51.892321       1 tablet.go:189]

Groups sorted by size: [{gid:2 size:0} {gid:1 size:80673}]

I1204 21:35:51.892359       1 tablet.go:194] size_diff 80673
I1204 21:35:51.892391       1 tablet.go:83] Going to move predicate: [_predicate_], size: [32 kB] from group 1 to 2
E1204 21:35:51.893181       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:35:51.917329       1 tablet.go:231] Got error during move: While calling MovePredicate: rpc error: code = Unknown desc = Group id doesn't match, received request for 1, my gid: 2
E1204 21:35:51.919971       1 tablet.go:70] Error while trying to move predicate _predicate_ from 1 to 2: While calling MovePredicate: rpc error: code = Unknown desc = Group id doesn't match, received request for 1, my gid: 2

E1204 21:36:21.892883       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:36:51.892766       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:37:21.892853       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:37:51.892927       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:37:52.420512       1 raft.go:552] While applying proposal: Invalid address
E1204 21:37:52.420817       1 zero.go:549] Error while applying proposal in update stream Invalid address
E1204 21:38:21.892801       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:38:51.892913       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:39:21.892727       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:39:51.892272       1 oracle.go:425] No healthy connection found to leader of group 2

alpha log:
++ hostname -f
+ dgraph alpha --my=dgraph-0.dgraph.default.svc.cluster.local:7080 --lru_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080
I1204 21:27:52.274206       1 init.go:80]

Dgraph version   : v1.0.10
Commit SHA-1     : 8b801bd7
Commit timestamp : 2018-11-05 17:52:33 -0800
Branch           : HEAD

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit https://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed under Apache 2.0. Copyright 2015-2018 Dgraph Labs, Inc.


I1204 21:27:52.295997       1 server.go:115] Setting Badger table load option: mmap
I1204 21:27:52.296163       1 server.go:127] Setting Badger value log load option: mmap
I1204 21:27:52.296229       1 server.go:155] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:true TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:65500 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:3 managedTxns:false DoNotCompact:false maxBatchCount:0 maxBatchSize:0 ReadOnly:false Truncate:true}
badger2018/12/04 21:27:52 INFO: Replaying file id: 0 at offset: 12977
badger2018/12/04 21:27:52 INFO: Replay took: 10.567µs
I1204 21:27:52.322077       1 server.go:115] Setting Badger table load option: mmap
I1204 21:27:52.322103       1 server.go:127] Setting Badger value log load option: mmap
I1204 21:27:52.322108       1 server.go:169] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:true TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:3 managedTxns:false DoNotCompact:false maxBatchCount:0 maxBatchSize:0 ReadOnly:false Truncate:true}
badger2018/12/04 21:27:52 INFO: Replaying file id: 0 at offset: 0
badger2018/12/04 21:27:52 INFO: Replay took: 18.232µs
I1204 21:27:52.376726       1 run.go:338] gRPC server started.  Listening on port 9080
I1204 21:27:52.376848       1 run.go:339] HTTP server started.  Listening on port 8080
I1204 21:27:52.377184       1 groups.go:92] Current Raft Id: 6062
I1204 21:27:52.377898       1 worker.go:80] Worker listening at address: [::]:7080
I1204 21:27:52.379669       1 pool.go:118] CONNECTED to dgraph-0.dgraph.default.svc.cluster.local:5080
I1204 21:27:52.381207       1 groups.go:119] Connected to group zero. Assigned group: 0
E1204 21:27:52.382305       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unimplemented desc = unknown service pb.Raft
I1204 21:27:52.382655       1 pool.go:118] CONNECTED to dgraph-0.dgraph.default.svc.cluster.local:9080
I1204 21:27:52.390886       1 draft.go:74] Node ID: 6062 with GroupID: 2
I1204 21:27:52.391199       1 node.go:152] Setting raft.Config to: &{ID:6062 peers:[] ElectionTick:100 HeartbeatTick:1 Storage:0xc00008fe10 Applied:22 MaxSizePerMsg:1048576 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1d112c0}
I1204 21:27:52.391360       1 node.go:271] Found Snapshot.Metadata: {ConfState:{Nodes:[6062] XXX_unrecognized:[]} Index:22 Term:11 XXX_unrecognized:[]}
I1204 21:27:52.391445       1 node.go:282] Found hardstate: {Term:12 Vote:6062 Commit:25 XXX_unrecognized:[]}
I1204 21:27:52.391534       1 node.go:291] Group 2 found 4 entries
I1204 21:27:52.391574       1 draft.go:1047] Restarting node for group: 2
I1204 21:27:52.391638       1 node.go:174] Setting conf state to nodes:6062
I1204 21:27:52.391909       1 node.go:84] 17ae became follower at term 12
I1204 21:27:52.392015       1 node.go:84] newRaft 17ae [peers: [17ae], term: 12, commit: 25, applied: 22, lastindex: 25, lastterm: 12]
I1204 21:27:52.392285       1 groups.go:519] Got address of a Zero server: dgraph-0.dgraph.default.svc.cluster.local:5080
I1204 21:27:52.394939       1 draft.go:313] Skipping snapshot at 22, because found one at 22
I1204 21:27:54.712797       1 node.go:84] 17ae is starting a new election at term 12
I1204 21:27:54.713220       1 node.go:84] 17ae became pre-candidate at term 12
I1204 21:27:54.713303       1 node.go:84] 17ae received MsgPreVoteResp from 17ae at term 12
I1204 21:27:54.713474       1 node.go:84] 17ae became candidate at term 13
I1204 21:27:54.713564       1 node.go:84] 17ae received MsgVoteResp from 17ae at term 13
I1204 21:27:54.713821       1 node.go:84] 17ae became leader at term 13
I1204 21:27:54.713954       1 node.go:84] raft.node: 17ae elected leader 17ae at term 13
I1204 21:27:55.392399       1 groups.go:718] Leader idx=6062 of group=2 is connecting to Zero for txn updates
W1204 21:27:55.392803       1 groups.go:723] WARNING: We don't have address of any dgraphzero leader.
I1204 21:27:56.393134       1 groups.go:718] Leader idx=6062 of group=2 is connecting to Zero for txn updates
E1204 21:27:56.397090       1 draft.go:467] Lastcommit 10337 > current 10002. This would cause some commits to be lost.
E1204 21:28:02.383404       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unimplemented desc = unknown service pb.Raft
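
Both logs repeat a small number of messages many times, which makes the distinct errors hard to spot. A minimal sketch for tallying them, assuming the glog-style prefix seen above (severity letter, MMDD, time, thread id, file:line); the function name `tally` and the regular expression are illustrative and not part of Dgraph:

```python
import re
from collections import Counter

# Assumed prefix shape, e.g. "E1204 21:27:51.921113       1 pool.go:178] message"
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] ?(?P<msg>.*)$"
)


def tally(lines, severities="WE"):
    """Count lines per (severity, file:line) and keep the first message seen for each."""
    counts = Counter()
    samples = {}
    for line in lines:
        m = GLOG_RE.match(line.strip())
        if not m or m["sev"] not in severities:
            continue  # skips badger lines, banners and unwanted severities
        key = (m["sev"], m["src"])
        counts[key] += 1
        samples.setdefault(key, m["msg"])
    return counts, samples


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python tally.py < zero.log
    counts, samples = tally(sys.stdin)
    for (sev, src), n in counts.most_common():
        print(f"{n:5d} {sev} {src} {samples[(sev, src)][:80]}")
```

Grouping by source location rather than by full message keeps lines that differ only in an address or port under one bucket.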

The chart is specified as follows:

statefulset.yml:
# This StatefulSet runs 1 pod with one Zero, one Server
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph
spec:
  serviceName: "dgraph"
  replicas: 1
  selector:
      matchLabels:
        app: dgraph
  template:
    metadata:
      labels:
        app: dgraph
    spec:
      {{- if .Values.server.initData.image }}
      initContainers:
      - name: init-schema
        image: {{ .Values.server.initData.image }}
        command: ['curl', '-X', 'POST', '-H', 'X-Dgraph-CommitNow:true', '--data-binary', '@graph/schema.txt', '{{ .Values.service.name }}.default.svc.cluster.local/alter']
      - name: init-data
        image: {{ .Values.server.initData.image }}
        command: ['curl', '-X', 'POST', '-H', 'X-Dgraph-CommitNow:true', '--data-binary', '@graph/data.txt', '{{ .Values.service.name }}.default.svc.cluster.local/mutate']
      {{- end }}
      containers:
      - name: zero
        image: {{ template "dgraph.image" . }}
        imagePullPolicy: {{ .Values.image.pullPolicy | quote }}
        ports:
        - containerPort: {{ .Values.service.ports.zeroGrpc }}
          name: zero-grpc
        - containerPort: {{ .Values.service.ports.zeroHttp }}
          name: zero-http
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph zero --my=$(hostname -f):{{ .Values.service.ports.zeroGrpc }}
      - name: server
        image: {{ template "dgraph.image" . }}
        imagePullPolicy: {{ .Values.image.pullPolicy | quote }}
        ports:
        - containerPort: {{ .Values.service.ports.serverHttp }}
          name: server-http
        - containerPort: {{ .Values.service.ports.serverGrpc }}
          name: server-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph alpha --my=$(hostname -f):{{ .Values.server.port }} --lru_mb {{ .Values.server.lruSizeMB }} --zero {{ .Values.server.zeroDns }}:{{ .Values.service.ports.zeroGrpc }}
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: {{ .Values.storage.size }}

values.yml:
image:
  registry: docker.io
  repository: dgraph/dgraph
  tag: latest
  pullPolicy: Always

service:
  name: dgraph-service
  ports:
    zeroGrpc: 5080
    zeroHttp: 6080
    serverHttp: 8080
    serverGrpc: 9080

server:
  # Estimate of the LRU cache size in MB. It’s recommended to set lru_mb to one-third the available RAM.
  lruSizeMB: 2048
  zeroDns: dgraph-0.dgraph.default.svc.cluster.local
  port: 7080
  initData:
    image: ""
    #image: "registry.gitlab.com/infix/taxgorillaprototype/backend:latest"

storage:
  size: 5Gi

(aurel) #2

I solved the problem. It was indeed a dgraph problem. I overlooked the fact that a persistentVolumeClaim was used for storage, so deleting and reinstalling the containers didn't solve the issue. I wiped the storage volume (i.e. I deleted the p, w and zw folders) and voilà, it all works again!
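
For anyone hitting the same thing, the wipe can be sketched roughly as below. This is only an illustration under assumptions: it would have to run somewhere the volume is mounted (the chart mounts it at /dgraph), with Dgraph stopped, and it throws away all stored data. The function name `wipe_dgraph_state` and the `dry_run` flag are made up for this example.

```python
import shutil
from pathlib import Path

# Folder names from the post: p (postings), w (Alpha write-ahead log), zw (Zero write-ahead log).
STATE_DIRS = ("p", "w", "zw")


def wipe_dgraph_state(data_dir, dry_run=True):
    """Remove Dgraph's state folders under data_dir and return the paths affected.

    With dry_run=True (the default) nothing is deleted; the paths are only reported.
    """
    affected = []
    for name in STATE_DIRS:
        path = Path(data_dir) / name
        if path.is_dir():
            affected.append(path)
            if not dry_run:
                shutil.rmtree(path)  # irreversible: all data in this folder is lost
    return affected


if __name__ == "__main__":
    # "/dgraph" is the mountPath from the chart above; prints what would be removed.
    for p in wipe_dgraph_state("/dgraph", dry_run=True):
        print("would remove", p)
```

Defaulting to a dry run is a deliberate choice here, since the deletion cannot be undone.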


(Daniel Mai) #3

Glad you got it working @aurel. You shared a helm chart in your post. Is this something that can be contributed to the community for this open issue about adding a Dgraph helm chart?


(aurel) #4

Sure! However, I'd have to look into that a bit; the persistent volume has to exist, for example.
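
On the persistent volume point: Kubernetes names the claims it creates from a StatefulSet's volumeClaimTemplates as template name, StatefulSet name and pod ordinal joined by dashes, and it does not remove them when the StatefulSet itself is deleted. That fits what happened above, where reinstalling the release left the old data in place. A small sketch of the naming rule (the helper name is hypothetical):

```python
def statefulset_pvc_names(claim_template, statefulset, replicas):
    """Return the PVC names Kubernetes derives for each replica of a StatefulSet."""
    return [f"{claim_template}-{statefulset}-{i}" for i in range(replicas)]


if __name__ == "__main__":
    # Values taken from the chart in this thread: template "datadir", StatefulSet "dgraph", 1 replica.
    print(statefulset_pvc_names("datadir", "dgraph", 1))
```

For the chart in this thread, that gives a single claim, which is the one to inspect or delete when a fresh start is wanted.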