DGraph deployment via helm not working anymore

aurel · December 4, 2018, 9:50pm

task:

I am trying to deploy Dgraph (one Zero and one Alpha) to Kubernetes (Google Cloud) via a Helm chart.

problem:
It used to work, but now it no longer does, and I do not see what has changed. The specific error is best described in the logs below; essentially it looks like a gRPC / connection problem. It first appeared after I set the gcloud cluster size (number of nodes) to 0 and, some days later, back to 4, but I find it hard to believe that this is the cause. I am not very familiar with these kinds of problems, and the person who set the whole thing up is no longer available. I’m posting here because I think it might be a Dgraph problem, but I am not certain.

What I have tried to solve the problem:
I deleted the release via Helm (helm delete --purge dgraph) and recreated it (helm install --wait --name dgraph ./charts/dgraph/). I also tried setting the gcloud cluster size to 0 and back to 4. Neither made a difference. I went over the configuration and it seems fine to me; I compared it to compose files I found in various places, including the Dgraph repo.

Below you will find the logs and the chart specification.

Any help is really appreciated!

Thanks!

Aurel

zero log:
I1204 21:27:51.539624       1 run.go:90] Setting up grpc listener at: 0.0.0.0:5080
I1204 21:27:51.539833       1 run.go:90] Setting up http listener at: 0.0.0.0:6080
badger2018/12/04 21:27:51 INFO: Replaying file id: 0 at offset: 1544608
badger2018/12/04 21:27:51 INFO: Replay took: 15.256µs
I1204 21:27:51.888823       1 node.go:152] Setting raft.Config to: &{ID:1 peers:[] ElectionTick:100 HeartbeatTick:1 Storage:0xc00015de10 Applied:0 MaxSizePerMsg:1048576 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1d112c0}
I1204 21:27:51.892352       1 node.go:282] Found hardstate: {Term:27 Vote:1 Commit:6525 XXX_unrecognized:[]}
I1204 21:27:51.897997       1 node.go:291] Group 0 found 6526 entries
I1204 21:27:51.898218       1 raft.go:371] Restarting node for dgraphzero
I1204 21:27:51.898497       1 node.go:84] 1 became follower at term 27
I1204 21:27:51.898744       1 node.go:84] newRaft 1 [peers: [], term: 27, commit: 6525, applied: 0, lastindex: 6525, lastterm: 27]
I1204 21:27:51.902606       1 run.go:229] Running Dgraph Zero...
I1204 21:27:51.919236       1 node.go:174] Setting conf state to nodes:1
I1204 21:27:51.919599       1 raft.go:547] Done applying conf change at 1
E1204 21:27:51.921113       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.11.6:7080: connect: connection refused"
I1204 21:27:51.921902       1 pool.go:118] CONNECTED to dgraph-0.dgraph.default.svc.cluster.local:7080
E1204 21:27:51.921301       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:7080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.11.6:7080: connect: connection refused"
I1204 21:27:51.923212       1 raft.go:272] Removing tablet for attr: [value_date], gid: [1]
E1204 21:27:51.923984       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924075       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924149       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924210       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924265       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924308       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:51.924366       1 raft.go:552] While applying proposal: Invalid address
...
E1204 21:27:52.207869       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:52.207873       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:52.205514       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.11.6:9080: connect: connection refused"
I1204 21:27:52.207897       1 pool.go:118] CONNECTED to dgraph-0.dgraph.default.svc.cluster.local:9080
E1204 21:27:52.205566       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.0.11.6:9080: connect: connection refused"
I1204 21:27:52.380095       1 zero.go:375] Got connection request: id:6062 addr:"dgraph-0.dgraph.default.svc.cluster.local:7080"
I1204 21:27:52.380886       1 zero.go:484] Connected: id:6062 addr:"dgraph-0.dgraph.default.svc.cluster.local:7080"
I1204 21:27:52.392898       1 node.go:84] 1 no leader at term 27; dropping index reading msg
I1204 21:27:54.480961       1 node.go:84] 1 is starting a new election at term 27
I1204 21:27:54.481005       1 node.go:84] 1 became pre-candidate at term 27
I1204 21:27:54.481017       1 node.go:84] 1 received MsgPreVoteResp from 1 at term 27
I1204 21:27:54.481102       1 node.go:84] 1 became candidate at term 28
I1204 21:27:54.481112       1 node.go:84] 1 received MsgVoteResp from 1 at term 28
I1204 21:27:54.481218       1 node.go:84] 1 became leader at term 28
I1204 21:27:54.481232       1 node.go:84] raft.node: 1 elected leader 1 at term 28
E1204 21:27:54.483865       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:54.483928       1 zero.go:549] Error while applying proposal in update stream Invalid address
E1204 21:27:54.716975       1 raft.go:552] While applying proposal: Invalid address
E1204 21:27:54.717231       1 zero.go:549] Error while applying proposal in update stream Invalid address
W1204 21:27:55.393083       1 node.go:551] [1] Read index context timed out
E1204 21:28:02.208789       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unimplemented desc = unknown service pb.Raft
E1204 21:28:02.209086       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unimplemented desc = unknown service pb.Raft
E1204 21:28:21.892166       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:28:51.893023       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:29:21.892887       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:29:51.892775       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:30:21.892814       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:30:51.892810       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:31:21.892858       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:31:51.892803       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:32:21.892885       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:32:51.892669       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:32:52.417618       1 raft.go:552] While applying proposal: Invalid address
E1204 21:32:52.417962       1 zero.go:549] Error while applying proposal in update stream Invalid address
E1204 21:33:21.892766       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:33:51.892865       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:34:21.892804       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:34:51.892788       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:35:21.892866       1 oracle.go:425] No healthy connection found to leader of group 2
I1204 21:35:51.892321       1 tablet.go:189]

Groups sorted by size: [{gid:2 size:0} {gid:1 size:80673}]

I1204 21:35:51.892359       1 tablet.go:194] size_diff 80673
I1204 21:35:51.892391       1 tablet.go:83] Going to move predicate: [_predicate_], size: [32 kB] from group 1 to 2
E1204 21:35:51.893181       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:35:51.917329       1 tablet.go:231] Got error during move: While calling MovePredicate: rpc error: code = Unknown desc = Group id doesn't match, received request for 1, my gid: 2
E1204 21:35:51.919971       1 tablet.go:70] Error while trying to move predicate _predicate_ from 1 to 2: While calling MovePredicate: rpc error: code = Unknown desc = Group id doesn't match, received request for 1, my gid: 2

E1204 21:36:21.892883       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:36:51.892766       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:37:21.892853       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:37:51.892927       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:37:52.420512       1 raft.go:552] While applying proposal: Invalid address
E1204 21:37:52.420817       1 zero.go:549] Error while applying proposal in update stream Invalid address
E1204 21:38:21.892801       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:38:51.892913       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:39:21.892727       1 oracle.go:425] No healthy connection found to leader of group 2
E1204 21:39:51.892272       1 oracle.go:425] No healthy connection found to leader of group 2

alpha log:
++ hostname -f
+ dgraph alpha --my=dgraph-0.dgraph.default.svc.cluster.local:7080 --lru_mb 2048 --zero dgraph-0.dgraph.default.svc.cluster.local:5080
I1204 21:27:52.274206       1 init.go:80]

Dgraph version   : v1.0.10
Commit SHA-1     : 8b801bd7
Commit timestamp : 2018-11-05 17:52:33 -0800
Branch           : HEAD

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit http://discuss.hypermode.com.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed under Apache 2.0. Copyright 2015-2018 Dgraph Labs, Inc.


I1204 21:27:52.295997       1 server.go:115] Setting Badger table load option: mmap
I1204 21:27:52.296163       1 server.go:127] Setting Badger value log load option: mmap
I1204 21:27:52.296229       1 server.go:155] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:true TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:65500 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:3 managedTxns:false DoNotCompact:false maxBatchCount:0 maxBatchSize:0 ReadOnly:false Truncate:true}
badger2018/12/04 21:27:52 INFO: Replaying file id: 0 at offset: 12977
badger2018/12/04 21:27:52 INFO: Replay took: 10.567µs
I1204 21:27:52.322077       1 server.go:115] Setting Badger table load option: mmap
I1204 21:27:52.322103       1 server.go:127] Setting Badger value log load option: mmap
I1204 21:27:52.322108       1 server.go:169] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:true TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:3 managedTxns:false DoNotCompact:false maxBatchCount:0 maxBatchSize:0 ReadOnly:false Truncate:true}
badger2018/12/04 21:27:52 INFO: Replaying file id: 0 at offset: 0
badger2018/12/04 21:27:52 INFO: Replay took: 18.232µs
I1204 21:27:52.376726       1 run.go:338] gRPC server started.  Listening on port 9080
I1204 21:27:52.376848       1 run.go:339] HTTP server started.  Listening on port 8080
I1204 21:27:52.377184       1 groups.go:92] Current Raft Id: 6062
I1204 21:27:52.377898       1 worker.go:80] Worker listening at address: [::]:7080
I1204 21:27:52.379669       1 pool.go:118] CONNECTED to dgraph-0.dgraph.default.svc.cluster.local:5080
I1204 21:27:52.381207       1 groups.go:119] Connected to group zero. Assigned group: 0
E1204 21:27:52.382305       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unimplemented desc = unknown service pb.Raft
I1204 21:27:52.382655       1 pool.go:118] CONNECTED to dgraph-0.dgraph.default.svc.cluster.local:9080
I1204 21:27:52.390886       1 draft.go:74] Node ID: 6062 with GroupID: 2
I1204 21:27:52.391199       1 node.go:152] Setting raft.Config to: &{ID:6062 peers:[] ElectionTick:100 HeartbeatTick:1 Storage:0xc00008fe10 Applied:22 MaxSizePerMsg:1048576 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1d112c0}
I1204 21:27:52.391360       1 node.go:271] Found Snapshot.Metadata: {ConfState:{Nodes:[6062] XXX_unrecognized:[]} Index:22 Term:11 XXX_unrecognized:[]}
I1204 21:27:52.391445       1 node.go:282] Found hardstate: {Term:12 Vote:6062 Commit:25 XXX_unrecognized:[]}
I1204 21:27:52.391534       1 node.go:291] Group 2 found 4 entries
I1204 21:27:52.391574       1 draft.go:1047] Restarting node for group: 2
I1204 21:27:52.391638       1 node.go:174] Setting conf state to nodes:6062
I1204 21:27:52.391909       1 node.go:84] 17ae became follower at term 12
I1204 21:27:52.392015       1 node.go:84] newRaft 17ae [peers: [17ae], term: 12, commit: 25, applied: 22, lastindex: 25, lastterm: 12]
I1204 21:27:52.392285       1 groups.go:519] Got address of a Zero server: dgraph-0.dgraph.default.svc.cluster.local:5080
I1204 21:27:52.394939       1 draft.go:313] Skipping snapshot at 22, because found one at 22
I1204 21:27:54.712797       1 node.go:84] 17ae is starting a new election at term 12
I1204 21:27:54.713220       1 node.go:84] 17ae became pre-candidate at term 12
I1204 21:27:54.713303       1 node.go:84] 17ae received MsgPreVoteResp from 17ae at term 12
I1204 21:27:54.713474       1 node.go:84] 17ae became candidate at term 13
I1204 21:27:54.713564       1 node.go:84] 17ae received MsgVoteResp from 17ae at term 13
I1204 21:27:54.713821       1 node.go:84] 17ae became leader at term 13
I1204 21:27:54.713954       1 node.go:84] raft.node: 17ae elected leader 17ae at term 13
I1204 21:27:55.392399       1 groups.go:718] Leader idx=6062 of group=2 is connecting to Zero for txn updates
W1204 21:27:55.392803       1 groups.go:723] WARNING: We don't have address of any dgraphzero leader.
I1204 21:27:56.393134       1 groups.go:718] Leader idx=6062 of group=2 is connecting to Zero for txn updates
E1204 21:27:56.397090       1 draft.go:467] Lastcommit 10337 > current 10002. This would cause some commits to be lost.
E1204 21:28:02.383404       1 pool.go:178] Echo error from dgraph-0.dgraph.default.svc.cluster.local:9080. Err: rpc error: code = Unimplemented desc = unknown service pb.Raft

The chart is specified as follows:

statefulset.yml:
# This StatefulSet runs 1 pod with one Zero, one Server
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dgraph
spec:
  serviceName: "dgraph"
  replicas: 1
  selector:
      matchLabels:
        app: dgraph
  template:
    metadata:
      labels:
        app: dgraph
    spec:
      {{- if .Values.server.initData.image }}
      initContainers:
      - name: init-schema
        image: {{ .Values.server.initData.image }}
        command: ['curl', '-X', 'POST', '-H', 'X-Dgraph-CommitNow:true', '--data-binary', '@graph/schema.txt', '{{ .Values.service.name }}.default.svc.cluster.local/alter']
      - name: init-data
        image: {{ .Values.server.initData.image }}
        command: ['curl', '-X', 'POST', '-H', 'X-Dgraph-CommitNow:true', '--data-binary', '@graph/data.txt', '{{ .Values.service.name }}.default.svc.cluster.local/mutate']
      {{- end }}
      containers:
      - name: zero
        image: {{ template "dgraph.image" . }}
        imagePullPolicy: {{ .Values.image.pullPolicy | quote }}
        ports:
        - containerPort: {{ .Values.service.ports.zeroGrpc }}
          name: zero-grpc
        - containerPort: {{ .Values.service.ports.zeroHttp }}
          name: zero-http
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph zero --my=$(hostname -f):{{ .Values.service.ports.zeroGrpc }}
      - name: server
        image: {{ template "dgraph.image" . }}
        imagePullPolicy: {{ .Values.image.pullPolicy | quote }}
        ports:
        - containerPort: {{ .Values.service.ports.serverHttp }}
          name: server-http
        - containerPort: {{ .Values.service.ports.serverGrpc }}
          name: server-grpc
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph alpha --my=$(hostname -f):{{ .Values.server.port }} --lru_mb {{ .Values.server.lruSizeMB }} --zero {{ .Values.server.zeroDns }}:{{ .Values.service.ports.zeroGrpc }}
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: {{ .Values.storage.size }}

values.yml:
image:
  registry: docker.io
  repository: dgraph/dgraph
  tag: latest
  pullPolicy: Always

service:
  name: dgraph-service
  ports:
    zeroGrpc: 5080
    zeroHttp: 6080
    serverHttp: 8080
    serverGrpc: 9080

server:
  # Estimate of the LRU cache size in MB. It’s recommended to set lru_mb to one-third the available RAM.
  lruSizeMB: 2048
  zeroDns: dgraph-0.dgraph.default.svc.cluster.local
  port: 7080
  initData:
    image: ""
    #image: "registry.gitlab.com/infix/taxgorillaprototype/backend:latest"

storage:
  size: 5Gi

aurel · December 5, 2018, 1:34pm

I solved the problem. It was indeed a Dgraph problem. I had overlooked the fact that a persistentVolumeClaim is used for storage, so deleting and reinstalling the containers didn’t solve the issue: the old data survived every reinstall. I wiped the storage volume (i.e. I deleted the p, w, and zw folders) and voilà, it all works again!
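For anyone landing here with the same symptoms, the wipe described above could look roughly like the sketch below. It is not the exact procedure from this post. The pod name dgraph-0 and the container name zero are taken from the logs and the chart. The data directory /dgraph is an assumption: it is the chart's mountPath, and the sketch assumes the containers also use it as their working directory. The wipe_dgraph_state and run helpers are hypothetical names. By default the script only prints the commands. Note that this deletes all Dgraph data on the volume.

```shell
#!/usr/bin/env bash
# Sketch: wipe stale Dgraph state that survives "helm delete --purge"
# because it lives on a PersistentVolumeClaim.
# Assumptions (not confirmed by the thread): the p, w and zw folders
# sit directly under the chart's mountPath.

POD="${POD:-dgraph-0}"             # pod name seen in the logs
CONTAINER="${CONTAINER:-zero}"     # container name from statefulset.yml
DATA_DIR="${DATA_DIR:-/dgraph}"    # mountPath from statefulset.yml

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

wipe_dgraph_state() {
  # Show which claims are still around after the release was removed.
  run kubectl get pvc
  # Remove the posting list (p), write-ahead log (w) and Zero (zw) folders.
  run kubectl exec "$POD" -c "$CONTAINER" -- rm -rf \
    "$DATA_DIR/p" "$DATA_DIR/w" "$DATA_DIR/zw"
  # Restart the pod so Zero and Alpha come up with empty state.
  run kubectl delete pod "$POD"
}

wipe_dgraph_state
```

Run it once as is to review the printed commands, then again with DRY_RUN=0 to execute them. Deleting the folders in place, rather than deleting the claim itself, keeps the existing volume bound, which matters if the persistent volume was provisioned by hand.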

dmai · December 6, 2018, 7:04pm

Glad you got it working @aurel. You shared a helm chart in your post. Is this something that can be contributed to the community for this open issue about adding a Dgraph helm chart?

https://github.com/dgraph-io/dgraph/issues/1917

aurel · December 8, 2018, 9:26pm

Sure! However, I’d have to look into that a bit; the persistent volume has to exist, for example.

