Alpha is failing on start

I set up a Dgraph cluster locally using minikube (3x Alphas, 3x Zeros) and everything was fine. Then I scaled all Alphas down to 0:

kubectl scale statefulset proj-graph-engine --replicas=0

then removed all the PVCs and PVs related to those Alphas, and now when I scale the Alphas back up I get:

...
I1119 10:26:06.414946      16 draft.go:1505] Calling IsPeer
E1119 10:26:06.415704      16 draft.go:1538] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
...

I’m using Dgraph v20.03.1

I have started a new cluster using Dgraph v20.03.6 and now after dropping PVCs and scaling Alphas up this shows up in the logs:

[pod/proj-graph-engine-zero-0/proj-graph-engine-zero] I1119 11:44:46.514916      17 zero.go:440] Connected: cluster_info_only:true
[pod/proj-graph-engine-1/proj-graph-engine] I1119 11:44:47.374876      15 draft.go:1543] Calling IsPeer
[pod/proj-graph-engine-1/proj-graph-engine] E1119 11:44:47.380827      15 draft.go:1576] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
[pod/proj-graph-engine-zero-0/proj-graph-engine-zero] I1119 11:44:47.517125      17 zero.go:422] Got connection request: cluster_info_only:true
[pod/proj-graph-engine-0/proj-graph-engine] E1119 11:44:47.523360      18 draft.go:1576] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...

And again with Dgraph v20.07.2 and the same operation (scale down, remove PVCs, scale up):

[pod/proj-graph-engine-0/proj-graph-engine] I1119 11:55:31.892497      16 draft.go:1584] Calling IsPeer
[pod/proj-graph-engine-0/proj-graph-engine] E1119 11:55:31.894925      16 draft.go:1617] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
[pod/proj-graph-engine-1/proj-graph-engine] I1119 11:55:31.990508      17 draft.go:1584] Calling IsPeer
[pod/proj-graph-engine-1/proj-graph-engine] E1119 11:55:31.996177      17 draft.go:1617] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
[pod/proj-graph-engine-2/proj-graph-engine] I1119 11:55:32.865837      32 draft.go:1584] Calling IsPeer
[pod/proj-graph-engine-2/proj-graph-engine] E1119 11:55:32.879186      32 draft.go:1617] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
[pod/proj-graph-engine-0/proj-graph-engine] I1119 11:55:32.896428      16 draft.go:1584] Calling IsPeer
[pod/proj-graph-engine-0/proj-graph-engine] E1119 11:55:32.904441      16 draft.go:1617] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
[pod/proj-graph-engine-1/proj-graph-engine] I1119 11:55:32.996402      17 draft.go:1584] Calling IsPeer
[pod/proj-graph-engine-1/proj-graph-engine] E1119 11:55:33.000037      17 draft.go:1617] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
[pod/proj-graph-engine-2/proj-graph-engine] I1119 11:55:33.880791      32 draft.go:1584] Calling IsPeer
[pod/proj-graph-engine-2/proj-graph-engine] E1119 11:55:33.886389      32 draft.go:1617] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
[pod/proj-graph-engine-0/proj-graph-engine] I1119 11:55:33.904971      16 draft.go:1584] Calling IsPeer
[pod/proj-graph-engine-0/proj-graph-engine] E1119 11:55:33.912895      16 draft.go:1617] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...
[pod/proj-graph-engine-1/proj-graph-engine] I1119 11:55:34.004921      17 draft.go:1584] Calling IsPeer
[pod/proj-graph-engine-1/proj-graph-engine] E1119 11:55:34.010401      17 draft.go:1617] Error while calling hasPeer: error while joining cluster: rpc error: code = Unknown desc = No node has been set up yet. Retrying...

This means the official Upgrade Database procedure no longer works:
https://dgraph.io/docs/deploy/dgraph-administration/#upgrading-database

I have assigned this to someone on the team to take a look.

I am looking at this right now, starting with v20.03.6.

@lukaszlenart Did you remove all the Alphas but keep the Zeros? If you’re looking to restart the cluster from scratch you’ll want to start from a clean slate (i.e., new data directories) for all Zeros and Alphas.

@lukaszlenart For this explicit process, using the dgraph helm chart, you could do the following:

helm install pge --set image.tag=v20.03.6 dgraph/dgraph

## Scale Down Cluster and Delete Data + State
kubectl scale statefulset pge-dgraph-alpha --replicas=0
kubectl scale statefulset pge-dgraph-zero --replicas=0
kubectl delete pvc --selector release=pge

## Scale Up Cluster Starting with Zeros
kubectl scale statefulset pge-dgraph-zero --replicas=3

## Wait until 3 x healthy zero nodes
kubectl scale statefulset pge-dgraph-alpha --replicas=3
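
The "wait until healthy" step can be scripted rather than eyeballed. A minimal sketch (the function name is mine, not part of the chart), using `kubectl rollout status`, which blocks until all of the StatefulSet's replicas are ready:

```shell
#!/usr/bin/env bash
REL="pge"

# Wait for the Zero StatefulSet to report all replicas ready, then
# scale the Alphas back up. Gives up after 5 minutes.
scale_up_in_order() {
  local rel="$1"
  kubectl rollout status statefulset "$rel-dgraph-zero" --timeout=300s || return 1
  kubectl scale statefulset "$rel-dgraph-alpha" --replicas=3
}

# Usage: scale_up_in_order "$REL"
```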

To run the bulk loader against an empty cluster, you would want to use an init container.

Generally, with immutable-infrastructure patterns, it may be easier to delete the StatefulSets and recreate them from scratch. With the helm chart used above, that process would be:

helm delete pge
kubectl delete pvc --selector release=pge

@dmai yes, at first I removed just the Alphas, but then I removed both Alphas and Zeros and the problem persisted. The main issue is that if you copy all the files from the bulk loader to all the Alphas and then shut them down, on restart they start complaining in the logs that the files already exist, and the cluster is broken.

@joaquin what do you mean by initContainers? copy data in?

@lukaszlenart Correct. On each of the Alphas, you'd have an initContainer that spin-loops until you finish the bulk load and move the created directory to p:

      command:
        - bash
        - "-c"
        - |
          trap "exit" SIGINT SIGTERM
          echo "Write to /dgraph/doneinit when ready."
          until [ -f /dgraph/doneinit ]; do sleep 2; done

Then kubectl cp the file(s) into the initContainer on the alpha-0 pod (or curl them down from within the initContainer), do the bulk load, and touch /dgraph/doneinit. Repeat the same process for alpha-1, then alpha-2.
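
Scripted, the per-Alpha sequence might look like this sketch (pod and container names follow the examples in this thread; the function name and data directory are illustrative):

```shell
#!/usr/bin/env bash

# Copy-in / signal steps for one Alpha's initContainer.
# $1 = helm release name, $2 = Alpha ordinal (0, 1, 2).
load_one_alpha() {
  local rel="$1" num="$2"
  # Copy the data files into the waiting initContainer.
  kubectl cp ./1million/ "$rel-dgraph-alpha-$num:/dgraph" \
    -c "$rel-dgraph-alpha-init" || return 1
  # (Run `dgraph bulk` inside the initContainer here, via kubectl exec,
  # and move the output into place as shown later in the thread.)
  # Finally, create the sentinel file the spin-loop is polling for.
  kubectl exec "$rel-dgraph-alpha-$num" -c "$rel-dgraph-alpha-init" \
    -- touch /dgraph/doneinit
}

# Usage: for NUM in 0 1 2; do load_one_alpha pge "$NUM"; done
```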

@joaquin thanks a lot, that should work!

@lukaszlenart As an example, I added initContainer automation in the current master of the dgraph helm chart. If you want to use this, you could do the following.

Get the Chart

git clone https://github.com/dgraph-io/charts.git

REL="pge"
helm install "$REL" \
 --set image.tag=v20.03.6 \
 --set alpha.initContainers.init.enabled=true \
 ./charts/charts/dgraph/

Copy Data to InitContainer

I used a sample dataset:

mkdir 1million && pushd 1million
PREFIX=https://github.com/dgraph-io/benchmarks/raw/master/data
FILES=(1million.schema 1million.rdf.gz)

for FILE in "${FILES[@]}"; do
  curl --silent --location --remote-name "$PREFIX/$FILE"
done

popd

Then I ran this process on alpha 0, 1, 2.

NUM=0
REL="pge"

kubectl cp ./1million/ $REL-dgraph-alpha-$NUM:/dgraph -c $REL-dgraph-alpha-init
kubectl exec -ti $REL-dgraph-alpha-$NUM -c $REL-dgraph-alpha-init -- bash

## inside initContainer
REL="pge"
dgraph bulk \
 --files /dgraph/1million/1million.rdf.gz \
 --schema /dgraph/1million/1million.schema \
 --zero $REL-dgraph-zero-0.$REL-dgraph-zero-headless.default.svc.cluster.local:5080

mv /dgraph/out/0/p /dgraph
touch /dgraph/doneinit

Thanks a lot @joaquin! Just one question: can I run the bulk loader on the Alphas? I thought I needed to run it against the Zeros' leader and then copy the "0" output to all the Alphas.

Hm… so you run the bulk loader in a dedicated init container, which is just a Dgraph … interesting :slight_smile:

One more question: is this Chart officially released?

Tested and it works, osm!

Yes, this chart is publicly available.

I didn’t announce the initContainer feature yet, as the interface will change from alpha.initContainers.generic.enabled to alpha.initContainers.init.enabled. I will also add further automation for specialized initContainers, such as offline restore and the bulk loader, but I’m not sure whether those two will make it into the next chart release, 0.0.13.

Ach… I tried with --set alpha.initContainers.init.enabled=true and that’s why it didn’t work, thanks a lot!

On this question: the bulk loader can run anywhere, but it does need to connect to one of the Dgraph Zero nodes for timestamp generation. It doesn’t have to be the Zero leader, as members are equal peers and the leader is elected (which member leads depends on availability). This is part of the Raft consensus algorithm: https://raft.github.io/.

The output (./out), which contains the p directory (for a 1-shard cluster), needs to be copied to each Dgraph Alpha node before it starts. So you can run the bulk load on one system and copy the same p directory to each of the Dgraph Alpha nodes before they start.
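
A sketch of that "run once, copy everywhere" approach, assuming a single-group, 3-Alpha cluster and the initContainer setup from above (the function name is illustrative):

```shell
#!/usr/bin/env bash

# Distribute one bulk-loader output (out/0/p) to every Alpha's
# initContainer, then create the sentinel file each one is polling for.
# $1 = helm release name.
distribute_p() {
  local rel="$1" num
  for num in 0 1 2; do
    kubectl cp ./out/0/p "$rel-dgraph-alpha-$num:/dgraph" \
      -c "$rel-dgraph-alpha-init" || return 1
    kubectl exec "$rel-dgraph-alpha-$num" -c "$rel-dgraph-alpha-init" \
      -- touch /dgraph/doneinit
  done
}

# Usage: distribute_p pge
```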

I haven’t tried that exact process yet, as I was following the pattern of how it would be automated within Kubernetes (à la immutable-infrastructure style).

Thanks for the clarification. Does that mean I cannot run the bulk loader on each Alpha separately, and must instead copy the p folder created during the first import to the rest of the Alphas?