Serving bulk-loaded data (HA cluster)

I have bulk-loaded data (offline) into Dgraph and now want to serve it by pointing an alpha at the out/0 directory. All is fine: the alpha picks up the data and serves it. But the zero has a maxUID of 0.

Fetching the /state endpoint returns:

{
  "counter": "xxx",
  "groups": {
    "1": {
      "members": {
        "1": {
          "id": "1",
          "groupId": 1,
          "addr": "dgraph-alpha-0.dgraph-alpha.default.svc.cluster.local:7080",
          "leader": true,
          "amDead": false,
          "lastUpdate": "1620734135",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        }
      },
      "tablets": {
        "PREDICATE1": {
          "groupId": 1,
          "predicate": "PREDICATE1",
          "force": false,
          "onDiskBytes": "0",
          "remove": false,
          "readOnly": false,
          "moveTs": "0",
          "uncompressedBytes": "0"
        },
        "PREDICATE2": {
          "groupId": 1,
          "predicate": "PREDICATE2",
          "force": false,
          "onDiskBytes": "0",
          "remove": false,
          "readOnly": false,
          "moveTs": "0",
          "uncompressedBytes": "0"
        }
      },
      "snapshotTs": "0",
      "checksum": "17033103070337915579",
      "checkpointTs": "0"
    }
  },
  "zeros": {
    "1": {
      "id": "1",
      "groupId": 0,
      "addr": "dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080",
      "leader": true,
      "amDead": false,
      "lastUpdate": "0",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    },
    "2": {
      "id": "2",
      "groupId": 0,
      "addr": "dgraph-zero-1.dgraph-zero.default.svc.cluster.local:5080",
      "leader": false,
      "amDead": false,
      "lastUpdate": "0",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    },
    "3": {
      "id": "3",
      "groupId": 0,
      "addr": "dgraph-zero-2.dgraph-zero.default.svc.cluster.local:5080",
      "leader": false,
      "amDead": false,
      "lastUpdate": "0",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    }
  },
  "maxUID": "0",
  "maxTxnTs": "0",
  "maxNsID": "0",
  "maxRaftId": "1",
  "removed": [],
  "cid": "a86ce955-298c-47dc-af22-a3e7884cc023",
  "license": {
    "user": "",
    "maxNodes": "18446744073709551615",
    "expiryTs": "1623325664",
    "enabled": true
  }
}
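
As a side note, the field in question can be pulled out of that response directly; a quick sketch, assuming `curl` and `jq` are available and that Zero's HTTP port is the default 6080 (the hostname is the one from this cluster):

```shell
# Query Zero's HTTP /state endpoint and extract maxUID.
# A value of "0" means this Zero has no record of the UIDs
# that were assigned during the bulk load.
curl -s http://dgraph-zero-0.dgraph-zero.default.svc.cluster.local:6080/state \
  | jq -r '.maxUID'
```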

When I point the zero node at the zw directory that was created by the bulk loader, I get the expected maxUID:

I0511 13:49:29.169066      21 assign.go:47] Updated UID: 179410001. Txn Ts: 10001. NsID: 1.

But then the zero nodes ignore the --peer option and all of them connect to localhost:5080, so the three zero instances do not form a cluster. And the alpha nodes can no longer connect to the zero:

I0511 13:50:27.486524      17 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
I0511 13:50:27.497629      17 pool.go:162] CONNECTING to localhost:5080
W0511 13:50:27.498170      17 pool.go:267] Connection lost with localhost:5080. Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
E0511 13:50:28.390083      17 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>

What is the approved way to serve a bulk-loaded dataset with a dgraph-ha.yaml cluster?

Using K8s, you have to start the zero instances, make the alphas wait, and do the bulk load. You always have to use the same zero. Never delete the zero folders.

So you are saying I have to bulk load against the three zeros if I want to serve the data with three zeros?

I bulk-loaded them with a single zero and now want to serve them with three zeros (--replicas 3).

Or, to simplify: I bulk load with one zero and now want to serve the data with one zero (--replicas 1) and multiple alphas, all serving their own copy of out/0.

The replicas don’t matter. But if you’re going to have replication, there are configs to set in the bulk loader. In general, the zero doesn’t need replica config during a bulk load, but nothing speaks against it.

Can you point me to it, please?

Even without replication, the alpha does not get added to the cluster.

I start a single zero with the zw directory from bulk-loading:

dgraph zero --cwd /dgraph/zero --my=dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080

Then I launch a single alpha with the out/0 directory from bulk-loading:

dgraph alpha --cwd /dgraph/alpha/out/0 --my=dgraph-alpha-0.dgraph-alpha.default.svc.cluster.local:7080 --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080

It connects to the zero node and then loops over these errors:

I0511 15:33:16.071007      19 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
I0511 15:33:16.081488      19 pool.go:162] CONNECTING to localhost:5080
W0511 15:33:16.083147      19 pool.go:267] Connection lost with localhost:5080. Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
E0511 15:33:16.969016      19 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
E0511 15:33:17.969733      19 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
E0511 15:33:18.970415      19 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
E0511 15:33:19.971068      19 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0511 15:33:20.968390      19 admin.go:824] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.

Zero’s /state endpoint tells me:

{
  "counter": "1636",
  "groups": {},
  "zeros": {
    "1": {
      "id": "1",
      "groupId": 0,
      "addr": "localhost:5080",
      "leader": true,
      "amDead": false,
      "lastUpdate": "0",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    }
  },
  "maxUID": "179410000",
  "maxTxnTs": "10000",
  "maxNsID": "0",
  "maxRaftId": "0",
  "removed": [],
  "cid": "7c878cbe-5a65-4ef9-87c9-8d4bdbe2b202",
  "license": {
    "user": "",
    "maxNodes": "18446744073709551615",
    "expiryTs": "1622879986",
    "enabled": true
  }
}

If you are talking about --reduce_shards, then this is fine, as I am planning to use only one group. I could not find any other replica-related settings at https://dgraph.io/docs/deploy/fast-data-loading/bulk-loader/.

This looks like leftover config from a previous run. You have to clean up your cluster before doing it. Never do a bulk load with previous files in the volume. And make the environment as identical as possible; for example, the addr should always be the same one you started with.
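
A minimal sketch of that cleanup, assuming the directory layout used elsewhere in this thread (zw for zero, p/w for alpha, out/xidmap/tmp for the bulk loader); these paths are assumptions, so verify them before running anything destructive:

```shell
# Stop all dgraph processes first, then wipe state left over from
# previous runs so the bulk load starts from a clean slate.
rm -rf /dgraph/zw                               # zero's Raft state / cluster config
rm -rf /dgraph/p /dgraph/w                      # alpha's postings and write-ahead log
rm -rf /dgraph/out /dgraph/xidmap /dgraph/tmp   # bulk-loader output and scratch space
```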

Is this particular "addr": "localhost:5080" a problem in itself, or just an indication that the setup was not clean, which worries you because it might have other side effects?

To give you some more background: the bulk loading was done in a single Docker container (not Kubernetes), hosting zero and alpha together:

rm -rf /dgraph/zw
dgraph zero >> /data/zero.log 2>&1 < /dev/null &
sleep 5

rm -rf /dgraph/out /dgraph/xidmap
dgraph bulk --store_xids --xidmap /dgraph/xidmap -j 4 --ignore_errors --tmp /dgraph/tmp -f "/data/data.rdf" -s "/data/schema.rdf" --format=rdf --out=/dgraph/out --replace_out 2>&1 | tee /dgraph/bulk.log

This probably caused the "addr": "localhost:5080". As you can see, both zero and alpha started from empty directories.

There’s an 85% chance it’s what I said. I have seen this happen sometimes. Dgraph stores the configs in stone, so leftover configs can still be there and not get overridden.

But why did you share a K8s YAML?

Make sure they always use the SVC address if you use K8s, or the Docker network context otherwise.

Zero configs are only stored in the zw directory, right?

So I have bulk-loaded against a zero at localhost:5080 and now want to serve that zero under a different hostname, an SVC address. Is that repurposing possible?

Can I “rename” zeros, or do I have to “migrate” the zero cluster away from that node, e.g. by adding other zeros and then removing the first one?

In my own experience, you can’t. You have to start the zero with the final address from the beginning.

The last idea feels plausible, but I’m not sure. It really does feel possible, as you are killing the original zero after passing the baton to the others. Give it a shot.
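
For what it’s worth, that migration attempt could look roughly like this. A sketch only: it uses Zero’s real /removeNode HTTP endpoint (served on Zero’s HTTP port, 6080 by default), but the addresses are hypothetical, and the flag for assigning a Raft index (--idx vs. the --raft superflag) varies between Dgraph versions:

```shell
# Join a new zero (with its final SVC address) to the existing
# localhost-addressed zero. Each new zero needs a unique Raft index;
# the exact flag syntax for that depends on the Dgraph version.
dgraph zero --my=dgraph-zero-1.dgraph-zero.default.svc.cluster.local:5080 \
  --peer=localhost:5080 &   # plus a unique Raft index flag

# Once the new zero has joined and is healthy, retire the original
# node (Raft id 1) from group 0 via Zero's HTTP endpoint.
curl "http://localhost:6080/removeNode?id=1&group=0"
```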

I can confirm that bulk-loading against a zero that has the same prospective hostname as the serving instance resolves all issues. This specific requirement during bulk loading is a bit unexpected; it should be strongly emphasized in the bulk-loading section of the documentation.
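
For later readers, the fix amounts to one change in the original bulk-load script: start the zero under the hostname it will later serve from, and point the bulk loader at it. A sketch, assuming the SVC address from this thread resolves inside the bulk-load container:

```shell
# Start zero under the SAME address the serving cluster will use,
# so that address (not localhost:5080) is persisted in zw/.
dgraph zero --my=dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080 \
  >> /data/zero.log 2>&1 < /dev/null &
sleep 5

# Point the bulk loader at that zero instead of the localhost default.
dgraph bulk --store_xids --xidmap /dgraph/xidmap -j 4 --ignore_errors \
  --tmp /dgraph/tmp -f /data/data.rdf -s /data/schema.rdf --format=rdf \
  --out=/dgraph/out --replace_out \
  --zero=dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
```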

Maybe it is mentioned there, but reading through it multiple times didn’t prevent me from learning this the hard way, wasting many days of trial and error. Definitely worth improving.

Many thanks for your quick support, highly appreciated!
