Serving bulk-loaded data (HA cluster)

I have bulk-loaded data (offline) into Dgraph and now want to serve it by pointing an alpha at the out/0 directory. All is fine: the alpha picks up the data and serves it. But the zero has a maxUID of 0.

Fetching the /state endpoint returns:

{
  "counter": "xxx",
  "groups": {
    "1": {
      "members": {
        "1": {
          "id": "1",
          "groupId": 1,
          "addr": "dgraph-alpha-0.dgraph-alpha.default.svc.cluster.local:7080",
          "leader": true,
          "amDead": false,
          "lastUpdate": "1620734135",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        }
      },
      "tablets": {
        "PREDICATE1": {
          "groupId": 1,
          "predicate": "PREDICATE1",
          "force": false,
          "onDiskBytes": "0",
          "remove": false,
          "readOnly": false,
          "moveTs": "0",
          "uncompressedBytes": "0"
        },
        "PREDICATE2": {
          "groupId": 1,
          "predicate": "PREDICATE2",
          "force": false,
          "onDiskBytes": "0",
          "remove": false,
          "readOnly": false,
          "moveTs": "0",
          "uncompressedBytes": "0"
        }
      },
      "snapshotTs": "0",
      "checksum": "17033103070337915579",
      "checkpointTs": "0"
    }
  },
  "zeros": {
    "1": {
      "id": "1",
      "groupId": 0,
      "addr": "dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080",
      "leader": true,
      "amDead": false,
      "lastUpdate": "0",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    },
    "2": {
      "id": "2",
      "groupId": 0,
      "addr": "dgraph-zero-1.dgraph-zero.default.svc.cluster.local:5080",
      "leader": false,
      "amDead": false,
      "lastUpdate": "0",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    },
    "3": {
      "id": "3",
      "groupId": 0,
      "addr": "dgraph-zero-2.dgraph-zero.default.svc.cluster.local:5080",
      "leader": false,
      "amDead": false,
      "lastUpdate": "0",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    }
  },
  "maxUID": "0",
  "maxTxnTs": "0",
  "maxNsID": "0",
  "maxRaftId": "1",
  "removed": [],
  "cid": "a86ce955-298c-47dc-af22-a3e7884cc023",
  "license": {
    "user": "",
    "maxNodes": "18446744073709551615",
    "expiryTs": "1623325664",
    "enabled": true
  }
}
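
As a side note, the field in question can be pulled out of that response directly; a quick sketch, assuming `curl` and `jq` are available and that Zero's HTTP port is the default 6080 (the hostname is the one from this cluster):

```shell
# Query Zero's HTTP /state endpoint and extract maxUID.
# A value of "0" means this Zero has no record of the UIDs
# that were assigned during the bulk load.
curl -s http://dgraph-zero-0.dgraph-zero.default.svc.cluster.local:6080/state \
  | jq -r '.maxUID'
```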

When I point the zero node at the zw directory that was created by the bulk loader, I get the expected maxUID:

I0511 13:49:29.169066      21 assign.go:47] Updated UID: 179410001. Txn Ts: 10001. NsID: 1.

But then the zero nodes ignore the --peer option and all of them connect to localhost:5080, so the three zero instances do not form a cluster. And the alpha nodes can no longer connect to the zero:

I0511 13:50:27.486524      17 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
I0511 13:50:27.497629      17 pool.go:162] CONNECTING to localhost:5080
W0511 13:50:27.498170      17 pool.go:267] Connection lost with localhost:5080. Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
E0511 13:50:28.390083      17 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>

What is the approved way to serve a bulk-loaded dataset with a dgraph-ha.yaml cluster?

Using K8s, you have to start the zero instances, make the alphas wait, and do the bulk load. You always have to use the same zero. Never delete the zero folders.

So you are saying I have to bulk load against the three zeros if I want to serve the data with three zeros?

I bulk-loaded them with a single zero and now want to serve them with three zeros (--replicas 3).

Or, to simplify: I bulk load with one zero and now want to serve the data with one zero (--replicas 1) and multiple alphas, all serving their own copy of out/0.

The replicas don’t matter. But if you’re going to have replication, there are configs to set in the bulk loader. In general, the zero doesn’t need replica config during a bulk load, but nothing speaks against it.

Can you point me to it, please?

Even without replication, the alpha does not get added to the cluster.

I start a single zero with the zw directory from bulk-loading:

dgraph zero --cwd /dgraph/zero --my=dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080

Then I launch a single alpha with the out/0 directory from bulk-loading:

dgraph alpha --cwd /dgraph/alpha/out/0 --my=dgraph-alpha-0.dgraph-alpha.default.svc.cluster.local:7080 --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080

It connects to the zero node and then loops over these errors:

I0511 15:33:16.071007      19 pool.go:162] CONNECTING to dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
I0511 15:33:16.081488      19 pool.go:162] CONNECTING to localhost:5080
W0511 15:33:16.083147      19 pool.go:267] Connection lost with localhost:5080. Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
E0511 15:33:16.969016      19 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
E0511 15:33:17.969733      19 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
E0511 15:33:18.970415      19 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
E0511 15:33:19.971068      19 groups.go:1177] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
I0511 15:33:20.968390      19 admin.go:824] Error reading GraphQL schema: Please retry again, server is not ready to accept requests.

Zero’s /state endpoint tells me:

{
  "counter": "1636",
  "groups": {},
  "zeros": {
    "1": {
      "id": "1",
      "groupId": 0,
      "addr": "localhost:5080",
      "leader": true,
      "amDead": false,
      "lastUpdate": "0",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    }
  },
  "maxUID": "179410000",
  "maxTxnTs": "10000",
  "maxNsID": "0",
  "maxRaftId": "0",
  "removed": [],
  "cid": "7c878cbe-5a65-4ef9-87c9-8d4bdbe2b202",
  "license": {
    "user": "",
    "maxNodes": "18446744073709551615",
    "expiryTs": "1622879986",
    "enabled": true
  }
}

If you are talking about --reduce_shards, then this is fine, as I am planning to use only one group. I could not find any other replica-related settings at https://dgraph.io/docs/deploy/fast-data-loading/bulk-loader/.

This looks like leftover config from a previous run. You have to clean up your cluster before doing it. Never do a bulk load with previous files in the volume. And make the environment as identical as possible; for example, the addr should always be the same one you started with.
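
A minimal sketch of that cleanup, assuming the directory layout used elsewhere in this thread (zw for zero, p/w for alpha, out/xidmap/tmp for the bulk loader); these paths are assumptions, so verify them before running anything destructive:

```shell
# Stop all dgraph processes first, then wipe state left over from
# previous runs so the bulk load starts from a clean slate.
rm -rf /dgraph/zw                               # zero's Raft state / cluster config
rm -rf /dgraph/p /dgraph/w                      # alpha's postings and write-ahead log
rm -rf /dgraph/out /dgraph/xidmap /dgraph/tmp   # bulk-loader output and scratch space
```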

Is this particular "addr": "localhost:5080" a problem in itself, or just an indication that the setup was not clean, which worries you because it might have other side effects?

To give you some more background: the bulk loading was done in a single Docker container (not Kubernetes), hosting zero and alpha together:

rm -rf /dgraph/zw
dgraph zero >> /data/zero.log 2>&1 < /dev/null &
sleep 5

rm -rf /dgraph/out /dgraph/xidmap
dgraph bulk --store_xids --xidmap /dgraph/xidmap -j 4 --ignore_errors --tmp /dgraph/tmp -f "/data/data.rdf" -s "/data/schema.rdf" --format=rdf --out=/dgraph/out --replace_out 2>&1 | tee /dgraph/bulk.log

This probably caused the "addr": "localhost:5080". As you can see, both zero and alpha started from empty directories.

There’s an 85% chance it’s what I said. I have seen this happen sometimes. Dgraph stores the configs in stone, so leftover configs can still be there and not get overridden.

But why did you share a K8s YAML?

Make sure they always use the SVC address if you use K8s, or the Docker network context otherwise.

Zero configs are only stored in the zw directory, right?

So I have bulk-loaded against a zero at localhost:5080 and now want to serve that zero under a different hostname, an SVC address. Is that repurposing possible?

Can I “rename” zeros, or do I have to “migrate” the zero cluster away from that node, e.g. by adding other zeros and then removing the first one?

In my own experience, you can’t. You have to start the zero with the final address from the beginning.

The last idea feels plausible, but I’m not sure. It really does feel possible, as you are killing the original zero after passing the baton to the others. Give it a shot.
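
For what it’s worth, that migration attempt could look roughly like this. A sketch only: it uses Zero’s real /removeNode HTTP endpoint (served on Zero’s HTTP port, 6080 by default), but the addresses are hypothetical, and the flag for assigning a Raft index (--idx vs. the --raft superflag) varies between Dgraph versions:

```shell
# Join a new zero (with its final SVC address) to the existing
# localhost-addressed zero. Each new zero needs a unique Raft index;
# the exact flag syntax for that depends on the Dgraph version.
dgraph zero --my=dgraph-zero-1.dgraph-zero.default.svc.cluster.local:5080 \
  --peer=localhost:5080 &   # plus a unique Raft index flag

# Once the new zero has joined and is healthy, retire the original
# node (Raft id 1) from group 0 via Zero's HTTP endpoint.
curl "http://localhost:6080/removeNode?id=1&group=0"
```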

I can confirm that bulk-loading against a zero that has the same prospective hostname as the serving instance resolves all issues. This specific requirement during bulk loading is a bit unexpected; it should be strongly emphasized in the bulk-loading section of the documentation.
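
For later readers, the fix amounts to one change in the original bulk-load script: start the zero under the hostname it will later serve from, and point the bulk loader at it. A sketch, assuming the SVC address from this thread resolves inside the bulk-load container:

```shell
# Start zero under the SAME address the serving cluster will use,
# so that address (not localhost:5080) is persisted in zw/.
dgraph zero --my=dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080 \
  >> /data/zero.log 2>&1 < /dev/null &
sleep 5

# Point the bulk loader at that zero instead of the localhost default.
dgraph bulk --store_xids --xidmap /dgraph/xidmap -j 4 --ignore_errors \
  --tmp /dgraph/tmp -f /data/data.rdf -s /data/schema.rdf --format=rdf \
  --out=/dgraph/out --replace_out \
  --zero=dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080
```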

Maybe it is mentioned there, but reading through it multiple times didn’t prevent me from learning this the hard way, wasting many days of trial and error. Definitely worth improving.

Many thanks for your quick support, highly appreciated!
