Bulk loader location

K8S version: 1.20.7
K8S Configuration: 1 Zero node, 3 Alpha nodes

Ref: https://dgraph.io/docs/deploy/fast-data-loading/bulk-loader/
The above link describes the bulk and live loaders.

A few newbie questions…

  • Where is this dgraph bulk/Live tool located?
  • From where do we run this tool?
  • Do we run this tool on Zero node(s)?
  • Can we run this tool from non-dgraph cluster node?

It is part of the Dgraph binary.

Anywhere, as long as the versions match.

The Bulk Loader is an offline tool, but it needs a Zero group running to lease UIDs, and all Alphas should be shut down while it runs. The Live Loader, on the other hand, needs the whole cluster running: it also leases UIDs from the Zero group, and it writes the data to the Alpha groups.
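To make the difference concrete, a minimal sketch of a bulk load: only a Zero is up, no Alphas, and the loader talks to the Zero to lease UIDs. File names, ports, and output paths here are illustrative, not from the thread.

```shell
# Start a single Zero (the bulk loader leases UIDs from it).
# No Alphas should be running at this point.
dgraph zero --my=localhost:5080 &

# Run the bulk loader against that Zero.
# data.rdf.gz and schema.txt are placeholder file names.
dgraph bulk -f data.rdf.gz -s schema.txt \
  --zero=localhost:5080 --out=out
```

After it finishes, the generated `out/N/p` directories become the Alphas' posting-list directories.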

Not sure what you mean, but you can run the Dgraph binary anywhere. It just needs to be able to reach the cluster (from outside, if that's the case), and it should be the same version as the cluster.

@MichelDiz thanks for the reply.

  • We are using v21.03.1 docker container on AKS
  • In my Dgraph K8S cluster, below is what I see
  • In which folder does the dgraph bulk/live tool exist?
  • Can the dgraph bulk tool load from Azure Storage account?

Docker or Kubernetes???

Neither. The bulk loader is part of the binary. If for some reason you want to know where the binary is, run `which dgraph` in your terminal and it will print something like /usr/bin/dgraph.

Dgraph Bulk Loader can run on any Linux distro. If the OS has access to your storage, then yes, it can.

This thread has become very confused.

My workflow on k8s for using bulk loader:

  1. bring up 3 zeros
  2. bring up one alpha and have it block in the init container (part of the helm chart)
  3. exec into init container on the single alpha that is up, and run dgraph bulk <flags>
  4. after the bulk loader finishes, bring up remaining alpha node pods (they also stall in init container)
  5. distribute the out/N/p/ directories from the first node to their respective group members and place them in /dgraph/p
  6. touch doneinit in every alpha to unblock the init container and it will start up with all of your data
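The steps above can be sketched with kubectl. Pod names, container names, paths, and shard counts below are assumptions based on a typical Dgraph Helm deployment; adjust them to your chart.

```shell
# Step 3: run the bulk loader inside the one alpha that is up,
# while its init container is blocking startup.
kubectl exec -it dgraph-alpha-0 -c init-alpha -- \
  dgraph bulk -f /dgraph/data.rdf.gz -s /dgraph/schema.txt \
  --zero=dgraph-zero-0.dgraph-zero:5080 \
  --map_shards=3 --reduce_shards=3 --out=/dgraph/out

# Step 5: distribute each out/N/p directory to its group's alpha,
# e.g. copying shard 1 to the second alpha via the local machine.
kubectl cp dgraph-alpha-0:/dgraph/out/1/p /tmp/p
kubectl cp /tmp/p dgraph-alpha-1:/dgraph/p

# Step 6: unblock every alpha's init container.
for i in 0 1 2; do
  kubectl exec dgraph-alpha-$i -c init-alpha -- touch /dgraph/doneinit
done
```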

Wow! this is even more confusing.

As per the official docs, the bulk loader must be run from Zero, and all Alphas should be down.

Does it work for you?

It does not have to be run from the zeros, it has to access the zeros while it’s running (to allocate uids).

You could run the bulk loader from your laptop with a port-forward to the zero leader if you want. The reason I do it in the init container of one alpha is to use the 16c / 64GiB RAM it has. (We have about 30 billion edges in my dgraph cluster now.)
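The laptop-with-port-forward approach could look like this. The pod name and ports are placeholders; 5080 is Zero's default gRPC port.

```shell
# Forward the zero leader's gRPC port to localhost
# (dgraph-zero-0 is a placeholder pod name).
kubectl port-forward pod/dgraph-zero-0 5080:5080 &

# Run the bulk loader locally against the forwarded port.
dgraph bulk -f data.rdf.gz -s schema.txt --zero=localhost:5080
```

The trade-off is that the map/reduce phases then run on your laptop's CPU and RAM instead of the cluster's, which is why a beefy alpha node is usually the better host for large datasets.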

But yes, the alphas should be down; otherwise they will interact with the zeros while the bulk loader is running.

  • Does bulk loader usage involve bootstrapping dgraph with millions/billions of nodes and edges?

Well, that is its job, so yes. It's not magic; it just reads RDF files and writes them to Badger in the same format Dgraph reads from Badger.

Also since you did not get an answer to this:

The bulk loader can read from MinIO, and MinIO can pass through to Azure Blob Storage.
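A sketch of that setup, assuming the MinIO gateway mode for Azure and Dgraph's `minio://` source scheme; the credentials, bucket, and file names are placeholders you would replace with your own:

```shell
# Run a local MinIO gateway in front of Azure Blob Storage.
# The access key is your storage account name; the secret key
# is the storage account key (placeholders below).
export MINIO_ACCESS_KEY=<azure-storage-account-name>
export MINIO_SECRET_KEY=<azure-storage-account-key>
minio gateway azure &

# Point the bulk loader at the gateway with minio:// URIs.
dgraph bulk \
  -f minio://localhost:9000/mybucket/data.rdf.gz \
  -s minio://localhost:9000/mybucket/schema.txt \
  --zero=localhost:5080
```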