Bulk load to initial multi host cluster


#1

Hi, I loaded data into a fresh cluster (1 Zero and 3 Alphas, running natively on multiple hosts) with dgraph bulk. I can see the schema and data in Ratel, but Zero doesn't rebalance: all the data stays on the node where I ran the load. What should I do in this case?
zero parameters:
./dgraph zero --idx 2 --my=IP:PORT --replicas 3 --telemetry
bulk load parameters:
--reduce_shards 1 --map_shards 3


(Michel Conrado) #2

Dgraph has its own rebalancing rules and will move data as soon as it is needed. In the meantime, check http://localhost:6080/state to see which groups are serving the predicates.
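
To make that check easier to read, here is a minimal Python sketch that lists the predicates each group serves. It assumes the /state response is JSON with a top-level "groups" object whose entries each hold a "tablets" object keyed by predicate name; that shape is an assumption and may differ between Dgraph versions. The helper names and the sample data are made up for illustration.

```python
import json
from urllib.request import urlopen


def tablets_by_group(state):
    """Map each group id to the sorted predicate names it serves.

    Assumes state["groups"][<id>]["tablets"] is keyed by predicate name.
    """
    result = {}
    for group_id, group in (state.get("groups") or {}).items():
        tablets = group.get("tablets") or {}
        result[group_id] = sorted(tablets)
    return result


def fetch_state(zero_http="http://localhost:6080"):
    """Fetch and decode Zero's /state endpoint (needs a running Zero)."""
    with urlopen(f"{zero_http}/state") as resp:
        return json.load(resp)


# Hand-written sample in the assumed shape, not real cluster output.
sample_state = {
    "groups": {
        "1": {"tablets": {"name": {}, "friend": {}}},
        "2": {"tablets": {}},
    }
}

for group_id, predicates in tablets_by_group(sample_state).items():
    print(group_id, predicates)
```

Against a live cluster you would call tablets_by_group(fetch_state()) instead of using the sample. If only one group shows up, there is no other group for Zero to move tablets to.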


#3

Fine. Does that mean that if I bulk load data onto a node with a large disk, and the dataset is not large enough to trigger the rebalancing rules, all the data will stay on the original load node no matter what I do?


(Michel Conrado) #4

Please check https://docs.dgraph.io/deploy/#understanding-dgraph-cluster
and read the part “Shard rebalancing”.

You can also try to push tablets to other groups
https://docs.dgraph.io/deploy/#more-about-dgraph-zero

  • /moveTablet?tablet=name&group=2 This endpoint can be used to move a tablet to a group. Zero already does shard rebalancing every 8 mins, this endpoint can be used to force move a tablet.
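
As a rough illustration of calling that endpoint, the Python sketch below builds the request URL from the query string quoted above and sends it to Zero. The Zero HTTP address (localhost:6080), the use of a plain GET request, and the helper names are assumptions; check the documentation for your Dgraph version before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def move_tablet_url(zero_http, tablet, group):
    """Build the /moveTablet URL for the given predicate and target group."""
    query = urlencode({"tablet": tablet, "group": group})
    return f"{zero_http}/moveTablet?{query}"


def move_tablet(zero_http, tablet, group):
    """Ask Zero to move a tablet (assumes a GET request; needs a running Zero)."""
    with urlopen(move_tablet_url(zero_http, tablet, group)) as resp:
        return resp.read().decode("utf-8")


# Only builds and prints the URL; nothing is sent to a cluster here.
print(move_tablet_url("http://localhost:6080", "name", 2))
```

Calling move_tablet("http://localhost:6080", "name", 2) would send the request. The target group has to exist already, so this only helps when the cluster has more than one group.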

#5

Thanks for your reply.
Another thing: if I have a massive dataset (maybe more than 1 TB and about 1 billion edges) that needs to be bulk loaded into the cluster, do I have to load it on one node? Can I split the dataset across other Zero nodes and bulk load in parallel?
I could only find one blog post about the bulk loader details, “Loading close to 1M edges/sec into Dgraph”, and the official documentation at deploy/#bulk-loader doesn’t mention the multi-Zero case.
Could you provide more detail about how Zero instances interact during a bulk load in a multi-Zero setup?


(Daniel Mai) #6

Currently bulk loader runs only on a single machine. In the soon-to-be-released Dgraph v1.1 we optimized both the live loader and bulk loader—in our own tests we’ve seen bulk loader peak to 4 million edges/sec.

A multi-Zero setup does not make a difference for bulk loading. Bulk loader must connect to the Zero leader, which assigns UIDs to the nodes in the cluster. Adding more Zero instances doesn’t make the loading process any faster. Most of the work is done by bulk loader itself, not by Zero.

Bulk loader is highly concurrent, so more CPU cores would definitely help speed up the bulk loading process.


#7

Very useful, this solved my problem. Thank you!

