Bulk loading data from Dask

sandys · August 3, 2019, 3:06pm

Hi guys,
we are running a large ETL process in Dask to pre-process tons of files.
This is an hourly process and is incremental (so we append to the graph already built in the previous hour).

How does one achieve this? It seems that the bulk loaders are all command-line based. Is there any other way to bulk load data into Dgraph from another system (in Python)?

MichelDiz · August 3, 2019, 6:28pm

We have a Python client (pydgraph); you can use it to build a kind of bridge between your processes.

The Live and Bulk loaders are indeed command-line based. But there are ways, in Python and elsewhere, to run these commands from within a program, for example with modules like “os” or “delegator”. JavaScript also has plenty of solutions for this.

BTW, keep in mind that the Bulk loader is only used to populate data for the first time. The Live loader is for a running cluster. The Live loader is technically just a program that uses a Dgraph client (the Dgo client), so it isn’t anything special. Only the Bulk loader is different.
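As a rough sketch of running the loader from Python, the standard-library `subprocess` module can stand in for “os” or “delegator”. The helper names, file name, and addresses below are hypothetical, and the flags (`--files`, `--alpha`, `--zero`, `--schema`) are assumed to match your Dgraph version, so check `dgraph live --help` before relying on them.

```python
import subprocess


def build_live_command(data_file, alpha="localhost:9080",
                       zero="localhost:5080", schema_file=None):
    """Build the argument list for a `dgraph live` invocation.

    Flag names are assumptions; verify them with `dgraph live --help`.
    """
    cmd = ["dgraph", "live", "--files", data_file,
           "--alpha", alpha, "--zero", zero]
    if schema_file is not None:
        cmd += ["--schema", schema_file]
    return cmd


def run_live_loader(data_file, **kwargs):
    """Run the live loader and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_live_command(data_file, **kwargs), check=True)
```

Passing the command as a list (rather than a shell string) avoids quoting problems with file paths.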

sandys · August 3, 2019, 6:37pm

Hi Michael
Thanks for replying.

Two points

Your Python SDK documentation does not show how to do this at scale, for example if I had 10 million records. That’s the issue.
I’m a bit confused by the second part of your answer, where you reference Go. Is using Golang mandatory? Our whole infrastructure is in Python.

Would you have any example of doing bulk loads into Dgraph through Python?

MichelDiz · August 3, 2019, 7:02pm

OK, I’ll check that.

No, I was only mentioning that the “Live loader” tool uses a client. So you can use a client to do the same work.

Nope, I don’t program in Python. But I have had adventures running command lines (Bash/Unix) from within a Python program, so it is possible. In fact anything is possible; it just takes planning.

In your case, I would export the data to JSON or RDF and inject it via the Live loader, using Python just to automate the data injection process.

If you are interested in better understanding how a migration process would work (which would look like what you intend to do), we have this post here: Migrating data from SQL to Dgraph (Dgraph Blog).
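The export step might look something like the sketch below, which writes flat records out as N-Quads lines that a loader can read. The record shape (a dict with an `id` key), the helper names, and the choice to store every value as a string literal are all assumptions for illustration. Record ids are used as blank-node labels, so they need to be valid label characters.

```python
import json


def record_to_nquads(record, id_key="id"):
    """Turn one flat dict into N-Quads lines, using the id as a blank node."""
    subject = f"_:{record[id_key]}"
    lines = []
    for predicate, value in record.items():
        if predicate == id_key or value is None:
            continue
        # json.dumps gives a double-quoted literal with quotes, backslashes
        # and control characters escaped; non-ASCII text is kept as UTF-8.
        literal = json.dumps(str(value), ensure_ascii=False)
        lines.append(f"{subject} <{predicate}> {literal} .")
    return lines


def write_rdf(records, path, id_key="id"):
    """Write all records to `path` and return the number of triples written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            for line in record_to_nquads(record, id_key):
                fh.write(line + "\n")
                count += 1
    return count
```

Blank-node labels are only meaningful within a single load, so an incremental hourly pipeline would still need its own way to map record ids onto nodes that already exist in the graph.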

MichelDiz · August 4, 2019, 1:28am

About doing this at scale: I had not read the question carefully at first, but the answer is simple.

When you write the program in Python, you need to create specific logic for batching (you can look at the Live loader logic for inspiration). You have to split your millions of records into smaller pieces. Something like 1k triples per transaction is ideal.

You can also create logic that balances mutations across the nodes in your cluster. This would help reduce the stress on the cluster as a whole.
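A minimal sketch of that batching logic follows, assuming `client` is a pydgraph `DgraphClient` (or anything with the same `txn()` / `mutate(set_nquads=...)` / `commit()` / `discard()` methods) and that the input is an iterable of N-Quads lines. The function names are hypothetical, and the default of 1000 comes from the suggestion above.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items without loading everything at once."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(client, nquad_lines, batch_size=1000):
    """Commit one transaction per batch; return the number of commits."""
    committed = 0
    for chunk in batched(nquad_lines, batch_size):
        txn = client.txn()
        try:
            txn.mutate(set_nquads="\n".join(chunk))
            txn.commit()
            committed += 1
        finally:
            # Cleans up if mutate or commit raised.
            txn.discard()
    return committed
```

Because `batched` consumes a plain iterator, the lines can be streamed from a Dask result or a file rather than held in memory. Retry handling for aborted transactions is left out here.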
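One simple way to spread mutations over the cluster is round-robin assignment, sketched below. Here `targets` is assumed to be a list holding one client per node (for example one pydgraph client per Alpha address), and the function name is hypothetical.

```python
from itertools import cycle


def assign_round_robin(batches, targets):
    """Pair each batch with the next target, cycling through targets in order."""
    targets = list(targets)
    if not targets:
        raise ValueError("at least one target is required")
    rotation = cycle(targets)
    for batch in batches:
        yield next(rotation), batch
```

Each `(target, batch)` pair can then be sent in its own transaction. As far as I know, pydgraph's `DgraphClient` also accepts several stubs and picks one per request, which may already be enough balancing for many setups.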
