Bulk loading data from Dask


(Sandeep Srinivasa) #1

Hi guys,
we are running a large ETL process in Dask to pre-process tons of files.
This is an hourly process and is incremental (so we append to the graph already built last hour).

How does one achieve this? It seems that the bulk loaders are all command-line based. Is there any other way to bulk load data into Dgraph from another system (in Python)?


(Michel Conrado) #2

We have a Python client, you can use it to create a kind of bridge between your processes.

The Live and Bulk loaders are indeed command-line based. But there are ways, and not only in Python, to run these commands from within a program, for example with the “os” or “delegator” modules. JavaScript also has plenty of solutions for this.
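
A minimal sketch of driving the Live loader from Python with the standard `subprocess` module. The helper names, addresses, and file name are hypothetical, and the flag names (`--files`, `--alpha`, `--zero`, `--schema`) are an assumption: they have changed between Dgraph releases, so check `dgraph live --help` for your version.

```python
import subprocess


def build_live_command(data_file, alpha="localhost:9080",
                       zero="localhost:5080", schema_file=None):
    """Build the argument list for a `dgraph live` run.

    Flag names are assumed; older releases used different ones.
    """
    cmd = ["dgraph", "live", "--files", data_file,
           "--alpha", alpha, "--zero", zero]
    if schema_file:
        cmd += ["--schema", schema_file]
    return cmd


def run_live_loader(data_file, **kwargs):
    """Run the Live loader and raise CalledProcessError on a non-zero exit."""
    cmd = build_live_command(data_file, **kwargs)
    return subprocess.run(cmd, check=True)


# Example (requires the dgraph binary on PATH and a running cluster):
# run_live_loader("hourly_batch.json", alpha="alpha1:9080", zero="zero1:5080")
```

Passing the command as a list avoids shell quoting problems with file names.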

BTW, keep in mind that the Bulk loader is only used to populate data for the first time. The Live loader is for a running cluster. The Live loader is technically just a program that uses a Dgraph client (Dgo, the Go client), so it isn’t anything special. Only the Bulk loader is different.


(Sandeep Srinivasa) #3

Hi Michel
Thanks for replying.

Two points

  1. Your Python SDK documentation does not show how to do this at scale, for example if I had 10 million records. That’s the issue.

  2. I’m a bit confused by the second part of your answer, where you reference Go. Is using Go mandatory? Our whole infrastructure is in Python.

Would you have any example of doing bulk loads through Python into Dgraph?


(Michel Conrado) #4

OK, I’ll check that.

No, Go is not mandatory. I was only mentioning that the Live loader tool uses a client, so you can use a client to do the same work.

Nope, I don’t have a Python example, since I don’t program in Python. But I have run command-line tools (Bash/Unix) from within a Python program, so it is possible. In fact anything is possible; it just takes planning.

In your case, I would export the data to JSON or RDF and inject it via the Live loader, using Python just to automate the data injection process.
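
The export step could look roughly like this. It is a sketch, assuming each record is a flat dict with a unique key (the `id` field name is hypothetical) and that the output is a JSON array of nodes, which is the shape Dgraph uses for JSON mutations. Note that blank-node labels such as `_:123` are only resolved within a single load, so an incremental hourly run still needs some way to map keys to uids that already exist; that part is outside this sketch.

```python
import json


def export_records_to_json(records, path, id_field="id"):
    """Write records as a JSON array of nodes for the Live loader.

    Each record's key becomes a blank-node label ("_:<key>") in "uid";
    the remaining fields become predicates. Returns the node count.
    """
    nodes = []
    for rec in records:
        node = {k: v for k, v in rec.items() if k != id_field}
        node["uid"] = f"_:{rec[id_field]}"
        nodes.append(node)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(nodes, fh)
    return len(nodes)


# Example with two hypothetical records:
# export_records_to_json(
#     [{"id": "u1", "name": "Alice"}, {"id": "u2", "name": "Bob"}],
#     "hourly_batch.json",
# )
```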

If you are interested in better understanding how a migration process would work (which would look like what you intend to do), we have this post: https://blog.dgraph.io/post/migrating-from-sql-to-dgraph/


(Michel Conrado) #5

About point 1: I had not read the question carefully at first, but the answer is simple.

When you write the program in Python, you need to implement your own batching logic (you can look at the Live loader’s logic for inspiration). You have to split your millions of records into smaller pieces. Something like 1k triples per transaction is ideal.
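
A minimal sketch of that batching logic. The function names are hypothetical. `batch_size` counts records, not triples, so a record with several fields produces several triples and the size may need lowering. The `pydgraph_mutate` helper assumes the pydgraph client's `txn()`, `mutate(set_obj=...)`, `commit()`, and `discard()` calls and is not exercised here.

```python
from itertools import islice


def batched(iterable, size=1000):
    """Yield lists of at most `size` items without loading everything at once."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(records, mutate, batch_size=1000):
    """Call `mutate(chunk)` once per batch; return the number of records sent."""
    sent = 0
    for chunk in batched(records, batch_size):
        mutate(chunk)
        sent += len(chunk)
    return sent


def pydgraph_mutate(client):
    """Return a callable that commits one batch per transaction (assumes pydgraph)."""
    def _mutate(chunk):
        txn = client.txn()
        try:
            txn.mutate(set_obj=chunk)
            txn.commit()
        finally:
            txn.discard()  # cleanup after commit or failure, per the usual client pattern
    return _mutate


# Example (requires pydgraph and a running Alpha):
# import pydgraph
# client = pydgraph.DgraphClient(pydgraph.DgraphClientStub("localhost:9080"))
# load_in_batches(records, pydgraph_mutate(client), batch_size=1000)
```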

You can also add logic that spreads mutations across the nodes in your cluster. This would help reduce the load on the cluster as a whole.
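
One simple way to spread the load is round-robin: hand each batch to the next node in turn. This is a sketch with hypothetical names; `targets` can be anything that identifies a node, such as an address string or one client object per Alpha, and round-robin is only one possible balancing strategy.

```python
from itertools import cycle


def assign_round_robin(batches, targets):
    """Pair each batch with the next target in turn, wrapping around."""
    targets = list(targets)
    if not targets:
        raise ValueError("at least one target is required")
    ring = cycle(targets)
    for batch in batches:
        yield next(ring), batch


# Example with hypothetical Alpha addresses:
# for alpha, batch in assign_round_robin(all_batches, ["alpha1:9080", "alpha2:9080"]):
#     send_batch_to(alpha, batch)  # send_batch_to is a placeholder for your own sender
```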