What I Want to Do
I have a huge amount of data residing in Hive tables, and I need to load the output of a Spark transformation into Dgraph using the live loader.
What I Did
I am reading the data in parallel with Spark and building a JSON file for the live loader, then running the live loader with that JSON file directly on the Dgraph cluster. However, I want to invoke the Dgraph live loader from within the Spark job itself.
Hi @dharm0795
Live loader cannot be run via a Spark job. You could create an RDF file via the Spark job and then invoke live loader in the usual manner.
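For illustration, a rough sketch of that flow with Spark's Java API could look like the one below. The Hive table, column names, output path, and live loader flags are placeholders, and the exact flags may differ across Dgraph versions:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveToNQuads {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-to-nquads")
                .enableHiveSupport()
                .getOrCreate();

        // Read the source Hive table (placeholder table/column names).
        Dataset<Row> people = spark.sql("SELECT id, name, email FROM db.people");

        // Map each row to N-Quad lines; blank nodes let live loader assign uids.
        // Real data would need proper escaping of string literals.
        Dataset<String> nquads = people.map((MapFunction<Row, String>) row -> {
            String bn = "_:p" + row.getLong(0);
            return bn + " <name> \"" + row.getString(1) + "\" .\n"
                 + bn + " <email> \"" + row.getString(2) + "\" .";
        }, Encoders.STRING());

        // Write plain-text part files. Live loader infers the format from the
        // file extension, so rename the part files to .rdf (or pass a format flag).
        nquads.write().text("/data/export/people-rdf");

        spark.stop();

        // Then invoke live loader outside Spark, e.g.:
        //   dgraph live -f /data/export/people-rdf/part-00000.rdf \
        //       --alpha <alpha-host>:9080 --zero <zero-host>:5080
    }
}
```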
@anand That way we have to run it on the same machine where Dgraph is running, which can cause a performance issue.
I just want to understand how the live loader mutates in parallel internally.
@anand How much parallelism can we achieve using the Dgraph Java client for upsert queries? Will it scale to uploading billions of entries with the Dgraph Java client?
Hi @dharm0795,
You can of course write a parallel Java client. But please note the following:
1. You will have to tune concurrency and batch size.
2. You will have to manage retries in case of temporary errors (network issues, a Dgraph server that is too busy, or anything else).
3. If you are planning to run a Java client via a Spark job, you will have to consider how to implement points 1 and 2 in the context of a Spark job (what happens if a job dies?). I am not sure what best practices to follow or pitfalls to avoid here. A rough sketch of points 1 and 2 follows this list.
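For what it's worth, a minimal sketch of points 1 and 2 with dgraph4j might look like the following. The endpoint, predicates, batch data, worker count, and backoff values are all placeholders, and it assumes the email predicate has an index to support the upsert query:

```java
import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.DgraphProto.Request;
import io.dgraph.Transaction;
import io.dgraph.TxnConflictException;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelUpserts {

    // Point 1: tune these for your cluster; the values here are arbitrary.
    private static final int WORKERS = 8;
    private static final int MAX_RETRIES = 5;

    public static void main(String[] args) {
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("alpha-host", 9080) // placeholder endpoint
                .usePlaintext()
                .build();
        DgraphClient client = new DgraphClient(DgraphGrpc.newStub(channel));

        List<String> emails = List.of("a@example.com", "b@example.com"); // placeholder data
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        for (String email : emails) {
            pool.submit(() -> upsertWithRetry(client, email));
        }
        pool.shutdown();
    }

    // Point 2: retry aborted transactions with a simple backoff.
    static void upsertWithRetry(DgraphClient client, String email) {
        // Upsert block: reuse the node if the email already exists, otherwise create it.
        String query = "query { u as var(func: eq(email, \"" + email + "\")) }";
        String nquads = "uid(u) <email> \"" + email + "\" .";

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            Transaction txn = client.newTransaction();
            try {
                Mutation mu = Mutation.newBuilder()
                        .setSetNquads(ByteString.copyFromUtf8(nquads))
                        .build();
                Request req = Request.newBuilder()
                        .setQuery(query)
                        .addMutations(mu)
                        .setCommitNow(true)
                        .build();
                txn.doRequest(req);
                return; // success
            } catch (TxnConflictException e) {
                // Conflicting concurrent write; back off and retry.
                try {
                    Thread.sleep(100L * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            } finally {
                txn.discard(); // no-op if already committed
            }
        }
        // Exhausted retries: in a real job, log this batch or send it to a dead-letter store.
    }
}
```

Since the upsert keys on the email value, replaying the same batch after a failed Spark task should not create duplicate nodes, which also helps with point 3.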
IMO, let Spark do its thing and spit out a file. Live loader handles points 1 and 2 above and has configuration options available. We have heavily tuned memory usage in v20.11 across the alphas as well as the live and bulk load tools, and your use case will benefit from this.
I use Benthos to process message bus messages and insert them into Dgraph (via a custom-written output using the binary interface in dgo). Results are pretty good: 100k RDF messages/s. This can be distributed, which can cause ErrTxnAbort errors, but those just have to be retried. Getting the right balance of number of pods vs. abort error rate is important for my throughput.
Using distributed locks can alleviate the aborts, but that slowed me down just as much, so I removed them.