What I Want to Do
I have a huge amount of data residing in Hive tables, and I need to load the output of a Spark transformation into Dgraph using the live loader.
What I Did
I am reading the data in parallel with Spark and building a JSON file for the live loader, then running the live loader with that JSON file directly on the Dgraph cluster. However, I want to invoke the Dgraph live loader from within the Spark job itself.
Hi @dharm0795
Live loader cannot be run via a Spark job. You could create an RDF file via the Spark job and then invoke live loader in the usual manner.
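For illustration, a rough sketch of that flow with Spark's Java API could look like the one below. The Hive table, column names, output path, and live loader flags are placeholders, and the exact flags may differ across Dgraph versions:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveToNQuads {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-to-nquads")
                .enableHiveSupport()
                .getOrCreate();

        // Read the source Hive table (placeholder table/column names).
        Dataset<Row> people = spark.sql("SELECT id, name, email FROM db.people");

        // Map each row to N-Quad lines; blank nodes let live loader assign uids.
        // Real data would need proper escaping of string literals.
        Dataset<String> nquads = people.map((MapFunction<Row, String>) row -> {
            String bn = "_:p" + row.getLong(0);
            return bn + " <name> \"" + row.getString(1) + "\" .\n"
                 + bn + " <email> \"" + row.getString(2) + "\" .";
        }, Encoders.STRING());

        // Write plain-text part files. Live loader infers the format from the
        // file extension, so rename the part files to .rdf (or pass a format flag).
        nquads.write().text("/data/export/people-rdf");

        spark.stop();

        // Then invoke live loader outside Spark, e.g.:
        //   dgraph live -f /data/export/people-rdf/part-00000.rdf \
        //       --alpha <alpha-host>:9080 --zero <zero-host>:5080
    }
}
```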
@anand That way we have to run it on the same machine where Dgraph is running, which can cause a performance issue.
I just want to understand how the live loader mutates in parallel internally.
@anand How much parallelism can we achieve using the Dgraph Java client for upsert queries? Will it scale to uploading billions of entries with the Dgraph Java client?
Hi @dharm0795,
You can of course write a parallel Java client. But please note the following:
1. You will have to tune concurrency and batch size.
2. You will have to manage retries in case of temporary errors (network issues, a Dgraph server that is too busy, or anything else).
3. If you are planning to run a Java client via a Spark job, you will have to consider how to implement points 1 and 2 in the context of a Spark job (what happens if a job dies?). I am not sure what best practices to follow or pitfalls to avoid here. A rough sketch of points 1 and 2 follows this list.
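For what it's worth, a minimal sketch of points 1 and 2 with dgraph4j might look like the following. The endpoint, predicates, batch data, worker count, and backoff values are all placeholders, and it assumes the email predicate has an index to support the upsert query:

```java
import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.DgraphProto.Request;
import io.dgraph.Transaction;
import io.dgraph.TxnConflictException;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelUpserts {

    // Point 1: tune these for your cluster; the values here are arbitrary.
    private static final int WORKERS = 8;
    private static final int MAX_RETRIES = 5;

    public static void main(String[] args) {
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("alpha-host", 9080) // placeholder endpoint
                .usePlaintext()
                .build();
        DgraphClient client = new DgraphClient(DgraphGrpc.newStub(channel));

        List<String> emails = List.of("a@example.com", "b@example.com"); // placeholder data
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        for (String email : emails) {
            pool.submit(() -> upsertWithRetry(client, email));
        }
        pool.shutdown();
    }

    // Point 2: retry aborted transactions with a simple backoff.
    static void upsertWithRetry(DgraphClient client, String email) {
        // Upsert block: reuse the node if the email already exists, otherwise create it.
        String query = "query { u as var(func: eq(email, \"" + email + "\")) }";
        String nquads = "uid(u) <email> \"" + email + "\" .";

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            Transaction txn = client.newTransaction();
            try {
                Mutation mu = Mutation.newBuilder()
                        .setSetNquads(ByteString.copyFromUtf8(nquads))
                        .build();
                Request req = Request.newBuilder()
                        .setQuery(query)
                        .addMutations(mu)
                        .setCommitNow(true)
                        .build();
                txn.doRequest(req);
                return; // success
            } catch (TxnConflictException e) {
                // Conflicting concurrent write; back off and retry.
                try {
                    Thread.sleep(100L * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            } finally {
                txn.discard(); // no-op if already committed
            }
        }
        // Exhausted retries: in a real job, log this batch or send it to a dead-letter store.
    }
}
```

Since the upsert keys on the email value, replaying the same batch after a failed Spark task should not create duplicate nodes, which also helps with point 3.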
IMO, let Spark do its thing and spit out a file. Live loader handles points 1 and 2 above and has configuration options available. We have heavily tuned memory usage in v20.11 across the alphas as well as the live and bulk load tools, and your use case will benefit from this.
I use Benthos to process message bus messages and insert them into Dgraph (via a custom-written output using the binary interface in dgo). Results are pretty good: 100k RDF messages/s. This can be distributed, which can cause ErrTxnAbort errors, but those just have to be retried. Getting the right balance of number of pods vs. abort error rate is important for my throughput.
Using distributed locks can alleviate the aborts, but that slowed me down just as much, so I removed them.