Can dgraph use hdfs?

Willem520 · May 21, 2020, 9:13am

Hi, I have an idea. A lot of data is now stored in Hive. It would be great if the Dgraph bulk loader could read data directly from Hive or HDFS.

Anurag · May 21, 2020, 10:28am

Do you mean exposing options in the Dgraph bulk loader so that it can ingest data directly from HDFS and other similar big data stores?

Willem520 · May 21, 2020, 10:47am

Yes, that’s right. It would be more convenient.

Anurag · May 21, 2020, 11:03am

That is a good idea. I do see some other databases offering similar APIs. It can be explored when we decide on the distributed bulk loader, as mentioned in the Scalability, performance, and reliability section of our Q1/Q2 roadmap for 2020. Would you mind adding a feature request in the comments below the roadmap? Based on community interest, we can try to prioritize it.

MichelDiz · May 21, 2020, 4:35pm

In addition to what is being discussed, the Bulk loader is made to run once. If you need multiple loads in a functioning cluster, you must use live loader.

If you wanted to run the bulk loader again, you would in theory have to stop the entire cluster (leaving only Zero running) to do so.

Willem520 · May 22, 2020, 2:43am

This applies not only to the bulk loader but also to the live loader. Both require --files and --schema, and those files must be stored locally. In my case, the graph data is stored in Hive, so to load it into Dgraph I must first export the data from Hive to files. If the files are not on the same machine as Dgraph, I also have to transfer them to that machine. All of this takes too much time.

MichelDiz · May 22, 2020, 2:45am

What is the format of the HDFS output?

Willem520 · May 22, 2020, 2:52am

RDF or JSON format is OK, because I can use Spark to transform the data. I want to use bulk like “dgraph bulk --files hdfs_path/rdf --schema hdfs_path/schema”.
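Until the loaders can read from HDFS directly, the workaround described above (copy the files to the Dgraph machine, then point --files and --schema at the local copies) can be sketched roughly as follows. This is only a sketch under assumptions: the function name `staging_commands` and all paths are hypothetical, it assumes the `hdfs` and `dgraph` binaries are on the PATH of the machine that runs the load, and it only builds the command lines rather than running them.

```python
import posixpath
import shlex


def staging_commands(hdfs_rdf, hdfs_schema, local_dir):
    """Build the commands that copy two HDFS files locally and bulk-load them.

    Nothing is executed here; the caller decides how to run the commands.
    All paths are placeholders chosen by the caller.
    """
    local_rdf = posixpath.join(local_dir, posixpath.basename(hdfs_rdf))
    local_schema = posixpath.join(local_dir, posixpath.basename(hdfs_schema))
    return [
        # Copy the exported files from HDFS to the local staging directory.
        ["hdfs", "dfs", "-get", hdfs_rdf, local_rdf],
        ["hdfs", "dfs", "-get", hdfs_schema, local_schema],
        # Run the bulk loader against the local copies.
        ["dgraph", "bulk", "--files", local_rdf, "--schema", local_schema],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in staging_commands(
        "/warehouse/graph/data.rdf", "/warehouse/graph/data.schema", "/tmp/dgraph_stage"
    ):
        print(shlex.join(cmd))
```

The copy step is exactly the cost being complained about here; the sketch only makes the two-stage procedure explicit, it does not remove it.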

MichelDiz · May 22, 2020, 3:51am

Hmm, is the data too big? Because you could create a small program that injects the data as JSON (via a client) on demand.

Perhaps this is the simplest solution at the moment. If Hive or HDFS can export JSON, it would be feasible to create a small “driver” program that does it “live”.

I don’t know Hive, but it seems to be SQL-like. Does it have graph support?

Willem520 · May 22, 2020, 3:57am

The dataset is larger than 100 million. Transforming it takes too much time.

MichelDiz · May 22, 2020, 4:17am

Yeah, but that’s the problem. You can use the bulk loader only once; you can’t use it as a pipeline. In theory you can shut down the cluster to do a new load, but we haven’t tested whether that is safe. And it is a trade-off: to bulk load millions, you need to shut down. If you need the cluster running during this procedure, you have to use a client or the live loader.

So, is there no way to generate small JSON chunks and do the load in a procedural way?
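To make the “small JSON chunks” idea concrete, here is a minimal sketch of the chunking part only. The names `json_chunks`, `load_in_chunks` and the `send` callback are hypothetical; `send` stands in for whatever actually submits a payload to Dgraph (for example a client mutation), which is not shown and not assumed here.

```python
import json
from itertools import islice


def json_chunks(records, chunk_size):
    """Yield JSON array strings, each holding at most chunk_size records.

    records can be any iterable (for example rows streamed out of an export),
    so the whole dataset never has to sit in memory at once.
    """
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, chunk_size))
        if not batch:
            return
        yield json.dumps(batch)


def load_in_chunks(records, chunk_size, send):
    """Pass each JSON chunk to send() and return the number of chunks sent."""
    sent = 0
    for payload in json_chunks(records, chunk_size):
        send(payload)  # hypothetical callback that submits one chunk
        sent += 1
    return sent


if __name__ == "__main__":
    # Illustrative records only; print() stands in for a real sender.
    rows = ({"uid": "_:n%d" % i, "name": "node %d" % i} for i in range(5))
    load_in_chunks(rows, 2, print)
```

Because each chunk is independent, a failed chunk can be retried on its own and the cluster stays up while the load runs, which is the trade-off against the bulk loader described above.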

Willem520 · May 22, 2020, 4:20am

Yes, I have no idea how to do that.

MichelDiz · May 22, 2020, 4:38am

I see that you commented about Spark some time ago on “Build Kafka Connector for Dgraph in Live and Bulk Loader · Issue #3967 · dgraph-io/dgraph · GitHub”.

Is there an issue for this?

If you could use Spark as an intermediate step to Dgraph, would that be good? BTW, I have never used Spark either, but it feels like it can output JSON.

UPDATE: I have found it

github.com/dgraph-io/dgraph: Spark connector for Dgraph
Opened 05:32PM - 13 Feb 18 UTC by jimanvlad · closed 04:53AM - 21 Jul 20 UTC
Labels: kind/feature, popular, status/accepted, area/integrations

“Hi, It would be great if, at some point, a Spark connector was to be created. …Neo4J uses its connector both for preprocessing and loading data in, but also to get sub-graphs out for processing in Spark. https://neo4j.com/developer/apache-spark/ https://spark.apache.org/docs/0.9.0/graphx-programming-guide.html Vlad”

Willem520 · May 22, 2020, 7:24am

Yes, I used Spark as an intermediate step to Dgraph. But Dgraph did not perform well when using mutations to insert or upsert the data, because the dataset is too large.

JimWen · May 25, 2020, 3:09am

It’s a good idea to support big data frameworks like Hadoop/Spark/Flink, because Dgraph is usually used on large datasets while Neo4j is used on small ones. We are importing data online using Flink through the Dgraph Java interface. @Willem520

@MichelDiz @Anurag
The big problem with the bulk loader now is that it often runs out of memory (OOM) on large datasets that cannot be loaded into memory all at once.

I have an idea: what if we split the map or the reduce phase into many steps, each processing only a few files within the limited memory? With many steps, it can still work on a large dataset, just like a distributed system does, and the bottleneck becomes disk rather than memory.

It may be a little slower, but it can solve many problems with few modifications to the current code.
What do you think?
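The general technique behind this proposal is an external merge sort: sort bounded batches, spill each one to disk, then stream-merge the spilled files. The sketch below illustrates only that generic idea; it is not how the Dgraph bulk loader is implemented. The name `sort_in_steps` and its parameters are hypothetical, and it assumes the input is plain text lines with no embedded newline characters.

```python
import heapq
import os
import tempfile


def read_run(path):
    """Stream one sorted run back from disk, one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def sort_in_steps(lines, max_lines_in_memory, work_dir):
    """Yield the input lines in sorted order while holding only a bounded batch.

    Map-like step: sort each batch in memory and spill it to its own file.
    Reduce-like step: merge the sorted files, keeping one line per file in memory.
    """
    run_paths = []
    batch = []

    def spill():
        batch.sort()
        fd, path = tempfile.mkstemp(dir=work_dir, suffix=".run")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in batch)
        run_paths.append(path)
        batch.clear()

    for line in lines:
        batch.append(line.rstrip("\n"))
        if len(batch) >= max_lines_in_memory:
            spill()
    if batch:
        spill()

    try:
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        # Remove the temporary run files once merging is finished.
        for p in run_paths:
            os.remove(p)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for item in sort_in_steps(["d", "b", "a", "c", "e"], 2, d):
            print(item)
```

Memory use is bounded by the batch size during the first step and by the number of run files during the merge, so the cost moves to disk I/O, which matches the “a little slower, but bounded memory” trade-off described above.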

Willem520 · May 25, 2020, 4:30am

Yes, that is what I wanted to say. @JimWen

BlankRain · June 1, 2020, 7:25am

dgraph is not ready for big data.
