Can Dgraph use HDFS?

Hi, I have an idea. A lot of data is now stored in Hive. If the Dgraph bulk loader could read data directly from Hive or HDFS, that would be really cool.

Do you mean exposing options in the Dgraph bulk loader so that it can ingest data directly from HDFS and other similar big data stores?

Yes, that’s right. It would be more convenient.

That is a good idea. I do see some other databases offering similar APIs. It can be explored when we decide on a distributed bulk loader, as mentioned in the Scalability, performance, and reliability section of our Q1/Q2 roadmap for 2020. Would you mind adding a feature request in the comments below the roadmap? Based on community interest, we can try to prioritize it.

In addition to what is being discussed, the bulk loader is made to run once. If you need multiple loads in a functioning cluster, you must use the live loader.

If you did want to run the bulk loader again, in theory you would have to stop the entire cluster (leaving only Zero on) to do this procedure.

It is not only the bulk loader; the live loader has the same limitation. Both need --files and --schema, and those files must be stored locally. In my case, my graph data is stored in Hive, so to load it into Dgraph I must first export the data from Hive to files. And if the files are not on the same machine as Dgraph, I also have to transfer them to that machine. It takes too much time.

What is the format of hdfs output?

RDF or JSON format is OK, because I can use Spark to transform the data. I want to use the bulk loader like “dgraph bulk --files hdfs_path/rdf --schema hdfs_path/schema”.
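
For illustration, the RDF side of that transformation could look roughly like the sketch below. It turns (subject, predicate, object) values into N-Quad lines using blank nodes, which is a syntax Dgraph accepts. The function names `escape_literal` and `to_nquad` are made up for this example, and it assumes subject and node labels contain no spaces. In a Spark job a function like this would presumably be applied to each row, but no Spark API is used here.

```python
def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Quads string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_nquad(subject, predicate, obj, obj_is_node=False):
    """Build one N-Quad line; subject (and obj, if a node) become blank nodes.

    Assumption: labels are simple identifiers without spaces.
    """
    if obj_is_node:
        target = f"_:{obj}"
    else:
        target = f'"{escape_literal(str(obj))}"'
    return f"_:{subject} <{predicate}> {target} ."


if __name__ == "__main__":
    print(to_nquad("alice", "name", "Alice"))
    print(to_nquad("alice", "friend", "bob", obj_is_node=True))
```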

Hmm, is the data too big? Because you could create a small program that injects the data as JSON (via a client) on demand.

Perhaps this is the simplest solution at the moment. If Hive or HDFS can export JSON, it would be feasible to create a small program, like a “driver”, that does it “live”.

I don’t know Hive, but it seems to me that it is SQL-like. Does it have graph support?

The dataset is more than 100 million. It takes too much time to transform.

Yeah, but that’s the problem. You can use the bulk loader only once. You can’t use it as a pipeline. In theory you can shut down the cluster to do a new load, but we haven’t tested whether that is safe. And it is a trade-off: to bulk load millions, you need to shut down. If you need the cluster running during this procedure, you have to use a client or the live loader.

So, is there no way to generate small JSON chunks and do the load in a procedural way?
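
A minimal sketch of what such a chunked, procedural load could look like, assuming the data can be exported as JSON Lines (one JSON object per line). `send_batch` is a hypothetical callback standing in for whatever client call performs the mutation; it is not a real Dgraph API, and the batch size of 1000 is an arbitrary choice.

```python
import json
from itertools import islice


def read_json_lines(stream):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def chunked(records, size):
    """Group an iterable of records into lists of at most `size` items."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_in_chunks(stream, send_batch, size=1000):
    """Send the stream to `send_batch` one chunk at a time.

    `send_batch` is a hypothetical callback that would wrap a client mutation.
    Returns the number of records sent.
    """
    total = 0
    for batch in chunked(read_json_lines(stream), size):
        send_batch(batch)
        total += len(batch)
    return total
```

Only one chunk is held in memory at a time, so the size of the export does not matter to the driver itself; throughput would still depend on how fast the cluster accepts mutations.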

Yes, I have no idea about that.

I see that you commented about Spark some time ago: https://github.com/dgraph-io/dgraph/issues/3967#issuecomment-530740105

Is there an issue for this?

If you could use Spark as a middle layer in front of Dgraph, would that be good? BTW, I have never used Spark either, but it feels like it can output JSON.

UPDATE: I have found it

Yes, I used Spark as a middle layer in front of Dgraph. But Dgraph could not handle inserting or upserting the data through mutations well, because the dataset is too large.

It’s a good idea to support big data frameworks like Hadoop/Spark/Flink, because Dgraph is usually used on large datasets while Neo4j is used on small ones. We are importing data online using Flink through the Dgraph Java interface. @Willem520

@MichelDiz @alvis
The big problem with the bulk loader now is that it often runs out of memory (OOM) on large datasets that cannot be loaded into memory all at once.

And I have an idea: what if we split the map or the reduce phase into many steps, each processing only a few files with limited memory? With many steps, it could still work on a large dataset, much like a distributed loader does, and the bottleneck would be disk rather than memory.

It may be a little slower, but it can solve many problems with few modifications to the current code.
What do you think?
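
To make the idea concrete, here is a generic sketch of step-wise processing with bounded memory, written as an external merge sort over text files. It is not the bulk loader's actual map/reduce code, and the names `sort_in_steps`, `merge_runs`, and `files_per_step` are invented for this example. Each step loads only a few input files, sorts them, and spills a sorted run to disk; a streaming k-way merge then combines the runs.

```python
import heapq
import os


def sort_in_steps(input_paths, files_per_step, tmp_dir):
    """Sort inputs a few files at a time, writing one sorted run per step.

    Peak memory is bounded by the size of `files_per_step` files.
    """
    run_paths = []
    for start in range(0, len(input_paths), files_per_step):
        lines = []
        for path in input_paths[start:start + files_per_step]:
            with open(path, encoding="utf-8") as f:
                lines.extend(line.rstrip("\n") for line in f)
        lines.sort()
        run_path = os.path.join(tmp_dir, f"run-{len(run_paths):05d}.txt")
        with open(run_path, "w", encoding="utf-8") as out:
            out.writelines(line + "\n" for line in lines)
        run_paths.append(run_path)
    return run_paths


def merge_runs(run_paths):
    """Stream a k-way merge of the sorted runs, holding one line per run."""
    files = [open(p, encoding="utf-8") for p in run_paths]
    try:
        # Strip newlines before comparing so the order matches sort_in_steps.
        streams = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
```

The trade-off is the one described above: extra disk reads and writes for the intermediate runs in exchange for a memory footprint that no longer grows with the total dataset size.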

Yes, that is what I wanted to say. @JimWen

Dgraph is not ready for big data.