Build Kafka Connector for Dgraph in Live and Bulk Loader

Moved from GitHub dgraph/3967

Posted by mangalaman93:

This will allow loading data directly from Kafka.

Willem520 commented :

Great idea. I hope Dgraph can have close integration with stream-processing engines (e.g. Flink, Spark) in the near future.

campoy commented :

Hey @Willem520,

Could you tell us more about what you would expect from these integrations with Flink or Spark?

Willem520 commented :

> Hey @Willem520,
>
> Could you tell us more about what you would expect from these integrations with Flink or Spark?

Hello. In my project, I want to use Flink or Spark Streaming to process RDF or JSON data in real time, and I also need to migrate historical data from another graph database (e.g. JanusGraph) to Dgraph. But I found that when I used Spark and dgraph4j to process a large dataset (e.g. 5 million nodes), it always failed, and sometimes an Alpha node crashed.

campoy commented :

I’m sorry but I’m going to need more information on what you were actually building and how it failed.

If I understand correctly, you’re processing a stream of events in RDF or JSON format?
Or is it a batch analysis with 5 million nodes?

What exact API would you like us to provide to integrate with Spark or Flink?

Willem520 commented :

Hi. I used Spark to load 5 million nodes into memory across 100 partitions. In each partition, I built a mutation of 2,000 nodes in JSON format and used the dgraph4j client to execute txn.mutate. When I ran the program, it failed with the error message below:
(error screenshot attached)
If I used a smaller dataset (e.g. 500,000 nodes) with the same program, it succeeded.

mangalaman93 commented :

How many cores are you providing to each executor? How many executors are you running concurrently? You could try reducing the size of each transaction so that each one finishes quickly and the total number of pending transactions stays small.
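The suggestion above can be sketched as follows: instead of one large mutation per partition, split the records into small per-transaction batches so each commit finishes quickly and fewer transactions stay pending. This is a minimal illustration, not dgraph4j API; the `DgraphSink` interface and the batch size of 500 are assumptions standing in for the real `txn.mutate()`/commit call.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedLoader {
    // Stand-in for the code that would open a dgraph4j transaction,
    // build a Mutation from the batch's JSON, and commit it.
    interface DgraphSink {
        void writeBatch(List<String> jsonRecords);
    }

    // Splits records into batches of at most batchSize and writes each batch
    // as its own (small, short-lived) transaction. Returns the batch count.
    static int load(List<String> records, int batchSize, DgraphSink sink) {
        int batches = 0;
        for (int i = 0; i < records.size(); i += batchSize) {
            List<String> batch =
                records.subList(i, Math.min(i + batchSize, records.size()));
            sink.writeBatch(new ArrayList<>(batch)); // one transaction per batch
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 5000; i++) records.add("{\"name\":\"n" + i + "\"}");
        int batches = load(records, 500, b -> { /* txn.mutate + commit here */ });
        System.out.println(batches); // prints 10
    }
}
```

Smaller batches trade a little throughput for far less transaction contention on the Alphas, which is usually the right trade when large single mutations cause crashes.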

Willem520 commented :

I used 4 executor cores and 5 executors. I need to import at least 100 million records into Dgraph.

AshNL commented :

Not directly related to Dgraph, but Neo4j just announced a new product which will tightly integrate Neo4j with Kafka. I feel like this is a feature which might greatly impact DB choice for (new) projects. https://www.datanami.com/2019/10/01/neo4j-gets-hooks-into-kafka/

marvin-hansen commented :

@AshNL Have you ever used neo4j in your entire life?

We did for ~3 months and are actually migrating everything away from it to save our sanity and our company. I cannot remember any other database that caused more operational problems, more concurrency issues, and such consistently terrible performance. The most mind-boggling thing is that the company does listen to all reported problems, but they never fixed anything…

Meanwhile, we run the most mission-critical stuff on Postgres. We de-normalized those few tables to operate entirely join-free to sustain very high performance.

With Dgraph, there are a few rough edges because it's relatively new, but for the most part, when it runs, it just runs.

For the aforementioned Kafka connector, there are tutorials on how to write one. I think implementing the connector with a queue and proper batch-writing should do the trick.
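The "queue and proper batch-writing" idea can be sketched with plain JDK concurrency primitives: a consumer loop (e.g. a Kafka poll loop) pushes records into a bounded queue, and a writer thread drains them in batches, committing one small Dgraph transaction per batch. The bounded queue applies back-pressure when Dgraph falls behind. The sizes and the `flush()` stub are assumptions; this is not the dgraph4j or Kafka Connect API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueuedBatchWriter implements Runnable {
    private final BlockingQueue<String> queue;
    private final int maxBatch;
    private volatile boolean running = true;
    final List<List<String>> flushed = new ArrayList<>(); // visible for the demo

    QueuedBatchWriter(int capacity, int maxBatch) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.maxBatch = maxBatch;
    }

    // Producer side: a Kafka consumer loop would call this per record.
    // put() blocks when the queue is full, pushing back-pressure upstream.
    void submit(String record) throws InterruptedException { queue.put(record); }

    void stop() { running = false; }

    @Override public void run() {
        try {
            while (running || !queue.isEmpty()) {
                List<String> batch = new ArrayList<>();
                String first = queue.poll(10, TimeUnit.MILLISECONDS);
                if (first == null) continue;   // nothing arrived; re-check state
                batch.add(first);
                queue.drainTo(batch, maxBatch - 1); // grab up to maxBatch total
                flush(batch);                  // one small Dgraph txn per batch
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stub: real code would build a dgraph4j Mutation from the batch and commit.
    void flush(List<String> batch) { flushed.add(batch); }

    public static void main(String[] args) throws Exception {
        QueuedBatchWriter w = new QueuedBatchWriter(100, 10);
        Thread t = new Thread(w);
        t.start();
        for (int i = 0; i < 25; i++) w.submit("{\"id\":" + i + "}");
        Thread.sleep(200);   // let the writer drain the queue
        w.stop();
        t.join();
        System.out.println(w.flushed.stream().mapToInt(List::size).sum()); // prints 25
    }
}
```

Decoupling consumption from writing this way keeps each transaction small (as suggested earlier in the thread) while the queue absorbs bursts from the source topic.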

AshNL commented :

No need to start biting. I’m sorry I’m not as experienced as you are. In the meantime I have indeed written my own connector.