Spark, GraphX, machine learning

jchiu · September 23, 2016, 4:41am

Spark is very much like Google Cloud Dataflow: instead of writing MapReduce jobs and chaining them together manually, you treat your data as fancy arrays. In Spark, these fancy arrays are called RDDs (resilient distributed datasets). In Dataflow, they are called PCollections. Building pipelines with these fancy arrays is much nicer than writing lots of MapReduce jobs, which are too low-level.
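To make the contrast concrete, here is a rough sketch in plain Python (no Spark or Dataflow involved). It does the same word count twice: once with hand-written map, shuffle, and reduce phases, and once through a toy `FancyArray` wrapper. `FancyArray` and its method names are made up for illustration; they only loosely mirror RDD operations such as `flatMap`, `map`, and `reduceByKey`, and nothing here is distributed.

```python
from collections import defaultdict

lines = ["a b a", "b c"]

# Style 1: hand-written map / shuffle / reduce phases, chained manually.
def map_phase(lines):
    # Emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collapse each group of values into a single count.
    return {key: sum(values) for key, values in groups.items()}

manual_counts = reduce_phase(shuffle_phase(map_phase(lines)))

# Style 2: a toy "fancy array" (hypothetical class, purely local and in-memory).
class FancyArray:
    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # Apply fn to each item and flatten the resulting sequences.
        return FancyArray(y for x in self.items for y in fn(x))

    def map(self, fn):
        return FancyArray(fn(x) for x in self.items)

    def reduce_by_key(self, fn):
        # Combine the values of (key, value) pairs that share a key.
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return FancyArray(acc.items())

    def collect(self):
        return self.items

pipeline_counts = dict(
    FancyArray(lines)
    .flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
    .collect()
)

print(manual_counts)
print(pipeline_counts)
```

Both styles produce the same counts; the second just reads as one chain of operations on a collection instead of three separately wired phases.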

Yes, you have less control by default, and if something breaks it might be harder to trace, but I think this is becoming, or already is, the de facto way of building big data pipelines, and everyone is chipping away at these shortcomings.

This is also what inspired the “fancy query language” I spoke of last time. Setting that aside, I think it would be nice if Spark could connect to Dgraph as a backend somehow.

There is GraphX, which is built on Spark, for graph operations. I would guess that it is just a wrapper around RDDs used as triplet stores. Imagine if we could hook into GraphX and offer a speedup over classic RDDs. We could call these graph RDDs, and they would wrap Dgraph.

mrjn · September 23, 2016, 5:05am

Isn’t Spark designed for backend tasks like MapReduce? We have a different focus: real-time query load. Dgraph has a deadline of 1 minute or so to execute a query; otherwise, the query is cancelled. Dgraph is more like a database than a pipeline.

Though, once we support Gremlin, we should automatically work with TinkerPop, which a bunch of these setups use.

jchiu · September 23, 2016, 5:13am

I see. Yes, Spark is mainly for batch processing, so it is not that applicable here.

Spark Streaming is Spark’s attempt to appear more real-time. However, it still works by batching over a small period of time.
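A toy illustration of what "batching over a small period of time" means, in plain Python rather than the Spark Streaming API. The `micro_batches` helper, the event format, and the one-second interval are all assumptions made up for this sketch: events are grouped into fixed time windows, and each window is then processed as an ordinary small batch.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds."""
    batches = {}
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval share a window index.
        window = int(timestamp // interval)
        batches.setdefault(window, []).append(value)
    # Return the batches in time order.
    return [batches[window] for window in sorted(batches)]

# Hypothetical stream of (seconds since start, value) events.
events = [(0.1, 1), (0.4, 2), (1.2, 3), (2.7, 4), (2.9, 5)]

# Each micro-batch is handled like a tiny batch job; here we just sum it.
batch_sums = [sum(batch) for batch in micro_batches(events, interval=1.0)]
print(batch_sums)
```

The latency floor is the window length: an event cannot be reflected in a result until its window closes, which is why this style only appears real-time.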

I agree that supporting Gremlin is a better thing to do and that should be the next target.

eshwar · March 7, 2019, 11:28am

Hi @mrjn,

Does Dgraph support Gremlin now?

I too have a use case where I need to run complex graph algorithms on graphs stored in Dgraph.

What do you suggest: Dgraph + GraphX (from Spark), or Dgraph + TinkerPop?

Thanks,
Eshwar

mrjn · March 7, 2019, 3:15pm

We don’t support Gremlin yet.
