Spark, GraphX, machine learning

Spark is very much like Google Cloud Dataflow in the sense that, instead of writing MapReduce jobs and chaining them together manually, you can treat your data as fancy arrays. In Spark, these fancy arrays are called RDDs. In Dataflow, they are called PCollections. Building pipelines out of these fancy arrays is much nicer than writing lots of MapReduce jobs, which are too low level.
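To make the "fancy arrays" idea concrete, here is a minimal sketch in plain Python. It is not Spark or Dataflow code: `FancyArray` is a hypothetical single-machine stand-in for an RDD or PCollection, and its method names only echo the general style of such APIs. The point is just that a word count reads as one chained expression rather than separately wired map and reduce stages.

```python
from typing import Callable, Iterable


class FancyArray:
    """Hypothetical single-machine stand-in for an RDD or PCollection."""

    def __init__(self, items: Iterable):
        self.items = list(items)

    def flat_map(self, fn: Callable) -> "FancyArray":
        # Apply fn to each item and flatten the resulting iterables.
        return FancyArray(out for item in self.items for out in fn(item))

    def map(self, fn: Callable) -> "FancyArray":
        return FancyArray(fn(item) for item in self.items)

    def reduce_by_key(self, fn: Callable) -> "FancyArray":
        # Combine all values that share a key, like a reduce phase would.
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return FancyArray(acc.items())

    def collect(self) -> list:
        return list(self.items)


def word_count(lines):
    # The whole pipeline is a single chain of collection operations.
    return dict(
        FancyArray(lines)
        .flat_map(str.split)
        .map(lambda word: (word.lower(), 1))
        .reduce_by_key(lambda a, b: a + b)
        .collect()
    )
```

A real engine would distribute each step across machines and run it lazily; this toy version evaluates everything eagerly in memory, which is enough to show the programming model.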

Yes, you have less control by default and if something breaks, it might be harder to trace, but I think it is becoming or is already the de facto way of building big data pipelines, and everyone is chipping away at these shortcomings.

This is also what inspired the “fancy query language” I spoke of last time. Disregarding that, I think it would be nice if Spark could connect to Dgraph as a backend somehow.

There is GraphX, which is built on Spark, for graph operations. I would guess that it is just a wrapper around RDDs used as triplet stores. Imagine if we could hook into GraphX and offer a speedup over using classic RDDs. We could call these graph RDDs, and they could wrap around Dgraph.
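As a rough illustration of the "collection of triplets" guess above, here is a plain-Python sketch. It makes no claim about how GraphX is actually implemented; `Triplet`, `out_degrees`, `neighbors`, and `two_hop` are hypothetical names made up for this example.

```python
from collections import Counter, namedtuple

# Hypothetical triplet record: one directed, labelled edge.
Triplet = namedtuple("Triplet", ["src", "predicate", "dst"])


def out_degrees(triplets):
    # Count outgoing edges per source node with a single pass.
    return Counter(t.src for t in triplets)


def neighbors(triplets, node, predicate=None):
    # Direct successors of node, optionally restricted to one predicate.
    return sorted(
        {
            t.dst
            for t in triplets
            if t.src == node and (predicate is None or t.predicate == predicate)
        }
    )


def two_hop(triplets, node):
    # Nodes reachable in exactly two hops; each hop scans every triplet.
    first_hop = {t.dst for t in triplets if t.src == node}
    return sorted({t.dst for t in triplets if t.src in first_hop})
```

Note that every hop in `two_hop` is a full scan over the triplets. On a flat collection, traversals turn into repeated scans or joins, which is presumably where a graph-native backend could offer the speedup.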


Isn’t Spark designed for backend tasks like MapReduce? We have a different focus: realtime query load. Dgraph has a deadline of 1 minute or so to execute a query; otherwise, the query is cancelled. Dgraph is more like a database, and not necessarily a pipeline.

Though, once we support Gremlin, we should automatically work with TinkerPop, which a bunch of these setups use.

I see. Yes, Spark is mainly for batch processing, so it is not that applicable here.

Spark Streaming is Spark’s attempt to appear more real-time. However, it still works by batching over small intervals of time.
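The batching idea can be sketched in a few lines of plain Python. This is not the Spark Streaming API; `micro_batches`, `process_stream`, `interval_seconds`, and `batch_fn` are hypothetical names, and the sketch works on a finite list of `(timestamp, value)` events rather than a live stream.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, value) events into fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // interval_seconds) * interval_seconds
        batches[window_start].append(value)
    # Return windows in time order as (window_start, values) pairs.
    return [(start, batches[start]) for start in sorted(batches)]


def process_stream(events, interval_seconds, batch_fn):
    # Run an ordinary batch computation once per small window.
    return [
        (start, batch_fn(values))
        for start, values in micro_batches(events, interval_seconds)
    ]
```

A result for a window is only available once that window closes, so latency is bounded below by the interval length. That is why this style feels near real-time rather than truly real-time.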

I agree that supporting Gremlin is a better thing to do and that should be the next target.

Hi @mrjn,

Does Dgraph support Gremlin now?

I too have a use case where I need to run complex graph algorithms on graphs stored in Dgraph.

What do you suggest: Dgraph + GraphX (from Spark) or Dgraph + TinkerPop?

Thanks,
Eshwar

We don’t support Gremlin yet.