Spark is very much like Google Cloud Dataflow: instead of writing MapReduce jobs and chaining them together manually, you can treat your data as fancy arrays. In Spark, these fancy arrays are called RDDs; in Dataflow, they are called PCollections. Building pipelines out of these fancy arrays is much nicer than hand-writing lots of low-level MapReduce jobs.
Yes, you get less control by default, and when something breaks it can be harder to trace. But I think this is becoming, or already is, the de facto way of building big data pipelines, and everyone is chipping away at those shortcomings.
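To make the "fancy array" idea concrete, here is a toy sketch in plain Python. `FancyArray` is a hypothetical stand-in for an RDD or PCollection, not Spark's actual API; the point is that a word count becomes one chained pipeline rather than hand-wired map and reduce stages.

```python
class FancyArray:
    """Toy stand-in for an RDD/PCollection (hypothetical, not Spark's API)."""
    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, f):
        # Apply f to each element and flatten the results.
        return FancyArray(y for x in self.data for y in f(x))

    def map(self, f):
        return FancyArray(f(x) for x in self.data)

    def reduce_by_key(self, f):
        # Combine values that share a key, like Spark's reduceByKey.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return FancyArray(acc.items())

    def collect(self):
        return list(self.data)

# Word count as one chained pipeline.
lines = FancyArray(["spark is nice", "dataflow is nice"])
counts = (lines
          .flat_map(str.split)
          .map(lambda w: (w, 1))
          .reduce_by_key(lambda a, b: a + b))
print(dict(counts.collect()))  # {'spark': 1, 'is': 2, 'nice': 2, 'dataflow': 1}
```

In real Spark each of those calls would build up a lazy execution plan over a cluster, but the chaining style is the same.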
This is also what inspired the “fancy query language” I spoke of last time. Disregarding that, I think it would be nice if Spark could connect to Dgraph as a backend somehow.
There is GraphX, which is built on Spark, for graph operations. I would guess that it is essentially a wrapper around RDDs used as triplet stores. Imagine if we could hook into GraphX and offer a speedup over classic RDDs: call these graph RDDs, and have them wrap around Dgraph.
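Here is a rough sketch of what I mean by a triplet store, again in plain Python rather than GraphX's actual types. Edges are stored as `(subject, predicate, object)` tuples, and a query is a scan; a hypothetical Dgraph-backed graph RDD could answer the same query on the server side instead of scanning a distributed collection.

```python
# A graph as a flat collection of triplets -- the shape I'd guess
# GraphX uses over RDDs. All names here are made up for illustration.
triplets = [
    ("alice", "follows", "bob"),
    ("alice", "follows", "carol"),
    ("bob",   "follows", "carol"),
]

def out_neighbors(triplets, node, pred):
    """Scan the triplet store for edges matching (node, pred, *)."""
    return sorted(dst for src, p, dst in triplets if src == node and p == pred)

print(out_neighbors(triplets, "alice", "follows"))  # ['bob', 'carol']
```

The scan is the part a graph backend would speed up: Dgraph already indexes by subject and predicate, so the same lookup would not have to touch every edge.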