Graph Data Science on Dgraph

Hi folks!

We are a startup building a social learning tool on top of a graph database. We need a sufficiently good recommendation system, as well as an API that serves predictions, and the two candidates for the graph database are (as for many teams) Neo4j and Dgraph (we will use GraphQL regardless).

As the data science components (search algorithms, predictions) are a core part of the product, we are very tempted to choose Neo4j because of its Graph Data Science Library. However, we like the “GraphQL all the way” aspect of Dgraph, as well as the fact that Dgraph is built in Go (vs. the slower Java/Scala stack of Neo4j), in addition to Neo4j's (most likely) stiffer pricing.

We saw that the data science tooling for Dgraph is very limited at the moment, so we wonder if there is a smooth way to incorporate graph algorithms into Dgraph queries or, more generally, whether you have tips on how we can create a recommendation system without much fuss and/or extra cost (except for hosting, given the need for some hefty parallelization)?

Best regards!


I think the current best way to run graph algorithms on Dgraph data is to load your graph / sub-graph into Spark and use one of the existing Spark graph algorithm libraries.

There is not much (graph) computational power implemented in Dgraph, and it does not look like this is anywhere on the roadmap. Which is fine, given you can couple a system that is great at storing and querying graph data with a system that is great at general big-data and graph processing.
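For illustration, here is a minimal sketch of that route with PySpark and GraphFrames, assuming you have exported the relevant sub-graph to JSON files; the paths and columns (`id`, `src`, `dst`, `name`) are placeholders for whatever your export produces:

```python
# A sketch of loading an exported Dgraph sub-graph into Spark and running
# PageRank with GraphFrames. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# GraphFrames must be on the classpath, e.g. started via
# spark-submit --packages graphframes:graphframes:<version>
spark = SparkSession.builder.appName("dgraph-graph-algos").getOrCreate()

vertices = spark.read.json("export/vertices.json")  # e.g. {"id": "0x1", "name": "alice"}
edges = spark.read.json("export/edges.json")        # e.g. {"src": "0x1", "dst": "0x2"}

g = GraphFrame(vertices, edges)

# Plain PageRank; GraphFrames also offers labelPropagation,
# connectedComponents, shortestPaths, triangleCount, etc.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy("pagerank", ascending=False).show(10)
```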


You can calculate amazing stuff with just DQL; no extra library needed. Everything works out of the box thanks to awesome DQL features like value variables, which also sum up along query paths.

Check out that tutorial.

You will be amazed by how easy it is to build an awesome, reliable recommendation system with Dgraph in just a few lines.
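To make that concrete, here is a minimal sketch of the classic co-likes pattern, run through the Python client (pydgraph). The predicates (`user_name`, `likes`, `item_name`) are made up for illustration, and `likes` is assumed to carry the `@reverse` directive so that `~likes` works:

```python
# A sketch of a "users who liked what you liked also liked..." query in DQL,
# run through pydgraph. Predicate names are hypothetical.
import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

QUERY = """
query recs($user: string) {
  var(func: eq(user_name, $user)) {
    seen as likes {
      ~likes {
        candidates as likes @filter(NOT uid(seen)) {
          score as math(1)  # 1 per path; Dgraph sums paths per candidate
        }
      }
    }
  }
  recommendations(func: uid(candidates), orderdesc: val(score), first: 10) {
    item_name
    score: val(score)
  }
}
"""

txn = client.txn(read_only=True)
try:
    res = txn.query(QUERY, variables={"$user": "alice"})
    print(json.loads(res.json)["recommendations"])
finally:
    txn.discard()
stub.close()
```

The value variable `score` is what keeps this terse: every co-like path contributes 1, and the outer block orders candidates by the summed path counts.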


Thank you very much - that makes sense, and it is in fact pretty much the same approach as using the Neo4j Graph Data Science Library (GDS) with Neo4j, since Apache Spark and Neo4j GDS are largely equivalent for this purpose (Spark is even better in terms of existing algorithms, but the concept is the same).

My only concern here is that we are building a knowledge graph, so we need to create embeddings that are always up to date with the current data stored in Dgraph, i.e. the current graph (in the best case, we aim to build a system that can host millions of users). Thus, we need to be able to mutate the graph with embeddings all the time, at a potentially very high frequency. Do you think this would be a bigger problem with this solution (Spark + Dgraph) than with the Neo4j version?
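For reference, the write-back half of that loop is just a batched mutation; a minimal sketch through pydgraph, where the `embedding` predicate (schema type `[float]`) is a made-up name:

```python
# A sketch of pushing freshly computed embeddings back into Dgraph in
# batches. The `embedding` predicate is hypothetical (schema: [float]).
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

def write_embeddings(batch):
    """batch: iterable of (uid, vector) pairs from the offline job."""
    txn = client.txn()
    try:
        txn.mutate(set_obj=[{"uid": uid, "embedding": vec} for uid, vec in batch])
        txn.commit()
    finally:
        txn.discard()  # no-op if the commit succeeded

write_embeddings([("0x1", [0.12, -0.03, 0.88])])
stub.close()
```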

Thank you!

This approach may absolutely be useful for some of the basic things we aim to do, but we will most likely need some more complex/clever algorithms to build what we aim to build. The reason is that we need to create a very good knowledge graph, which most likely requires embeddings that humans cannot interpret (e.g. we most likely need clustering algorithms, GraphSAGE, Personalized PageRank, etc.). I really don't think I want to implement these algorithms from scratch in DQL.

Disclaimer: my answer might stem from a lack of knowledge of DQL's features and possibilities, so I'd be happy to be corrected :blush:
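Of the algorithms named above, Personalized PageRank at least is available off the shelf in GraphFrames; a self-contained sketch, using the same hypothetical JSON exports as earlier in the thread:

```python
# A sketch of Personalized PageRank with GraphFrames. The exports and the
# source vertex id "0x1" are hypothetical.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("ppr").getOrCreate()
vertices = spark.read.json("export/vertices.json")  # columns: id, ...
edges = spark.read.json("export/edges.json")        # columns: src, dst

g = GraphFrame(vertices, edges)

# Biasing the random walk towards one source node gives per-user rankings,
# which is the flavour of PageRank usually used for recommendations.
ppr = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="0x1")
ppr.vertices.orderBy("pagerank", ascending=False).show(10)
```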


I’ll be honest: I think the right path here is Neo4j. To do this in Dgraph, what you’ll do (which is what I’ve done) is run the computations in Apache Spark, write those tables out as a JSON file, then write an importer that puts them into your Dgraph. Also, a lot of the clustering and other algorithms you want are not implemented in Apache Spark.
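For anyone landing here later, a rough sketch of that round trip; the paths and the `pagerank` predicate are illustrative:

```python
# A sketch of the Spark -> JSON -> importer round trip described above.
# Paths and the `pagerank` predicate are illustrative.
import glob
import json

import pydgraph
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("export-scores").getOrCreate()
scores = spark.read.parquet("scores.parquet")  # e.g. columns: uid, pagerank

# 1. Write the computed table out as JSON (one object per line).
scores.coalesce(1).write.mode("overwrite").json("out/scores")

# 2. Import: set the computed score back on each node in Dgraph.
stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

rows = []
for path in glob.glob("out/scores/part-*.json"):
    with open(path) as f:
        rows.extend(json.loads(line) for line in f)

txn = client.txn()
try:
    txn.mutate(set_obj=[{"uid": r["uid"], "pagerank": r["pagerank"]} for r in rows])
    txn.commit()
finally:
    txn.discard()
stub.close()
```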

This blog is great. :+1:


Thank you, sir!

We'll see - we still might need the speed advantage and/or the better scalability of Dgraph over Neo4j, so we might end up building our own set of models with TensorFlow/Keras instead. But yeah, it seems that Neo4j is the way to go for now due to the simplicity of development - we can just create simulations to test for latency and scaling problems, and that should be a rather easy job :blush:
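In case it is useful to others, such a latency simulation can be as small as this sketch; the endpoint, query, and load parameters are placeholders:

```python
# A minimal latency-simulation sketch: fire N concurrent queries against a
# Dgraph Alpha HTTP endpoint and report percentiles. Endpoint, query, and
# concurrency are placeholders to tune for your own setup.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8080/query"  # hypothetical Dgraph Alpha address
QUERY = "{ q(func: has(name), first: 10) { name } }"

def one_request() -> float:
    start = time.perf_counter()
    requests.post(ENDPOINT, data=QUERY, headers={"Content-Type": "application/dql"})
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(lambda _: one_request(), range(1000)))

print(f"p50={statistics.median(latencies) * 1000:.1f}ms "
      f"p95={latencies[int(0.95 * len(latencies))] * 1000:.1f}ms")
```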