Gremlin Support in DGraph

Gremlin is a graph traversal language. Gremlin makes use of Pipes to perform complex graph traversals.

Here are few examples of gremlin to get an hang of it.

  1. g.v(1); // get vertex by Id
  2. g.V[1…100] // get all vertex with id in range 1 to 100
  3. g.v(1).firstName; // get the attribute of vertex by id
  4. g.V(‘firstName’,‘John’); // get vertex with firstName as john
  5. g.V(‘firsName’,‘John’).count(); //get count
  6. g.v(1).outE(‘friend’); // outgoing friend edges
  7. g.v(1).inE(‘friend’); // incoming friend edges
  8. g.V.and(_().has(“age”, T.gt, 25), _().has(“age”, T.lt, 35)); // find all people with age between 25 to 35
  9. g.V.interval(“age”, 25, 35); // same as above
  10. g.V.has(‘email’, null) // people who don’t have email addresses

In addition to above gremlin supports graph manipulation queries adding edges, nodes and creating indexes.

But any graph algorithm is not just about getting data. Its about seeing how paths intersect/overlay/loop amongst themselves and each other. Paths tell you a lot about the “structure” of your data — topological statistics. PageRank, Betweenness, Eccentricity, Eigenvectors, Recommendations, Spreading Activation, …. these algorithms are all about the paths, not the data.

Gremlin/Cypher allows you to specify such queries which we can’t do in graphQL.
http://tinkerpop.apache.org/docs/3.2.1-SNAPSHOT/recipes/#shortest-path

g.V(1).repeat(out().simplePath()).until(hasId(5)).path().limit(1)
The traversal starts at vertex with the identifier of “1” and repeatedly traverses on out edges “until” it finds a vertex with an identifier of “5”.

Here simplepath() ensures that traverser doesn’t repeat a path in the graph.

There are many types of traversals supported by gremlin.

Gremlin also supports declarative form of query in addition to imperative form.
example: “Who created a project named ‘lop’ that was also created by someone who is 29 years old? Return the two creators.”

g.V().match(
__.as(‘creators’).out(‘created’).has(‘name’, ‘lop’).as(‘projects’), //(1)
__.as(‘projects’).in(‘created’).has(‘age’, 29).as(‘cocreators’)). //(2)
select(‘creators’,‘cocreators’).by(‘name’)

explanation: Find vertices that created something and match them as creators, then find out what they created which is named lop and match these vertices as projects. Using these projects vertices, find out their creators aged 29 and remember these as cocreators. Return the name of both creators and cocreators.

Gremlin is both a language and VM just like java(jvm). All the graph languages supported by gremlin get converted to gremlin bytecode which is executed by gremlin VM.
http://tinkerpop.apache.org/docs/3.2.3/reference/#traversal (Instruction set for gremlin bytecode)

With tinkerpop enabled stack, any tinkerpop enabled graph language providers can use DGraph. Any tinkerpop enabled graph processors(spark or hadoop giraph) can be used to run OLAP queries over DGraph like finding page rank etc.

A Gremlin traversal machine has a collection of traversal strategies. Some of these traversal strategies are specific to Gremlin (e.g. optimization strategies) and some are specific to the underlying graph system (e.g. provider optimization strategies). Gremlin-specific traversal strategies rewrite a traversal into a semantically-equivalent, though (typically) more optimal form. Similarly, provider-specific strategies mutate a traversal so as to leverage vendor-specific features such as graph-centric indices, vertex-centric indices, push-down predicates, batch-retrieval, schema validation, etc. Now that a traversal is represented in the graph as vertices and edges, Gremlin can traverse it and thus, rewrite it.

A classic example of provider specific strategy is index lookups. g.V().has(‘name’,’marko’) can be a single index call as opposed to iterating out all vertices and filtering one by one.

The implementation of TinkerPop’s core API(java) and its validation via the gremlin-test suite is all that is required of a graph system provider wishing to provide a TinkerPop3-enabled graph engine. The core api consists of

  1. Structure API: Graph, Element, Vertex, Edge, Property and Transaction (if transactions are supported).
  2. Process API: TraversalStrategy instances for optimizing Gremlin traversals to the provider’s graph system (i.e. TinkerGraphStepStrategy).

I was looking through various ways in which DGraph can be integrated with tinkerpop Stack.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/gremlin-users/fduLAQeCSF4/CHfjwPHvBQAJ is discussion on gremlin-users group regarding this.

Option1: The implementation of TinkerPop’s core API would be a java client that
optimizes and translates gremlin steps into Graphql and pass to dgraph. Here we can take
advantage of gremlin’s parser and just need to implement traversal strategy.

This is used by sqlg.(https://github.com/pietermartin/sqlg)
Everything can’t be expressed in subgraph format.

Option2: Implement RemoteConnection interface in dgraph which accepts gremlin bytecode via GraphSON(json)
and let dgraph internally convert it to subgraph and process the query.

Same issues as 1.

Option 3: Wrapping the DGraph in a Java API that implements TinkerPop’s structure (there would be an issue of java-go binding though) and we run a gremlin server on the host for remote connections. But traversal here would be controlled by gremlin through java api’s. Gremlin Server is part of tinkerpop stack which accepts gremlin bytecode from gremlin language variants and does the traversal. Without gremlin server there would be too many network calls. This option is normal implementation of their core API. Alternatively if we connect to database from client directly without Gremlin Server it would require better strategies for bulk operations to avoid network calls.

we would immediately be TinkerPop-compliant and over time we can create more/better strategies to do bulk operations and limit network bandwidth.
We won’t have much control over traversal, take the example of simplePath(), since in Dgraph predicates are sharded, Dgraph can find a path between two nodes with only one network call since all the edges would reside on a single machine.

For simple queries there is no issue in letting gremlin control the traversal i think.
According to Marko, You can always implement optimizations in Gremlin. Gremlin’s compiler model (TraversalStrategies) is very flexible

Edit:
Latest Comment from marko:
Yes! That is the point. Implement the TinkerPop structure API first and Gremlin will just work. Stage 1 complete. At this point, Apache TinkerPop will have full control of the execution. However, when you want to optimize, you write a strategy. This will delegate certain aspects of the traversal’s execution over to the vendor. For things the vendor can do better than Apache TinkerPop, it does. For things that the vendor can’t do better, Apache TinkerPop handles it.

I don’t know how many strategies Sqlg has, but I believe DSEGraph has 3 — one for global index use (g.V().has(’name’,’marko’)), one for vertex-centric index use (outE().has(‘time’,gt(2001))), and one for batch “get” of vertex properties.

HTH,
Marko.

It seems tinkerpop is flexible enough to let the graph vendor control some part of the traversal.

Option 4: Implementing gremlin machine in go.
Too tedious and very big project, if we are going this path, Marko Rodriguez has decided to help in implementation gremlin mini.

2 Likes

Here’s a link to the presentation on gremlin traversal machinery.

We can do all of the rest, but can’t do this one. We don’t automatically generate reverse edges. So, we can’t do inE.

We also can’t do this. To check for negations require looking through everything, which isn’t a possibility in a scalable system.

Yeah. Finding shortest path would need to be added to our version of GraphQL. Had a chat with the author of GraphQL, Lee Byron, and I think we’ll definitely have to extend and modify GraphQL for our usage. So, we could borrow concepts from both Gremlin and Cypher to make such things happen.

The schematics of the query is a bit weird. Both the in and the out are named created. But, this is something we can do with our existing system.

Interesting. So, this gremlin VM, is essentially Java based, and requires to be run as a Java program, yeah?

Very interesting thread! Trying to understand the options here. I think we should have a chat, sometime soon – so I can understand this better.

Overall, sounds like you made good progress! Nice work!

This makes it sound like DGraph is unable to find Vertices that originate an edge to some other Vertice.
Here are some example questions to make this case clearer.

  • In a Twitter setup:
    • List your followers
  • In a web page graph
    • List the pages that link to this page
  • In a Facebook graph
    • List people who have created friend requests to you

Is it correct that Dgraph does not allow these type of questions to be asked of the graph?

1 Like

In all these cases that you mention, @Astn, an edge in that direction would need to be present.

In Twitter, when you follow someone, essentially 2 edges need to be created: ME --follows--> YOU, and YOU --followed-by--> ME. We don’t assume or try to create the reverse edge. In my experience with graphs, a blind reverse edge generation can lead to many scalability problems. For e.g., you want to tag every person as a person. ME --type--> PERSON. If we generate a reverse edge for every such instance, it would blow up the number of things attached to PERSON, and won’t be very useful.

We could, though, have helper functions around this, where you can specify which predicates need to have reverse edges, and what should they be called; and we can have some automation around that. But, as a design choice, our edges are unidirectional.

Again, this is a lot easier to solve when you store the data as such. For e.g., with web pages, if you encounter a new page, store the outgoing link as from page -> to; and as to -incominglink-> from.

Having said this, if there is demand for such a feature, we could have an option; but it would need to be turned on, with the understanding that the disk usage would be doubled at least, and could blow up some of your indices; like the person example I mentioned, where one vertex could point to theoretically 7 billion other vertices.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.