I have several million pairs of part numbers. In each pair, the first part number is the “parent” part and the second is the “child”; the relationship is, of course, one to many. By loading these into a graph, I can take a top-level parent and recursively process it to produce a tree representing the assembly structure of the product.
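For illustration, a few such parent/child pairs written as graph edges (the part numbers and the has_part predicate name here are made up):
<P100> <has_part> <P200> .
<P100> <has_part> <P300> .
<P200> <has_part> <P400> .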
So in S-P-O terms, it would be something like “Part Has Part”. I haven’t seen any recursive query capability in the docs. In Gremlin I’d use “repeat()”; in a SQL database, a WITH RECURSIVE CTE.
While repeat() can take a max depth, it should also be able to traverse until all leaf nodes are found.
In Gremlin, here is one I had used:
g.V().has('name','Q1').repeat(out().simplePath()).emit(__.not(out().simplePath())).path().by('name')
==>[Q1, Q2, Q3, Q4]
In this case it produces the “path” (from the starting node to a leaf node).
Also needed is cycle detection, in order either to avoid endless loops or to detect them so they can be fixed.
While Oracle’s implementation is non-standard SQL, it is quite complete and easy to understand; this page shows what can be done: [here]
# To traverse the friends and relatives edges
# completely and get all their names.
{
  recurse(id: <start>) {
    friend
    relative
    name # This is a leaf node.
  }
}
# To traverse the predicates 10 times
# and get their names.
{
  recurse(id: <start>, depth: 10) {
    has
    name
  }
}
We will avoid visiting the same node twice, and will return the entire subtree as the result.
{
  recurse(id: <start>) {
    friend { name }
    relative { name, age }
  }
}
Only one more level of nesting is allowed in the query, which helps in knowing which attributes to fetch.
This will be easier to implement, since we know which attributes are leaves by looking at the query,
rather than in the previous query's case.
As the recursive traversal proceeds, wouldn’t it be simpler to permit only a single edge type (predicate)? If multiple types were needed, I think a union of result sets would suffice, which I think you already have.
Since an edge type could easily have multiple node types on the to-side (object side?), you’d want a filter to restrict the traversal to just the node types desired.
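Purely as illustrative pseudo-syntax (not existing Dgraph syntax; has_part and the type predicate are made up), the kind of thing I mean:
{
  recurse(id: <start>) {
    has_part @filter(eq(type, "assembly")) # hypothetical: only follow edges leading to "assembly" nodes
    name
  }
}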
I liked the depth constraint.
While you’d want to avoid visiting the same node twice to avoid infinite loops, you also need a way to detect such loops, so perhaps an option to flag such a node. Sometimes this is a data quality problem. But other times (like an airplane round-trip flight), it is “normal”. In the latter case, it is the desired result and the terminating condition for the traversal.
Once the traversal is done, the results need a “projection” so that the returned nodes and edges carry the needed attributes. For example, on the nodes you might need the flight number or part number; on the edges, you’d want the predicted flight time or quantity, respectively.
Thanks, looking nice! I have a nice set of data I’d like to try with this when it’s ready!
This can be returned as a list separately. Would that work? We can have an option, say flagdup, which can return these nodes if requested. Let me know what you think.
[quote=“mandolyte, post:7, topic:1306”]
Since an edge type could easily have multiple node types on the to-side (object side?), you’d want a filter to restrict the traversal to just the node types desired.
[/quote]
This projection is part of the recursion. In Dgraph the values are also stored as nodes. So if, for example, you want the flight number, you’d specify the corresponding predicate within the recurse query.
{
  recurse(id: <flight-id>) {
    destination
    connecting-flight
    name # value
    flight-number # value
  }
}
@mandolyte Support has been added in master. Any feedback on it with your data would be great! We’d love to help you load the dataset and run some queries on it if it’s alright with you!
I have a small dataset that I used with Gremlin/Titan. I’ll find it next week. Sigh… wish my day job gave me more time to explore this… I’m excited to see a graph database implemented in Go. I’ve always thought such an implementation could evolve faster than JVM based ones… So far, I think I’m right!
I found a tiny, self-contained toy SQL query that can be run with SQLite3. It illustrates the primary requirements for recursion: level (or depth) numbering, path generation, and cycle detection. Here is the output:
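(Shown as sqlite3’s default list output with .headers on.)
level|parent|node|path|cyclic_flag
0|A|B|/A/B|0
1|B|C|/A/B/C|0
2|C|D|/A/B/C/D|0
3|D|A|/A/B/C/D/A|1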
You can just paste this SQL into SQLite3 to get the above output:
with dataset as
(
    -- toy edge list: A -> B -> C -> D -> A (the last edge closes a cycle)
    select 'A' as parent, 'B' as node union all
    select 'B' as parent, 'C' as node union all
    select 'C' as parent, 'D' as node union all
    select 'D' as parent, 'A' as node
),
hierarchy( level, parent, node, path, cyclic_flag )
as
(
    -- anchor member: start the traversal at parent 'A'
    select 0 as level,
           dataset.parent,
           dataset.node,
           '/' || dataset.parent || '/' || dataset.node as path,
           0 as cyclic_flag
    from dataset
    where dataset.parent = 'A'
    union all
    -- recursive member: follow child edges, extending the path
    select
           hierarchy.level + 1 as level,
           dataset.parent,
           dataset.node,
           hierarchy.path || '/' || dataset.node as path,
           -- the length/replace expression counts occurrences of the node within the path;
           -- more than one occurrence means this edge closes a cycle
           case
               when (length(path || dataset.node) - length(replace(path || dataset.node, dataset.node, ''))) / length(dataset.node) = 1
               then 0
               else 1
           end as cyclic_flag
    from hierarchy
    inner join dataset
        on dataset.parent = hierarchy.node
    -- stop expanding once the current node already appears twice in the path
    where (length(path) - length(replace(path, hierarchy.node, ''))) / length(hierarchy.node) < 2
)
select *
from hierarchy
order by path
;
I hope to spend some time next on testing this with your latest (tip) changes. This toy example can serve as a functionality test; I’ll work up a larger dataset for a performance test.
Since I don’t know much about DGraph yet, I need to spend some serious time with your tutorials and documentation before I can get far on my own.
This is the current output. We’re working on tagging the node that created the cycle with a flag (so node A would be marked). I hope this is close to the example you specified with SQL.
The following omissions would, at present, all require recursively parsing the returned JSON value, which sort of defeats the purpose of having the graph engine add its value and smarts. Thoughts on these?
depth
cycle detection
paths: paths are needed in ANSI SQL in order to keep the parent/child rows together; Oracle’s CONNECT BY method doesn’t need this. As a graph engine, Dgraph may also work more like Oracle’s implementation, keeping the parent and children together. But even so, generating the path is very useful in many reporting outputs. In Gremlin this is done by the “path()” step.
Over the weekend, I did some reading and have begun to see that the differences between a normal directed property graph and an RDF-style graph are substantial. I’ll ask about this on another thread so that discussion isn’t buried here.
Without the recursive reply, you’ll lose the parent-child connection. There is a @normalize directive though, which you can typically use to flatten out the graph response that Dgraph returns.
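Roughly, the idea looks like this (a sketch only, exact syntax may differ; with @normalize, only the aliased predicates show up in the flattened result):
{
  recurse(id: <start>) @normalize {
    friend
    friendName: name # the alias is what appears in the flattened output
  }
}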
Finally have some datasets I am ready to use. I’m tempted to wait until Dgraph is “go gettable”, because our firewalls prevent me from doing a normal install at work.
I have a smaller dataset I can use at home - small enough to mail.
There is a bulk loader, right? Can you point me to it? I plan to work through the process at home, then take on the larger dataset at work when I solve the install issues. Hopefully I’ll have some feedback for you sometime next week.