Finding orphaned subtrees

We want to use a graph database to store the data model of documents, in our system. The data model for each document is be a single rooted tree. So in our graph database there would be hundreds of millions of single-rooted trees, one per document.

Each tree can be of arbitrary depth and arbitrary shape, though it is always a tree. Typically a document tree would consist of thousands of nodes. The nodes in the tree for a document have a document ID property, whose value is the ID of the document. The document ID is a UUID.

Occasionally and purposefully, a branch of the tree for a document is orphaned from the main tree for that document. This happens when content is deleted. At the time content is deleted, the branch corresponding to that content is simply un-parented from the tree. We do this for good reason, which I won’t go into. i.e. We cannot delete the branch at that point.

At a later time we want to garbage collect and remove these orphaned branches from the database. Essentially we want to query for nodes whose document ID is x, and that do not have a parent. And then delete that node and its descendants. i.e. Delete the orphaned branches for document with ID x.

Is that kind of query possible with Dgraph? Can it be done efficiently?

Thanks,
Michael-

You could something like this:

  • store all the potentially orphaned documents in a list
  • for each document in that list do a @recurse query to get all connected documents
  • if no root document is in that new list, then delete all found documents

If you know the root document of the document that is potentially being orphaned, you can do a shortest path query instead (https://dgraph.io/docs/query-language/kshortest-path-quries/) to check if it is still connected, and a recurse query afterwards to find all connected nodes.

I’m not exactly following your suggestion. Finding the orphaned branches is straight forward and efficient in Cypher when running against Neo4j, given that document_id is indexed. Here’s the Cypher query.

		MATCH (e:Entity {document_id: $documentId})
		WHERE NOT exists((e)-[:CHILD_OF]->()) AND e.id <> '#root'
		CALL apoc.path.subgraphAll(e, {
		    relationshipFilter : '<CHILD_OF',
		    optional: true
		}) yield nodes, relationships
                WITH nodes, relationships RETURN nodes, relationships

The root of the tree for every document has the document id of ‘#root’.

The first part of the query finds nodes that are orphaned from the tree for the document with document id $documentId. The second part finds the descendants of those orphaned nodes.

Is there a similar query for Dgraph?

1 Like

You may use “not has()”. If you share a sample(fake) of data I may be able to help. The data is better than a fancy cypher query. The data in JSON is good.

1 Like

Please clarify the intent of selecting node without CHILD_OF relation (WHERE NOT exists((e)-[:CHILD_OF]->()) and then following the relation CHILD_OF ( which should not be there according to the WHERE close), unless I missed something in the cypher query.
Also could you clarify the difference between id and element_id predicates.

As @MichelDiz mentioned, using type and filtering with not has should cover your use case to identify the nodes to start your query, then using a recurse directive with child_of predicate should return your tree.

1 Like

DocumentId is the id of the document the tree belongs to. id, is the id of the individual node. So each node has both a documentId and id.

I will come up with some sample data.