How to efficiently find shortest and longest paths between node types in Dgraph (similar to APOC in Neo4j)?

I’m trying to find the shortest and longest path between two node types across the entire graph in Dgraph, similar to how it’s done using APOC in Neo4j.

In Neo4j, I can use a single query like this:

MATCH (start: `ntype1`)
CALL apoc.path.expandConfig(start, {
    labelFilter: '>ntype2',
    minLevel: 1,
    uniqueness: 'NODE_GLOBAL'
})
YIELD path
RETURN min(length(path)) as min, max(length(path)) as max

This efficiently returns the shortest and longest path lengths between any nodes of type ntype1 to ntype2 in one go.


What I’m doing in Dgraph:

Since Dgraph doesn’t support such a direct query, I’m first fetching all UIDs of source and target node types, and then looping over combinations to run multiple shortest queries:

combined_query = f"""
{{
  sources(func: type({ntype1})) {{ uid  RELATED_TO{{uid}} ~RELATED_TO{{uid}} }}
  targets(func: type({ntype2})) {{ uid  RELATED_TO{{uid}} ~RELATED_TO{{uid}} }}
}}
"""
result = dgraphManager.query(combined_query)

source_uids = [x['uid'] for x in result['sources'] if 'RELATED_TO' in x or '~RELATED_TO' in x]
target_uids = [x['uid'] for x in result['targets'] if 'RELATED_TO' in x or '~RELATED_TO' in x]

uid_list = [{"from": s, "to": t} for s in source_uids for t in target_uids]

query_parts = []
result_parts = []
for i, item in enumerate(uid_list, 1):
    query_parts.append(f"""
        var(func: uid({item['from']})) {{
           start{i} as uid
        }}
        var(func: uid({item['to']})) {{
           end{i} as uid
        }}
        path{i} as shortest(from: uid(start{i}), to: uid(end{i})) {{ RELATED_TO ~RELATED_TO }}
    """)
    result_parts.append(f"result{i}(func: uid(path{i})) {{ uid }}")

final_query = "{\n" + "\n".join(query_parts) + "\n" + "\n".join(result_parts) + "\n}"
data = dgraphManager.query(final_query)

path_lengths = [len(val) for key, val in data.items() if val]

Problem:

This approach works, but it’s very inefficient when handling numerous nodes. For example, if ntype1 has 100 nodes and ntype2 has 50 nodes, it results in executing 5,000 queries — just to determine path lengths between two types. Sometimes, it causes errors related to query timeout.


Question:

Is there a more efficient way in Dgraph to compute:

  • The shortest path between any node of type A to any node of type B
  • The longest such path

Ideally similar to how APOC works in Neo4j, with just one query.

Any insights or optimizations would be highly appreciated!

As you’re acutely aware, the shortest function only operates on uids…

Since you’re using Python, you may want to check out the extract_dict function in pydgraph. It was built specifically for converting DQL query results to a dict representing nodes and a list representing edges. Function: link. Test: link

Once you have those two artifacts, you can use them in networkx or any other graph algo package you like to accomplish your objectives.

Here’s a notebook that uses a locally defined extract_dict (it was later moved to pydgraph) and moves the results into networkx and then performs a number of graph analyses on the data.

Thank you for your response. I appreciate the clarity you’ve provided. I’m interested in understanding how the extract_dict function, which recursively extracts nodes and edges from a dictionary generated by a Dgraph query, can be utilized to determine both the shortest and longest paths between two nodes. Your insights on this would be very valuable.

Sorry for the delay. I extended a notebook in our dgraph-experimental repo that explores this in detail. Here’s the link to the notebook. If you have Docker and Jupyter Notebooks installed, you can run this locally. If not, I’ve included the output of the notebook in the commit, so you can see the output just by loading the file in your browser.

As I stated earlier, the sole native path function in Dgraph (shortest) relies on knowing the UIDs from two separate nodes. So natively running queries across paths in the entire graph is not possible. However, using pydgraph and networkx you can perform all sorts of graph analysis, not just path calculations. But since the topic is about shortest and longest shortest, here’s a brief overview. I definitely would recommend looking at the notebook linked above tho, as this is just a quick summary… Note that this notebook uses our stock Friend of a Friend graph, in which there are Person nodes, connected to other Person nodes via a Person.friends predicate.

  1. Load nodes and edges from Dgraph, convert and import to networkx
import json, pydgraph, networkx as nx

# Query Dgraph for Person nodes and friendships
client = pydgraph.DgraphClient(pydgraph.DgraphClientStub('localhost:9080'))
query = """
{
  person(func: type(Person)) @recurse {
    id: uid
    name: Person.name
    friends: Person.friends
  }
}
"""
txn = client.txn(read_only=True)
res = txn.query(query=query)

nodes = {}
edges = []
# Here we use the pydgraph `convert function to extract nodes and edges from the DQL result
convert.extract_dict(nodes=nodes, edges=edges, data=json.loads(res.json))

nodes_df = pd.DataFrame.from_dict(nodes, orient = 'index')
edges_df = pd.DataFrame(edges)

G = nx.from_pandas_edgelist(
    edges_df,
    source="src",
    target="dst",
    create_using=nx.MultiDiGraph()
)
  1. Find longest shortest path

Since friendship graphs are cyclic by nature, we can’t use nx.dag_longest_path(), which is only for acyclic graphs. Instead, we use the graph diameter - the longest shortest path between any two nodes, representing maximum degrees of separation:

# Find longest shortest path (diameter cmp approach)
longest_path = []
max_length = 0

for source, targets in nx.shortest_path(G).items():
    for target, path in targets.items():
        if len(path) > max_length:
            max_length = len(path)
            longest_path = path

# Display with readable names
path_names = [node_id_to_name.get(node_id, node_id) for node_id in longest_path]
print(f"Diameter: {max_length-1} edges")
print(f"Path: {' -> '.join(path_names)}")
1 Like

Thank you, @matthewmcneely, for providing this information; it is greatly appreciated. The system is performing effectively for the majority of the data we are analyzing. However, I have identified a few inconsistencies that need to be addressed.

Here is the code we are currently using:

import networkx as nx
from pydgraph import convert
from typing import List


class NetworkXPathAnalyzer:
    def __init__(self):
        self.graph = None
        self.node_types = {}

    def load_graph_from_dgraph(self,
                               node_types: List[str],
                               edge_predicates: List[str],
                               max_depth: int = 10) -> nx.MultiDiGraph:
        """
        Load graph data from Dgraph into NetworkX for analysis.
        More efficient than multiple shortest path queries.
        """

        # Build comprehensive query using @recurse
        type_filter = " OR ".join([f"type({nt})" for nt in node_types])
        edge_filter = "\n        ".join(edge_predicates + [f"~{ep}" for ep in edge_predicates])

        query = f"""
        {{
            nodes(func: has(dgraph.type)) @filter({type_filter}) @recurse(depth: {max_depth}) {{
                id:uid
                dgraph.type
                {edge_filter}
            }}
        }}
        """

        print("Loading graph data from Dgraph...")
        res = dgraphManager.query(query) # Our method to run query

        # Use pydgraph's extract_dict to convert to nodes and edges
        nodes = {}
        edges = []
        convert.extract_dict(nodes=nodes, edges=edges, data=res)

        # Create NetworkX graph
        self.graph = nx.MultiDiGraph()

        # Add nodes with type information
        for node_id, node_data in nodes.items():
            node_type = node_data.get('dgraph.type', ['Unknown'])[0] if isinstance(
                node_data.get('dgraph.type'), list) else node_data.get('dgraph.type', 'Unknown')
            self.graph.add_node(node_id, node_type=node_type)
            self.node_types[node_id] = node_type

        # Add edges
        for edge in edges:
            self.graph.add_edge(edge['src'], edge['dst'], predicate=edge.get('predicate', 'unknown'))

        print(f"Loaded graph with {self.graph.number_of_nodes()} nodes and {self.graph.number_of_edges()} edges")
        return self.graph

    def find_paths_between_types(self,
                                 source_type: str,
                                 target_type: str,
                                 max_paths_to_analyze: int = 10000,
                                 count_self_loops_as_one: bool = True):
        """
        Find shortest and longest paths between node types using bidirectional_dijkstra.
        Much more efficient than multiple Dgraph queries.
        """

        if self.graph is None:
            raise ValueError("Graph not loaded. Call load_graph_from_dgraph first.")

        # Get nodes of specified types
        source_nodes = [node for node, data in self.graph.nodes(data=True)
                        if data.get('node_type') == source_type]
        target_nodes = [node for node, data in self.graph.nodes(data=True)
                        if data.get('node_type') == target_type]

        if not source_nodes or not target_nodes:
            return {"error": f"No nodes found for types {source_type} or {target_type}"}

        # Calculate all shortest paths using bidirectional_dijkstra
        all_path_lengths = []
        paths_analyzed = 0

        for source in source_nodes:
            if paths_analyzed >= max_paths_to_analyze:
                break

            for target in target_nodes:
                if paths_analyzed >= max_paths_to_analyze:
                    break

                try:
                    # Handle self-loops
                    if source == target:
                        if count_self_loops_as_one:
                            all_path_lengths.append(1)
                        else:
                            all_path_lengths.append(0)
                        paths_analyzed += 1
                        continue

                    # Use bidirectional_dijkstra for efficient pathfinding
                    path_length, path = nx.bidirectional_dijkstra(self.graph, source, target)

                    # bidirectional_dijkstra returns (length, path)
                    # For unweighted graphs, length equals number of edges
                    all_path_lengths.append(int(path_length))
                    paths_analyzed += 1

                except nx.NetworkXNoPath:
                    # No path exists between these nodes
                    continue
                except Exception as e:
                    print(f"Error processing path from {source} to {target}: {e}")
                    continue

        if not all_path_lengths:
            return {
                "shortest_path_length": 0,
                "longest_path_length": 0,
                "message": f"No paths found between {source_type} and {target_type}"
            }

        # Calculate comprehensive statistics
        return self._calculate_path_statistics(all_path_lengths, source_type, target_type)

    def _calculate_path_statistics(self, path_lengths: List[int],
                                   source_type: str, target_type: str):
        """Calculate comprehensive path statistics."""

        return {
            "source_type": source_type,
            "target_type": target_type,
            "shortest_path_length": min(path_lengths),
            "longest_path_length": max(path_lengths),
        }

Specifically, in both the query and the code, we expected the shortest path length to be 5, but the result returned was 8. Furthermore, when adjusting the depth of the @recurse query (e.g., increasing from 10 → 20 → 30 or decreasing from 10 → 5 → 2), the shortest length reported came back as 7, whereas we were anticipating 5. we tried both methods of networkx 1st shortest_path_length and 2nd bidirectional_dijkstra Both methods return 7 and 8, respectively, but not 5 as expected.

It seems the path calculation is being influenced by something we may not be accounting for correctly—either in the query recursion depth, how the edges are being extracted/added to the graph, or in the way NetworkX handles shortest paths within this setup.

It’s crucial that we clarify these discrepancies, and I am confident that addressing them will improve the accuracy of the results. I’d greatly appreciate any insights or suggestions on what might be causing the divergence between the expected and actual shortest path values.

Thank you again for your guidance and support!