For anyone who is interested in learning about Dgraph technology, we have a Dgraph paper available for download. The paper describes in fair detail the internals of Dgraph. However, I was surprised to find that we didn’t have any blogs or articles around the basics of Dgraph, providing much-needed context to the dense academic language. Today, that ends! This is the start of a new series of blog posts that will use the Dgraph paper as a guide, providing background, context, and basic principles for the topics in each section of the paper.
Let’s dive into the fundamentals of the technology behind Dgraph.
What is a Graph?A graph is made up of vertices and edges. To make things a bit more concrete, you may think of a vertex (or node) as a dot drawn on paper. When you have multiple nodes on paper, you may connect any two nodes with a line. That line is called an edge. Sometimes there can be multiple edges from one node to another.
Edges may have arrows (directed edges) or have no arrows (undirected edges). A directed edge represents a one-way relationship. An undirected edge simply represents the existence of a relationship. No, you did not read that wrong. An undirected edge DOES NOT represent a two-way relationship.
A graph is an abstract notion. A graph can represent many things.
For example, the same graph above can represent a social network. Each node is a person, and each edge is a friendship. Here I use an undirected edge representing the existence of a friendship.
The same graph can also represent molecular bonds (first person to figure out the chemical wins admiration from me).
As you can see, graphs allow for very flexible representations of things. Many things and the relationship between them can be represented as graphs.
How are Graphs RepresentedIn the section above, a graph is represented as a drawing. Unfortunately, computers do not reason well with drawings. So we have to represent them in ways that a computer could understand them. There are many ways graphs may be represented. In this post, I'll start with an abstract representation that is closest to the way Dgraph represents them. First, we'll have to label the nodes. The most convenient labeling scheme is to use numbers. Edges are directed edges, so there is a "from" and a "to".
Now let’s consider how the nodes are connected, and we’ll list them out:Node From Node To 1 2 1 16 2 3 2 7 3 4 3 17 4 5 4 18 5 6 5 19 6 1 6 20 7 8 7 22 7 21 8 9 8 10 9 11 9 12 9 13 10 14 10 15
Now we have a list of pairs of nodes. This is commonly known as an edge list. As you can see, it’s simply a list of edges. If we have multiple edges, then we will need a third column, indicating the kind of edge. For now, the edges are unlabelled.
What exactly then are the nodes? The nodes are simply numbers.
Data and Graphs
You may notice that in the graphs described in the sections above, neither the node nor edges carry any data. All we had was a labeling of the nodes with a number. Sometimes, this is all the information that is required. For example, if we are building a map application like Google Maps or Apple Maps, all we need are the labeled nodes representing areas of interests and the edges representing traversable paths (e.g. roads if by car, sidewalk if by walking).
However, Dgraph is a graph database. We want to store data in graphs. What does it mean to store data in graphs? From now on, I will use a simpler version of the graph above, as we will be drawing a lot more things.
Storing Data in the NodeOne way of storing data is to store data in the node itself. For example, let's say a node represents a person. A person has a name and age. So we may store the data in the node. So instead of a number, we turn our node into a data structure - a list of key:value pairs that hold the name and edge.
The downside to this sort of representation is that it is very rigid. If you want to add more fields - for example, we also want to note what language each person speaks - we’d have to regenerate the whole graph.
Storing Data Elsewhere
The alternate way is to reuse the numbers as an identifier, and then we can look up the data elsewhere. Let’s take a look at the edgelist (I’ve truncated it for readability):Node From Node To 1 2 2 3 2 7 3 4
Now let’s give these nodes names and ages and put them in a lookup table:Node Name Age 1 Charles 20 2 Chloe 20 3 Chris 20 4 Claire 20 7 Calvin 20
The downside to this sort of representation is that when you want to query the data, you would need to find the data elsewhere. This extra step slows things down. However, this is still a viable approach. Some popular graph databases store their data this way.
Storing Data in Edges
If it’s not a good idea to store data in nodes, we might then look at storing data in edges. However, upon inspection of the edgelist, it would seem very silly.
Looking at the table once more:Node From Node To Data 1 2 ??? 2 3 ??? 2 7 ??? 3 4 ???
Observe that in the data, there are many repeats. There are three copies of “Chloe” and two copies of “Chris”.
The downside to this sort of representation is that you would need a lot of memory to store very little data. Furthermore, updating the data would require updating in many places.
What Dgraph chose to do is to store data as edges. But in order to fully explain this, I’d first need to explain predicates.
Predicates: The Type of Edge
I had alluded to this earlier - a graph may have multiple edges between two nodes. Thus we would have to give names to each edge. Take the social network example. Previously we had determined the edges to be a friendship. Now, let’s change what an edge represents to “knows” - so if there is an edge between Alice and Bob, then we can say “Alice knows Bob”. We’ll also change the edge to be a directed edge.
Now human relationships are subtle and many-layered. There are many types of relationships. Imagine the social network example above represents a social network of authors. Authors influence each others’ works all the time. So another type of relationship could be “influences”. In fact let’s use that:
In the graph above, black edges represent “knows” while red edges represents “influences”. So Chris knows Claire. And Claire influences Chris.
Most importantly, this graph has two TYPES of edges:
In Dgraph, we call the type of edges a Predicate.
Storing Data as Edges
Having introduced the notion of a predicate, we can now press on to see how Dgraph stores data.
Recall that we may store data in the table that defines a graph. But storing data in an edge is wasteful. However, if we make the properties of a node into an edge, then things would change quite dramatically.Still using the example of the social network, now let's turn the properties "Name" and "Age" into edges. The tables are gone because we're no longer storing the nodes with their properties. So we replace them with dots: Blue edges represent the "Name" predicate. Purple edges represent the "Age" predicate. Black edges represent the "knows" predicate. Red edges represent the "influences" predicate. This looks VERY complicated. However, it’s simpler when viewed as a table (I’ve truncated the table to ensure that it’s readable): Node From Node To Data 1 2 1 23 20 1 24 Charles 2 23 20 2 25 Chloe 2 3 3 23 20 3 26 Chris … … … Here we see there are some edges that carry data and some don’t. Those edges that carry data only carry the data of the “To” node. Congratulations. You now have a rough understanding of the primary data structure that powers Dgraph: the Posting List. An in-depth analysis of the Posting List will be done later in this blog series. The next blog post will talk about types and why we need them. Here’s a fun fact: the footnotes form a graph. See if you can figure it out! Tags graph representation node edge graph database basics graph DB 101
This is a companion discussion topic for the original entry at https://dgraph.io/blog/post/graph-basics/