My answer is going to be a bit different from @MichelDiz.
A Thought Exercise
We start with a thought exercise, how would you store a graph? Both as an abstract data structure and as a concrete, stored-on-disk object.
We’ll use a concrete, running example for this graph, expressed nicely in the lovely dot language (the intent is not to confuse, but to illustrate, which is the sole purpose of graphviz
):
digraph{
a -> b;
b -> c;
b -> d;
d -> a
}
This gives a graph that looks like this:
How do we represent and store this graph? I will walk thru some representations and then discuss storage.
Representations: Abstract Data Structures
There are many abstract data structures available to us to store a graph.
Matrices
We can for example, represent the graph as a matrix (in a real matrix, only the numbers are stored, the labels in bold are written here for clarity):
|
a |
b |
c |
d |
a |
0 |
1 |
0 |
0 |
b |
0 |
0 |
1 |
1 |
c |
0 |
0 |
0 |
0 |
d |
1 |
0 |
0 |
0 |
Here, 1
represents a directed edge from row to column. So, to read the matrix, read it row-wise. The row in italics is read:
“b
is not connected to a
; b
is not connected to b
; b
is connected to c
; b
is connected to d
”
There are two ways a graph can be representated as a matrix. What I have listed here is called an incidence matrix.
Adjacency Lists
We can represent the graph as an adjacency list:
a: [b]
b: [c, d]
c: []
d: [a]
This may be implemented as an associative array/map/dict, where the items to the left of the :
are the keys and the items to the right are the values. The values are a list of connected nodes.
Edge Lists
We can represent the graph as an edge list:
{a, b}
{b, c}
{b, d}
{d, a}
Here, each {V,V}
pair represents a (directed) edge in the graph. Having this list allows you to recreate the graph entirely.
Naive Representation
Finally, we also have representation of graphs by pointers. I will use the Go syntax for clarity:
type Node struct{
pointsTo []*Node
}
Storage: Serialization of Data Structures
Having now explained some of the possible ways to represent graphs, let’s look at how graphs are stored. We will consider storage as an act of serializing the data in the data structure in a flat file.
With the exception of the matrices, all the other representations assumes that a node is a pointer to a datatype.
In this section I will also give a lot of examples of what gets written to a flat file. Things written above ----
in the examples represents metadata of the graph (things like node names node values). Things written below the ----
line in the examples represent data of the graph.
Also, for the examples I will go with a convention of having a newline represent a new data type.
Matrices
The matrix is by far the easiest to serialize. A matrix is essentially a flat data structure that has been given special access patterns to mimic structure. For more info, see my talk on generic multidimensional arrays in Go.
In the example above, we’d simply write this array of 1
s and 0
s to a file:
0100001100001000
Additional metadata (such as node information, how many nodes there are in a graph) can be written as additional lines in the flat file. So our flat file might look something like this in its entirety:
nodes:4
names:a,b,c,d
----
0100001100001000
So, if it’s so easy to store, why not use matrices? Because with matrix representations of graphs, it’s very difficult to add new nodes. Every time you add a new node, you need to reallocate a new matrix, and then copy the old matrix into the new matrix. It is inefficient.
Adjacency Lists
How do you store an adjacency list? Well, as the name suggests, it’s a “list”, so there is somewhat a flat structure already.
A naive implementation of adjacency list would look something like this:
type Graph map[*Node][]*Node
type Node struct { name string }
So with this, we see that we have to first serialize a lookup table to identify which node goes where.
One way of serialization is something like this:
nodes:4
names:a,b,c,d
---
0|1
1|2,3
3|0
Let’s take a closer look:
The metadata indicates that there are 4 nodes in this graph. The nodes are named, and indexed by their name, so that node 0 has a name a
, and node 1 has a name b
and so on and so forth.
In the data, we see that the nodes are represented by their numeric IDs. One can quite easily see the adjacency list in the serialization.
The issue with serializing adjacency lists is that the adjacencies may be variable in length. For example, node 1 has 2 listings.
This makes serialization/deserialization a bit harder, as you now have to parse and then split by "|
| and then split by `“,”.
Edge Lists
Edge lists also requires the serialization of metadata. We’ll serialize it in the same way as we did for adjacency lists, leaving a flat file that looks like this
nodes:4
names:a,b,c,d
----
0 1 1 2 1 3 3 9
Here we can see simply parse that one line of data, split by " "
and then read every 2 items as an edge. There is less work to do when deserializing.
The graphviz snippet from right at the top of the page also shows a graph in edge list format.
Naive Representations
Naive representations are the hardest to serialize. Because the inherent structure is not a list, you’d first have to walk the graph to generate a list of nodes, either in BFS or DFS order. Only then can you serialize.
I’ll leave this as an exercise to the reader.
What is Native Graph Storage?
So, if you have gotten to this point, you will have noted that there are many ways to store a graph. Which one is native? All of them are!
Or none of them are! Computer memory has a native flat structure - any other structures desired must be imposed upon the flat structure and usually come with some implicit implementation (e.g. with Eytzinger arrays and the like).
From this, we can conclude that “native graph storage” is just marketing mumbo-jumbo.
In Dgraph, we use an extended edition of an edge list. We call it a PostingList
.
On Key Value Stores
If you have gone through the examples above, you might note that in almost all serialization methods, you could structure the data as some sort of key:value pair. (In fact, tuples are pretty damn powerful objects - you can invent a large portion of mathematics just armed with tuples and some underlying theory(your choice of set theory, homotopy type theory or other things))
In the adjacency list for example, the key is the current node, the value is a list of nodes it is connected to. In the edge list for example, a key is the current node, and a value is another node.
Even the metadata are key:value pairs!
For some reason, edge lists are a good balance of performance and ease of access. Paired with a good index and good algorithms to find things in an edge list, you have a pretty good graph structure.
With this in mind, we built Badger to service our need. It allows us to serve PostingList
s with great efficiency.
p/s:
I have read the entire ebook from TigerGraph. Sadly it doesn’t mention how they store graphs. The chapter you recommended for reading only serves to spread FUD on building your own graph architecture.