(What is) Native Graph Storage

Hi, I just wanted to understand the point around Dgraph using Native Graph Storage. I understand it uses the Key Value Badger store (and there is limited processing/translation required - due to the key being composed of UIDs and UIDs + predicate names). I also know that DSE Graph uses Cassandra, but in the DSE Graph case it is seen as a “slap-on” type solution. How is a Badger vs a Cassandra under the hood different in terms of a “Native Graph Storage”?

Another part of the puzzle for me, is TigerGraph claim that any NoSQL / Key Value storage layer is maybe not a good idea, and summed up here https://cdn2.hubspot.net/hubfs/4114546/WhyGraphDatasheet.pdf
Focusing on the “Native Graph Storage” category, I guess they are claiming that Dgraph is 2.0? Btw, I have asked a question TG folks to explain exactly what their storage mechanism is and can update with any info i get.

Also would be interested in some counter points / Dgraph colour to the research that TigerGraph have in their ebbok “Resources - TigerGraph” Native Parallel Graph EBook" and their chapter “Building a Graph Database on a Key-Value Store?”

I can see TigerGraph are solving a different problem potentially, and am really just interested in the question around “Native Graph Storage”…

Thanks

Personally I know very little about Cassandra. I think others can step in to clarify this. For me, Cassandra is a SQL like DB(with their CQL) but with NoSQL characteristics. Not sure how their Graph solution works.

The SQL part is true, the NoSQL part doesn’t fits in Dgraph’s analogy. It says “NoSQL databases store all of the data in a single table.” Each predicate on Dgraph is stored separately. And Dgraph isn’t a Key Value logic per se. The graph connections happens natively due the UID system. I see no real difference between a Graph based on files in disk and Dgraph. Just the abstraction approach.

Maybe, need to evaluate each part of it.

That would be very useful for comparison.

Using Badger is a convenient way to handle the data. There is nothing fancy between Dgraph and Badger, it is as if Badger and Dgraph were one single thing (after all Badger is an embedded DB).

You can read the paper for more details Dgraph Whitepapers: In-Depth Insights and Analysis

More native than that, it would be necessary to create a direct filesystem control(Badger does that for Dgraph, Badger “abstracts” that for Dgraph). In SQL Server, each ROW is written as an individual file to disk. In Dgraph it is similar, except that Badger manages the files. It records everything in .vlog and .sst. Data, indexes, and other values ​​are recorded in these files. Accessed directly by Dgraph. The process is so minimal, that you shouldn’t even care that there is a KV-DB underneath.

The difference per se is in the sense of complexity between a non-native DB and Dgraph. A Graph DB that is based on SQL, it needs a SQL-DB and does SQL operations. As for Dgraph, don’t, it only needs to record in Keys and he controls the structures with his own algorithm. Besides having its own language.

Dgraph is not a translation as it would be in a Graph in SQL. That you need to translate your Graph language to SQL. I think the paper can help you out.

Cheers.

2 Likes

My answer is going to be a bit different from @MichelDiz.

A Thought Exercise

We start with a thought exercise, how would you store a graph? Both as an abstract data structure and as a concrete, stored-on-disk object.

We’ll use a concrete, running example for this graph, expressed nicely in the lovely dot language (the intent is not to confuse, but to illustrate, which is the sole purpose of graphviz):

digraph{
    a -> b;
    b -> c;
    b -> d;
    d -> a
}

This gives a graph that looks like this:

image

How do we represent and store this graph? I will walk thru some representations and then discuss storage.

Representations: Abstract Data Structures

There are many abstract data structures available to us to store a graph.

Matrices

We can for example, represent the graph as a matrix (in a real matrix, only the numbers are stored, the labels in bold are written here for clarity):

a b c d
a 0 1 0 0
b 0 0 1 1
c 0 0 0 0
d 1 0 0 0

Here, 1 represents a directed edge from row to column. So, to read the matrix, read it row-wise. The row in italics is read:

b is not connected to a; b is not connected to b; b is connected to c; b is connected to d

There are two ways a graph can be representated as a matrix. What I have listed here is called an incidence matrix.


Adjacency Lists

We can represent the graph as an adjacency list:

a: [b]
b: [c, d]
c: []
d: [a]

This may be implemented as an associative array/map/dict, where the items to the left of the : are the keys and the items to the right are the values. The values are a list of connected nodes.


Edge Lists

We can represent the graph as an edge list:

{a, b}
{b, c}
{b, d}
{d, a}

Here, each {V,V} pair represents a (directed) edge in the graph. Having this list allows you to recreate the graph entirely.


Naive Representation

Finally, we also have representation of graphs by pointers. I will use the Go syntax for clarity:

type Node struct{
     pointsTo []*Node
}

Storage: Serialization of Data Structures

Having now explained some of the possible ways to represent graphs, let’s look at how graphs are stored. We will consider storage as an act of serializing the data in the data structure in a flat file.

With the exception of the matrices, all the other representations assumes that a node is a pointer to a datatype.

In this section I will also give a lot of examples of what gets written to a flat file. Things written above ---- in the examples represents metadata of the graph (things like node names node values). Things written below the ---- line in the examples represent data of the graph.

Also, for the examples I will go with a convention of having a newline represent a new data type.

Matrices

The matrix is by far the easiest to serialize. A matrix is essentially a flat data structure that has been given special access patterns to mimic structure. For more info, see my talk on generic multidimensional arrays in Go.

In the example above, we’d simply write this array of 1s and 0s to a file:

0100001100001000

Additional metadata (such as node information, how many nodes there are in a graph) can be written as additional lines in the flat file. So our flat file might look something like this in its entirety:

nodes:4
names:a,b,c,d
----
0100001100001000

So, if it’s so easy to store, why not use matrices? Because with matrix representations of graphs, it’s very difficult to add new nodes. Every time you add a new node, you need to reallocate a new matrix, and then copy the old matrix into the new matrix. It is inefficient.

Adjacency Lists

How do you store an adjacency list? Well, as the name suggests, it’s a “list”, so there is somewhat a flat structure already.

A naive implementation of adjacency list would look something like this:

type Graph map[*Node][]*Node
type Node struct { name string }

So with this, we see that we have to first serialize a lookup table to identify which node goes where.

One way of serialization is something like this:

nodes:4
names:a,b,c,d
---
0|1
1|2,3
3|0

Let’s take a closer look:

The metadata indicates that there are 4 nodes in this graph. The nodes are named, and indexed by their name, so that node 0 has a name a, and node 1 has a name b and so on and so forth.

In the data, we see that the nodes are represented by their numeric IDs. One can quite easily see the adjacency list in the serialization.

The issue with serializing adjacency lists is that the adjacencies may be variable in length. For example, node 1 has 2 listings.
This makes serialization/deserialization a bit harder, as you now have to parse and then split by "|| and then split by `“,”.

Edge Lists

Edge lists also requires the serialization of metadata. We’ll serialize it in the same way as we did for adjacency lists, leaving a flat file that looks like this

nodes:4
names:a,b,c,d
----
0 1 1 2 1 3 3 9

Here we can see simply parse that one line of data, split by " " and then read every 2 items as an edge. There is less work to do when deserializing.

The graphviz snippet from right at the top of the page also shows a graph in edge list format.

Naive Representations

Naive representations are the hardest to serialize. Because the inherent structure is not a list, you’d first have to walk the graph to generate a list of nodes, either in BFS or DFS order. Only then can you serialize.

I’ll leave this as an exercise to the reader.

What is Native Graph Storage?

So, if you have gotten to this point, you will have noted that there are many ways to store a graph. Which one is native? All of them are!

Or none of them are! Computer memory has a native flat structure - any other structures desired must be imposed upon the flat structure and usually come with some implicit implementation (e.g. with Eytzinger arrays and the like).

From this, we can conclude that “native graph storage” is just marketing mumbo-jumbo.

In Dgraph, we use an extended edition of an edge list. We call it a PostingList.

On Key Value Stores

If you have gone through the examples above, you might note that in almost all serialization methods, you could structure the data as some sort of key:value pair. (In fact, tuples are pretty damn powerful objects - you can invent a large portion of mathematics just armed with tuples and some underlying theory(your choice of set theory, homotopy type theory or other things))

In the adjacency list for example, the key is the current node, the value is a list of nodes it is connected to. In the edge list for example, a key is the current node, and a value is another node.

Even the metadata are key:value pairs!

For some reason, edge lists are a good balance of performance and ease of access. Paired with a good index and good algorithms to find things in an edge list, you have a pretty good graph structure.

With this in mind, we built Badger to service our need. It allows us to serve PostingLists with great efficiency.

p/s:

I have read the entire ebook from TigerGraph. Sadly it doesn’t mention how they store graphs. The chapter you recommended for reading only serves to spread FUD on building your own graph architecture.

13 Likes

I’d recommend reading the paper on Dgraph: Graph Database White Paper | Dgraph . It has all the concepts of what makes Dgraph right and different from a graph layer.

3 Likes

Wow, that is amazing answer @chewxy
That has really clarified a bunch of blind spots for me and appreciate the effort

No problem. Happy to help. And I hope it was clear

1 Like

thanks @chewxy, as Columbo would say, one more thing…

Can Dgraphs “Native Graph Storage” / subsequent retrieval technique be described as “index-free adjacency”. It feels similar, but cant see that term used anywhere in Dgraph docs. Thanks

“Index free adjacency” is mostly a content-free marketing term I am afraid. There was a very interesting kerfuffle at wikipedia over this issue.

Basically what an index-free adjacency means that you don’t need a lookup to go to the next connecting node. If you store your data in a SQL database, then you would need multiple jumps to get to the next node, often requiring an “index”/“lookup” table .

This is easily done in Dgraph, because the data is not stored in row-based tables like SQL databases. So to answer your question, yes, it can be described as “index-free adjacency”.

Side note:

All databases require indices for fast lookup of data. Neo4j, which mainly pushes the term “index-free adjacency” has a root B+ tree index to facilitate lookups of nodes. Dgraph uses a heap-based index to facilitate lookups of PostingLists (@mrjn to fact check me).

The term is mainly marketing because it purposefully conflates the various meanings of “index”, so people think “aha, there is no need to do B+ trees like in SQL, win!”. Behind the scenes, it’d be nigh impossible to access data fast without good indices.

2 Likes

Posting lists is a search engine concept. The idea is to keep track of all doc ids which have a search term, so you can quickly look for one term or intersect multiple lists corresponding to multiple terms to find the right docs. Different from heaps.

2 Likes

Thanks. I stand corrected.

1 Like