Transitioning from a Relational DB to a Graph DB

Well, we made probably one of the most difficult choices halfway through Developing our Application, but it ended up being an easy decision. Here is our (somewhat) short story of making the transition from a Relational Databse (MySQL in our Case) to a Graph Database (Dgraph). Most of this post will be pointers on how to make the transition and what we learned that applies specifically to Dgraph but can also apply in the broader spectrum to most Graph Databases.

Our setup was complicated to say the least. Being very familiar with MySQL before we decided to use it in our Stack originally. We did nail down the fact that we wanted to separate the UI from the data API. We plan on developing more UI’s down the road and wanted to have access to the same business logic in an API layer. We decided that it was best to have a GraphQL API.

At that time, we thought we could build a good GraphQL API backed with a relational database. We liked the NoSQL concept of Mongo, but there were still some limitations and it did not fit well in our stack. We found that using MySQL 8.0.17+ we had access to many more JSON functions inside of MySQL which allowed some NoSQL type work within our schema controlled database.

This worked well for a little bit. Relationships started becoming harder to build out and maintain. The granular access control we wanted started to make matters more complicated. We could take a single column and store JSON in it and even search within that JSON using MySQL queries, but every time we turned around it seemed like we were hacking every step instead of finding an actual solution. We resolved the over-fetching of data (for the most part) by only selecting columns based on field selection in GraphQL, but we still had no great solution to the N+1 problem. Another big concern was creating nested filters using our custom built GraphQL API. At this time, our GraphQL API resolvers were over 50K lines of custom hand written code and we were 6 months into development.

After working with a GraphQL layer I realized that some functionality that we wanted would actually work better coming from the lower levels of the query up instead of the top level going down. I tried to conceptualize how to do this in a relational database, and it started to prove very costly in terms of performance and query times. We wanted to be able to filter by the JSON fields to just find the stored relationships, but that many times would mandate a full table scan, Yikes! We discussed about making tons of pivot tables, but that would complicate other matters worse and simple queries with JSON columns would then become complex JOINs and reduce flexibility that we wanted to have. So here we were 6 months into development and we threw in the towel and started to look for a true graph database.

Honestly, we were interested in either Neo4J or Dgraph. We were aware that Dgraph was upcoming and not yet full featured, and Neo4J has had a history of performance.We saw that both were still in active development, both had discourse community, and both had features that we believed could add value. The final decision between the two came down to two things: 1) Dgraph’s generated GraphQL layer. We really didn’t want to rewrite a full set of resolvers if we were really going to make this transition. But the bigger and final straw that made the decision was 2) the devs involvement in the community. It wasn’t just other users answering my questions, but the devs themselves. @mrjn Thank you for your core values and implementing new hire policies to require community involvement.

Okay, now with out history out of the way, what did we learned that helped make the transition from a relational database to a graph database?

Learn How to Speak

Jumping right into a graph database you will instantly realize that the lingo is way different. In RDBs have terms such as Tables, Row, Pivot Table, and Column/Field, Joins, Where clauses, Has Clauses, and Group By clauses. In most graph databases we will see terminology such as Types, Node, Edge, Predicate, Sub-Graphs, Filters, and Directives. These each map relatively closely in concept to the other:

  • Tables = a group of related data that consists of the same structured properties. In a GraphDB, this concept is referred to as a Type.
  • Row = a group of data linked together usually with a single id. In a GraphDB, these are usually referred to as a Node. However with a graphDB any node can have a uniquely set of properties. Like How we were using JSON fields to store differing object properties in a column on a row.
  • Pivot Tables = a table used to make many-to-many relationships between specific rows of specific tables. One-to-one relationships are usually done with a single column in the respective side for a link to the opposite. One-to-many relationships are usually done with a single column in the “many” side table linking to the one parent. This concept is referred to in a graphDB as simply an Edge. And Edge can be a one-to-one, one-to-many, or a many-to-many relationship type by specifying that it can link to a single Node or multiple Nodes using the Array syntax.
  • Column/Field = a predefined and controlled property in a table that reserves space for each Row. In a GraphDB, this concept is commonly called Predicates, or sometimes referred to as simply Properties of a Node. In theory, any Node can have any Predicate. An API layer may place additional logic that limits a Node Type to a determined set of Predicates.
  • Joins = a clause in a query to link together tables to select multiple tables of data into single rows, only capable of producing single rows of data. In a GraphDB, this is greatly simplified with Sub-Graphs, nested joined data is displayed as a sub-graph instead of single rows. This helps make the data transferred rate go down as the same data is not needed in every sub row of a parent row. It displays it in object property notation, just as we think about building it.
  • WHERE/HAS/GROUP BY Clauses = a way to filter a query returned rows to only a matched set. In Dgraph this concept is simply called Filters. Filter capabilities are very dependent on the implemented resolver layers or the query language used by the GraphDB. This is probably the biggest difference from RDBs in terms of implementation because most SQL-like RDBs use very similar clauses and functions, which is not the case for competing GraphDBs.
  • Extra: Directives = a special command used by the GraphDB query language to pass additional logic and commands to take various actions on queries and mutations. Dgraph uses Directives in the schema to generate the desired GraphQL API and apply special logic, authorization rules and indexing.

Stop Table Counting

In a RDB all of the schema is usually hand crafted. This means that every table, and every column type is meticulously crafted to produce a schema that can be queried as needed by an API layer. A NoSQL DBs leaves much of this open for interpretation on the fly which has some advantages and some disadvantages. A DB administrator is usually very particular with how many tables his schema contains. The more the tables, the more JOINS will be needed to get to the associated data. If a relationship will always be a one-to-one relationship, it is common to add columns instead of another table. It is cheaper for a RDB to make a selection of any set of columns from a single row than it is to join two rows from two separate tables together. In a GraphDB, this reasoning is no longer an issue. As any Node can have any Predicate and any Edge we no longer need to worry about having too many joins. Continue reading for a better understanding…

Happy Joining

Don’t be afraid of thinking deep in a GraphDB. As discussed above, a RDB can struggle with joins, but this is where a GraphDB thrives! A developer transitioning from a RDB to a GraphDB should change his thinking from wide schema to deep schema. Want to make a tree of 10 layers deep? In a RDB, that would require making 10+ joins (depending on the type of relationships with Pivot Tables), but in a GraphDB, it becomes the menial task of asking for 10 linking edges in a single query. Most developers making a GraphQL API layer on top of a RDB will struggle here with JOIN complexity and create the dreaded N+1 problem by make 10+ separate queries and heavily rest upon concurrent queries using the crutch of IN(...) to get the children of each parent level. The N+1 problem is non-existent in a GraphDB.

Marry Anybody and Everybody

In designing RDBs every relationship has to be perfectly thought out before implementation. Will this be a one-to-one, one-to-many, or many-to-many? Are pivot tables necessary? Is there any information about the relationship that needs to be store in the pivot table? What relationship data should the pivot table accommodate? It gets quite complex. In a GraphDB thinking deep instead of wide, we can make relationship Types and add relationship properties and link from anything to anything. This opens up a world of possibilities like making reverse look-ups that were previously very costly in a RDB.

Forget Your Name - Rethinking Namespacing

In a RDB, systems are usually namespaced on the same server usually by a database name. With DGraph, we can handle namespacing several ways. I prefer to handle namespacing inside of the data to keep my related data connected yet separate. For a system that needs to handle users, their payments, and then their data, it could be setup to with auth rules to limit data by ownership edges or even user type or user roles. This helps keep the data all in a single space to enable the blissful deep graphs needed many times for management purposes and keep respectfully private data still private. No longer do I have to query across tables and join across databases as I did before with a RDB.

Stand on Your Head

Sometimes the easiest way to get a quick grasp on a graphDB is to think upside down from traditional RDB. Continue reading for further explanation.

Think filtering before field selection

The query syntax of most RDBs is to select the fields add joins to get related data, and then apply where clauses. In a graphDB though it is better to think of what data you want in terms of filters and directives such as cascade then work on getting the predicated amd sub graphs that you need.

Think grouping before field selection

Just like filtering, grouping is usually an after thought on RDB queries. However with graphDBs, it os better to start the root of the query from the type you want to group by.

Flipping queries for better performance

In an RDB, there is usually only a single way to get a result set. This is not the case with a graphDB. You can literally flip the graph upside down. For instance, if you want to get all posts created by an author in a RDB you would query the posts table and then add a where clause to look for an author id. In a graphDB though it is better to get the author in the root query and then traverse down an edge to posts Authored.

Schema Loose

To sum it all up, let the schema be free to change. In most RDBs schema is highly controlled and asking for a change takes an act of congress. In a graphDB you can run an ever changing and mutating schema that matches each specific use case as the needs arise.

11 Likes

here you go @chewxy

Thanks for this. Still reading, but when I read this

Yikes.

1 Like

how about the performance? any benchmark?

this really depends on the complexity of the query and the filters. If we are talking subgraphs vs. JSON columns in MySQL, dgraph blows it out of the water (by being better). We did some smaller comparisons but I don’t have any of that data anymore.

We wondered if a single row query by id with all columns was faster with MySQL, and I thought it would be, but that was not the case either. These rows had JSON fields and we were just selecting the entire JSON string without any functions.

This topic was automatically closed 41 days after the last reply. New replies are no longer allowed.