Dgraph is rapidly gaining reputation as an easy to use database to build apps upon. Many new users of Dgraph have existing relational databases that they want to migrate from. In particular, we get asked a lot about how to migrate data from MySQL to Dgraph. In this article, we present a tool that makes this migration really easy: all a user needs to do is write a small 3 lines configuration file and type in 2 commands. In essence, this tool bridges one of the best technologies of the 20th century with one of the best ones of the 21st (if you ask us).
Two ways of understanding the universe
Before we dive into the details, let’s quickly glance over the data models of SQL databases and Dgraph. Real life objects have certain attributes: a person has a name, age, and date of birth; a movie has a name, release date, and genre. SQL databases represent all the attributes of an object by storing them into a single row in a table. Dgraph, on the other hand, represents these attributes as edges around a node.
Both systems work equally well if we stop here and consider attributes of objects as the only type of relationships in the universe. However, the real universe is more complex than that. The 2nd type of relationship is between objects.
As a concrete example, an actor is a person that has starred in a number of movies; a movie has one or more directors, one or more writers, and a generally larger number of actors. In order to logically link a concept (e.g. a movie) with another (e.g. an actor), a SQL database would normally have separate tables to represent the relationships. Thus, finding the names of the people that starred in a given movie requires:
- locating the movie identifier
movie_id
by searching for its name in themovie
table, - finding the identifiers of the people who starred in that movie
person_id
from a secondacting
table, and finally - finding the rows in the
people
table with thepeople_id
s above and extracting the values in thename
column.
If that seems already cumbersome, you might now want to consider how much more difficult it can get
for recursive relationships. Imagine there is a people
table, and a children
table mapping each
person_id
to the person_id
s of their children. The logic to find all the descendants of a person
with SQL is not trivial and requires a bunch of back and forth interactions between the application and the
database.
Graph Databases come to rescue in situations where SQL would have needed many nested join
operations: in Dgraph, inter-object relationships are represented by edges and edge traversal from
one node to another is supported as a first-class operation. This seemingly simple difference
makes an entirely new class of operations possible. Finding all people who are related to a movie
regardless of how they are related (using the expand(_all_)
operation), finding all of a person’s
descendants regardless of how many generations they are apart (using the @recurse
directive), finding the
shortest connection path between two people in a social network (using the shortest
function),
etc. The possibilities are limitless.
If you feel Dgraph is the right fit for your use case, let’s dive a little deeper and see how the migration tool works by studying an example. Below, you can see a simplified version of the database schema supporting StackOverflow:
create table users (
Id int,
DisplayName varchar(200),
PRIMARY KEY (Id)
);
create table posts (
Id int,
AcceptedAnswerId int NULL,
Body text,
OwnerUserId int NULL,
LastEditorUserId int NULL,
Title varchar(1000),
PRIMARY KEY (Id),
FOREIGN KEY (OwnerUserId) REFERENCES users (Id),
FOREIGN KEY (LastEditorUserId) REFERENCES users (Id),
FOREIGN KEY (AcceptedAnswerId) REFERENCES posts (Id)
);
create table comments (
Id int,
PostId int,
Text varchar(1000),
UserId int,
PRIMARY KEY (Id),
FOREIGN KEY (PostId) REFERENCES posts (Id),
FOREIGN KEY (UserId) REFERENCES users (Id)
);
create table votes(
Id int,
PostId int,
CreationDate varchar(100),
PRIMARY KEY (Id),
FOREIGN KEY (PostId) references posts (Id)
);
Each of the four tables stores objects that have their own attribute(s):
- A user in
users
has aDisplayName
; - A post in
posts
has aTitle
and aBody
; - A comment in
comments
has someText
; - And a vote in
votes
has itsCreationDate
.
Besides the basic attributes, the objects are also interconnected:
- A post in
posts
has an owner and hence a columnOwnerUserId
referencing a row in theusers
table. - A comment belongs to a post and hence the
comments
table has aPostId
column referencing a row in theposts
table.
To illustrate the migration process, we’ve selected some sample rows from these tables. We will walk through each table, and explain how they can be converted into N-Quad entries. The RDF N-Quad format is one of the formats that can be used for importing data into Dgraph (the other accepted format is JSON.)
# users
| Id | DisplayName |
|-------|-------------------|
| -1 | NULL |
| 14 | Jimmy Hoffa |
| 15 | Mooseman |
| 2089 | MrPhooky |
| 12370 | Paul Wesselkamper |
# posts
| Id | AcceptedAnswerId | Body | OwnerUserId | LastEditorUserId | Title |
|-------|------------------|-----------|-------------|------------------|-----------|
| 9 | 11075 | <p>I try | 14 | 2089 | How can.. |
| 11075 | NULL | <p>When I | 12370 | -1 | |
# comments
| Id | PostId | Text | UserId |
|----|--------|---------------------------------------|--------|
| 4 | 9 | Of what fabric are the blankets made? | 15 |
# votes
| Id | PostId | CreationDate |
|----|--------|-------------------------|
| 10 | 9 | 2014-12-09T00:00:00.000 |
Converting SQL rows to RDF N-Quad entries
N-Quad entries have the format <subject> <predicate> <object> .
, so the single row in the comments
table is converted into the following N-Quad entries:
_:comments.4 <comments.Id> "4" .
_:comments.4 <comments.Text> "Of what fabric are the blankets made?" .
JSON equivalent (for reference):
{
"uid": "_:comments.4",
"comments.Id": 4,
"comments.Text": "Of what fabric are the blankets made?",
}
Take the first entry _:comments.4 <comments.Id> "4" .
as an example:
-
_:comments.4
is the subject, -
comments.Id
is the predicate, -
"4"
is the object, and finally -
.
is the terminal symbol of each N-Quad statement.
In Dgraph, a subject that begins with _:
is called a blank node label, and it represents an object that is still to be created.
Thus the two N-Quad entries tell Dgraph that
- a new object needs to be created for the comment,
- the new object has two attributes
comments.Id
andcomments.Text
, whose values are"4"
, and"Of what..."
respectively.
Notice that the blank node label for a given object must be unique across all the generated N-Quad
entries. To ensure uniqueness, we leverage the fact that the values in the primary key columns of a table
are unique so we construct the blank node labels by appending the primary key values after the table
name. The row in the comments
table whose primary key is 4 has the blank node label
_:comments_4
. However some SQL tables do not have any primary keys. To solve the problem, we would
maintain a counter to generate unique labels for each row in such tables.
Another interesting design decision is that we dropped the values in the PostId
and UserId
columns, because they should be represented
using edges in Dgraph. We will explain this in more details a little later.
Going back to the four N-Quad entries, it demonstrates the first and second rule for the conversion:
(1) To generate unique blank node labels, the values in the primary key columns are appended to table names if the table has a primary key; otherwise an increasing counter will be used.
(2) Each SQL table cell except those in foreign key columns is stored under a predicate name, which is constructed by concatenating the table name and the corresponding column name.
Similarly the 3 other tables “posts”, “users”, and “votes” can be converted by following the two rules:
_:posts.9 <posts.Body> "<p>I try to use..." .
_:posts.9 <posts.Id> "9" .
_:posts.9 <posts.Title> "How can I keep 2 blankets together on a bed?" .
_:posts.11075 <posts.Body> "<p>When I was in..." .
_:posts.11075 <posts.Id> "11075" .
_:posts.11075 <posts.Title> "" .
_:users.-1 <users.DisplayName> "" .
_:users.-1 <users.Id> "-1" .
_:users.14 <users.DisplayName> "Jimmy Hoffa" .
_:users.14 <users.Id> "14" .
_:users.15 <users.DisplayName> "Mooseman" .
_:users.15 <users.Id> "15" .
_:users.2089 <users.DisplayName> "MrPhooky" .
_:users.2089 <users.Id> "2089" .
_:users.12370 <users.DisplayName> "Paul Wesselkamper" .
_:users.12370 <users.Id> "12370" .
_:votes.10 <votes.CreationDate> "2014-12-09T00:00:00.000" .
_:votes.10 <votes.Id> "10" .
Note: the column formatting was added for your reading, the migration tool would not add the extra whitespaces.
Besides the basic attributes, we also need N-Quad entries to represent the inter-object relationships.
Consider how we converted the PostId
and UserId
foreign keys in the comments
table:
_:comments.4 <comments.PostId> _:posts.9 .
_:comments.4 <comments.UserId> _:users.15 .
Taking the first N-Quad entry _:comments.4 <comments.PostId> _:posts.9 .
as an example. The
subject is a blank node _:comments.4
. The predicate comments.PostId
is again the table name concatenated with the
column name. The object _:posts.9
is different from a plain attribute value – it is the
blank node label of a row in the posts
table, and derived by looking up the value 9
in the posts
table.
From these entries come our 3rd rule of conversion:
- (3) Inter-object relationships are derived by following foreign key constraints, where the blank node labels in the N-Quad objects are calculated through lookups in the target table.
Following this rule, we can establish relationships between posts and users, between posts and posts, and between votes and posts:
_:posts.9 <posts.AcceptedAnswerId> _:posts.11075 .
_:posts.9 <posts.OwnerUserId> _:users.14 .
_:posts.9 <posts.LastEditorUserId> _:users.2089 .
_:votes.10 <votes.PostId> _:posts.9 .
Deriving a Dgraph schema from SQL
Besides converting the data, the migration tool also needs to derive the schema of each predicate. We have two simple rules for converting the schema:
- (1) For plain attributes, there is usually a one-to-one mapping between a SQL data type and the
Dgraph datatype. For instance, the
Body
column in theposts
table is of typetext
, and hence, the predicateposts.Id
is of typestring
:posts.Body: string .
- (2) The predicates representing inter-object relationships, e.g.
posts.OwnerUserId.
, simply have the type[uid]
, meaning following the predicate leads us to a set of other objects.
These two rules give us the following schema when converting the SQL tables above:
comments.Id: int .
comments.Text: string .
comments.PostId: [uid] .
comments.UserId: [uid] .
posts.Body: string .
posts.Id: int .
posts.Title: string .
posts.AcceptedAnswerId: [uid] .
posts.OwnerUserId: [uid] .
posts.LastEditorUserId: [uid] .
users.DisplayName: string .
users.Id: int .
votes.CreationDate: string .
votes.Id: int .
votes.PostId: [uid] .
Once the migration tool finishes, two files will have been created:
- an RDF file is created containing all the N-Quad entries, and
- a schema file.
The two files can then be imported into Dgraph via the Dgraph Live Loader
or Bulk Loader. It’s worth pointing out that sometimes you
may want to customize the schema, e.g. adding an index to a predicate, or by
changing an inter-object predicate from unidirectional to bidirectional links by adding the
@reverse
directive. If you would like such customizations, you should do it by editing
the schema file generated by the migration tool before feeding the files to the Live Loader or Bulk Loader.
Demo time!
It’s time to sit back, relax, grab something to drink and enjoy a demo with the migration tool in action:
The step-by-step instructions for running the demo can be found here.
Besides the demo above, we’ve tested the tool using a larger data set from the lifehacks site on StackExchange, which showed us that for a dataset with around 10,000 rows the migration takes around 12 seconds (might vary on your machine).
Summary
To recap, we presented the design of a migration tool that exports SQL tables into NQuad entries. To make this blog easier to understand, we have simplified the problem, and left out some tricky details such as multi-column foreign key constraints. As a result, the gist of the design is straightforward and can be summarized into 3 rules:
- To generate unique blank node labels, the values in the primary key columns are appended to the table names if the table has a primary key; otherwise a counter is used.
- Each SQL table cell is stored under a predicate name, which is constructed by concatenating the table name and the corresponding column name.
- Inter-object relationships are derived by following foreign key constraints, where the blank nodes in the N-Quad objects are calculated through lookups in the target table.
From there, the generated NQuad entries can be imported into Dgraph with the Live Loader or Bulk Loader.
If you find this tool useful or have any suggestions for improvements, please drop us a note at our discussion forum.
Top image:
This is a companion discussion topic for the original entry at https://blog.dgraph.io/post/migrating-from-sql-to-dgraph/