SQL is slowing you down - Dgraph Blog


(Manish R Jain) #1

When writing this post, I was hesitant. SQL is so important that every one who has a degree in CS (or not) is technically a SQL developer. Complaining about SQL is a good way to lose friends and get trolled. I don’t want to say that you should stop using SQL, or that SQL is bad or doesn’t perform, but I also don’t agree that SQL should be an obvious choice for developers. This post sheds light on why I think a graph database like Dgraph is a better choice for developers today.

We have been taught, conditioned, trained to use SQL all our lives as engineers. It was there in schools, there when we went to college. It was being used at the company that we joined. It was such a common interview question that it no longer is. We don’t have just one, but an array of SQL databases to choose from. MySQL was released 22 years ago, in 1995 (youngest engineer at Dgraph was born the same year). PostgreSQL was released one year later. In contrast, MariaDB was released in 2009.

What’s amazing about SQL is its sheer simplicity. Select something where something, order by something, limit by some number. This query probably represents most DB queries run on the planet. You can achieve the same thing with a single bash command over CSV files. I joke that SQL DBs are a sort of glorified CSV file servers.

But, that’s also where lies its weakness. Its sheer simplicity. Because SQL does fewer things, application developers need to write more logic.

Issues with SQL

SQL is flat. It does joins, but nobody I know runs them in production at scale. SQL joins are slow under high traffic. In fact, companies go to lengths to avoid doing a join in production. Their tables are littered with pre-computed counts and data duplication to avoid touching two tables to run one query. So, for all practical purposes, you can say that SQL doesn’t do joins.

SQL doesn’t do recursion. Most web applications need that. Take Stack Overflow for e.g. You start with a question, then retrieve its answers, tags, comments. Then comments on answers, then upvotes, downvotes. Then, the author and editors of the question, the answers, the comments. It’s a lot of recursions. All that retrieval logic needs to be baked into the backend.

SQL enforces strict schema. Modern development is a very iterative process. You build one feature, then build another one on top, then another on the side. And that requires frequent changes to columns or column types. This becomes particularly hard in a live system with SQL which enforces strict schemas. Not only would modifying a column type require rewriting the entire table, which would block everything including select during this move; the older code would not work with an altered table, requiring you to potentially restart the system. This ugliness is just something that most teams have learned to live with (and not question).

The alternative

Dgraph does joins. Being a graph database, join is a first-class citizen in Dgraph. It’s optimized for expanding, intersecting and combining results from multiple entity types very fast. In fact, the internal mechanisms are exactly how search engines work.

Dgraph does recursions. Recursion is essentially an edge traversal. Traversing edges is very integral to Dgraph. Expanding out from a node is an O(1) time complexity operation, a look up in hash map, or a single read from disk.

Dgraph supports flexible schema. Dgraph separates storage layer from schema layer. A developer can choose how much schema enforcement they want. Modifying data types do not require data rewrite; and because (the equivalent of) tables are sparse in a graph database, one does not need to fit everything into predefined columns, or deal with NULLs.

Stack Overflow

To put something concrete up for discussion, we’ll see how data gets arranged in SQL and compare that with how it would be arranged in Dgraph. Folks at Stack Exchange generously provide dumps of their data, and the schema that they use run the site over SQL.

We’ll pick the schema that they use to store and serve the data that millions of developers browse through every day, and compare that against how it would be if Dgraph were the underlying database instead of SQL.

Let’s start with the User schema, and see how SQL would compare with Dgraph.

User

SQL Dgraph Reputation Reputation CreationDate CreationDate DisplayName DisplayName LastAccessDate LastAccessDate Location Location AboutMe AboutMe Age Age Views Views UpVotes DownVotes

Here we have user’s reputation, creation date, display name, last access date, location, about, age (not date of birth?) and views (not tracking every single page access, just a counter). All those are base data, i.e. new and unique data that we must store.

Then, the table has counts of upvotes and downvotes. But, those are being stored elsewhere in Vote table. These fields here are pre-computed counts to avoid doing joins with Vote table when querying for a user.

Dgraph doesn’t need to store these. It can do counts and sorts on them via queries efficiently.

Let’s look at the Versioning and Post table, where most of the action happens. These tables store the questions, answers, tags, ownership, and authorship information.

Versioning

SQL Dgraph TypeId Type PostId Post (point to Post) CreationDate CreationDate Author Author Text Text

Post

SQL Dgraph TypeId Type ParentId Has.Answer AcceptedAnswerId Chosen.Answer Title (duplicate of versioning) Title (point to Version) Body (duplicate of versioning) Body (point to Version) Tags (duplicate of versioning) Tags (point to Version) OwnerUserId Owner (point to User) Score Score LastEditorUserId LastEditDate LastActivityDate CommentCount AnswerCount FavoriteCount CreationDate

Version table is pretty straightforward. It stores type, post id, creation date, author, and text. Instead of storing ids, Dgraph creates a relationship from Version to Post.

Post table is the interesting one, and shows how much data duplication needed to happen to avoid joins. We have the text of title, body, and tags from the latest version stored in this table. In fact, the latest version isn’t even written. Version only gets created when a user modifies the title, body or tags. And because of this requirement to proxy for Version table, Post table stores last editor, last edit date, which would become the author and creation date respectively when moved to Version table.

All these fields are unnecessary in Dgraph. Dgraph can directly store an edge to the Version node which contains the text of title and body. We chose to store tags differently. For each tag, we create a node and create an edge between the post and the tag. This allows us to run aggregations over tags, and find the most popular and most related tags easily.

Then, we have pre-computed counts of comments, answers, and favorites. Dgraph can generate all this at query time efficiently; using them to do sorting as well. So, they don’t need to be pre-computed in Dgraph.

CreationDate and LastActivityDate are both redundant information, that can be deduced by doing recursion. They form good candidates for optimization later on, but are not needed and hence are not present in Dgraph.

Comment, Vote

SQL Dgraph PostId (Vote/Comment -> Post) (New edge from Post -> Vote/Comment) Timestamp Timestamp Author Author VoteType (for Vote) Score (for both Vote and Comment) Text (for Comment) Text (for Comment)

Finally, we have the comment and vote tables. Because SQL doesn’t store lists of things in a single row, you must create another table and point it back to the original row. A post contains comments. But in SQL, it’s the reverse. A comment table row points to the post. This is such a common hack that it’s no longer considered one.

In Dgraph, a Post has an edge to the vote or comment, not the other way round.

This is how the final Dgraph schema looks. It’s a lot simpler than SQL.

In fact, SQL schema complexity increases rapidly as the relationships among data increase. Each relationship would inadvertently require joins, which would then need to be avoided by duplicating more information across tables. All this adds a significant amount of application logic to deal with.

Conversely, in a graph database like Dgraph, schema is a pretty close representation of your mind map.

Queries

In the previous section, we saw that Dgraph schema is a lot simpler than the equivalent SQL schema. In this section, we’ll see why that is the case, and how Dgraph’s GraphQL inspired query language makes it easy to render various Stack Overflow (referred to as SO) components.

Disclaimer: While reading the queries, note that these queries are just our approximations of how SO works. We don’t work there, and thence, certain things like ranking formulas are just put together to represent the power of the query language. It’s not what SO uses.

Let’s start with the home page. We want to render 100 questions, and all the associated meta information that SO shows.

Question Div

This is the corresponding query fragment to retrieve the data to render one question div.

_uid_  # Unique identifier
UpvoteCount: count(Upvote)      # Counts the number of Upvote edges.
DownvoteCount: count(Downvote)  # Counts the number of Downvote edges.
AnswerCount: count(Has.Answer)  # Counts the number of Answer edges.
ViewCount  # Retrieves the ViewCount property stored.
Title { # Title points to Version node.
  Text  # Retrieves the text of the title.
}
Owner { # Owner of the question, points to the User node.
  DisplayName  # User's display name, reputation, and identifier.
  Reputation
  _uid_
}
Tag {  # Points to multiple Tag nodes.
  TagName: Tag.Text  # Retrieves the text of the tag.
}
Has.Answer(orderdesc: Timestamp, first: 1) {  # Picks the most recent answer.
  Timestamp  # Retrieves Timestamp of the Answer.
  Owner {    # Retrieves the Owner of that Answer.
    DisplayName  # Retrieves display name, reputation, and identifier.
    Reputation
    _uid_
  }
}

Now that we have the question fragment let’s see how to retrieve a list of questions depending upon the various tabs that SO shows.

Latest 100 questions

# All nodes of type question, in descending order by Timestamp, limit by 100.
questions(func: eq(Type, "Question"), orderdesc: Timestamp, first: 100) {
  ${questionFragment}
}

Top 100 hot questions

# Get the latest 1000 questions.
var(func: eq(Type, "Question"), orderdesc: Timestamp, first: 1000) {
  Has.Answer {            # Get their answers
    uv as count(Upvote)   # Count the Upvote edges.
    dv as count(Downvote) # Count the number of Downvote edges.
  }
  ac as count(Has.Answer) # Count the number of Answers.
  cc as count(Comment)    # Count the number of Comments.
  uv1 as sum(var(uv))     # Sum up all Upvotes across all Answers.
  dv1 as sum(var(dv))     # Sum up all Downvotes across all Answers.
  # Put together a rough formula to calculate a final score per question.
  score as math(0.7 + ac * 0.2  + (uv1 - dv1) * 0.4 + (cc) * 0.4)
}
# Order the 1000 questions in descending order of score, pick the first 100.
questions(id: var(score), orderdesc: var(score), first: 100) {
  ${questionFragment}
}

Top Tags

To get the most frequently used tags, we can run this query.

t as var(func: eq(Type, "Tag")) {  # Retrieve all tags. Assign it to variable t.
  c as count(~Tag)  # For each tag, count the number of incoming edges.
                    # This gives us the number of times each tag is used.
}
# For all tags in t, order them by count c and pick the first 10.
# This gives us the ten most popular tags.
topTags(id: var(t), orderdesc: var(c), first: 10) {
  _uid_
  TagName: Tag.Text      # Retrieve the text of the tag.
  QuestionCount: var(c)  # Reusing variable c, retrieve the number of times tag is used.
      # Note that only questions have tags. Hence we use QuestionCount alias.
}

Depending upon the number of tags, this query might be a tad slow. The most popular tags don’t change that often. So, this is a good candidate for caching in application. But, even so, the amount of application code that needs to go in to generate the list of most popular tags is very little (run the query, cache the results).

Now let’s see how to render the question page. This is one complex page, with many components, touching almost all the tables in SQL. We’ll see how we can get all the information in a single query to Dgraph.

Before we dig into the question query, let’s create a bunch of reusable fragments.

Version Fragment

Author {       # Retrieve Author node.
  DisplayName  # Retrieve their display name, reputation, and identifier.
  Reputation
  _uid_
}
Type  # Retrieve type, text and timestamp of Version.
Text
Timestamp

Comment Fragment

_uid_
Author {       # Retrive Author node.
  _uid_        # Retrieve their display name and identifier.
  DisplayName
}
Text  # Retrieve Text, Score and Timestamp of Comment (no versioning here).
Score
Timestamp

Answer Fragment

# Answer is represented as a Post node.
_uid_
Body {  # Retieve body Version node.
  Text  # Retrieve text of Body.
}
# This is used for "answered x time ago."
Owner {        # Retrieve owner User node.
  DisplayName  # Retrieve their display name, reputation, and identifier.
  Reputation
  _uid_
}
Timestamp  # Retrieve timestamp and type of Post.
Type
UpvoteCount: count(Upvote)     # Count the number of Upvote and Downvote edges.
DownvoteCount: count(Downvote)
# Post edge takes us from Version -> Post. ~Post is the reverse of that, taking us from Post -> Version. We order the Version nodes in desc order of Timestamp and pick the first 1. In other words, pick the latest Version of Answer.
# This is used for "edited x time ago."
History: ~Post(orderdesc: Timestamp, first: 1) {
  ${HistoryFragment}
}
# These are comments on answer. Question also have comments which are retrieved in the query below.
Comment {             # Get the Comment nodes.
  ${CommentFragment}  # Retrieve further details per Comment using fragment above.
}

Question

{
  var (id: ${questionUID}) {  # Retrieve the question node.
    Has.Answer {              # Retrieve the answers.
      uv as count(Upvote)     # Count the number of Upvote and Downvote edges.
      dv as count(Downvote)
      answer_score as math(uv - dv)  # Calculate an score per answer based on votes.
    }
  }
  question(id: ${questionUID}) {  # Retrieve the question node.
    _uid_
    Title {  # Retrieve the title Version node.
      Text   # Retrieve the text of the title.
    }
    Body {   # Retrieve the body Version node.
      Text   # Retrieve the text of the body.
    }
    Owner {        # Retrieve the owner of the question.
      DisplayName  # Retrieve their display name, reputation, and identifier.
      Reputation
      _uid_
    }
    ViewCount  # Retrieve some question properties.
    Timestamp
    Type
    UpvoteCount: count(Upvote)  # Count the number of Upvote and Downvote edges on question.
    DownvoteCount: count(Downvote)
    questionTags as Tag {  # Retrieve the question Tag nodes.
      TagName: Tag.Text    # Retrieve their text.
    }
    AnswerCount: count(Has.Answer)  # Count the number of answers.
    Has.Answer(orderdesc: var(answer_score)) {  # Retrieve the Answer nodes, and order them by the score calculated above.
      ${AnswerFragment}  # Retrieve answer details using fragment above.
    }
    Comment {  # Retrieve Comment nodes.
      ${CommentFragment}  # Retrieve comment details using fragment above.
    }
    History: ~Post(orderdesc: Timestamp, first: 1) {  # Just like in Answer, retrieve the latest Version of question.
      ${HistoryFragment}  # Retrieve Version details using fragment above.
    }
  }
  # In the following query block, we use the tags of this question, to get related questions, and some minimal details about them; essentially title and vote counts.
  tags(id: var(questionTags)) {
    relatedQuestions: ~Tag(first: 10) {
      _uid_
      Title {
        Text
      }
      UpvoteCount: count(Upvote)
      DownvoteCount: count(Downvote)
    }
  }
}

If the query above looks complex, don’t worry about it. Just read the comments, and get a sense for how we retrieve each piece of data that we need to render on the page. SO question page as mentioned before is pretty complex requiring recursions to retrieve all details of questions, comments, votes, owner, last editor. Similarly, we need answers, their comments, votes, owner, last editor. On top of that, we can sort the answers displayed by a score generated from vote counts. We can even limit the number of comments displayed if required.

With SQL DB, all of this logic would have needed to be baked into the application. But, with Dgraph, the query language is so powerful that it can all be generated right within the database: doing recursions, computing counts, putting them to generate scores and then sort by these scores. This cuts down heavily on the application code, letting a developer focus on the feature set and the best way to render the data. And that’s powerful!

Full Text Search

Dgraph also supports regular expression and full-text search in many languages. You can create an index on string edges, to achieve this. In this case, we created a full-text index on the Text edge.

Text: string @index(fulltext) . line from schema.txt.

{
  var(func: anyoftext(Text, "${searchQuery}")) @cascade {
  # All Version nodes which have any of the text in search query.
    p as Post {  # Retrieve their Post, assign it to variable p.
      # Pick the latest Version of the Post and check that it does have the search query. This is a bit of a shuffle and can be avoided by having a direct link from Post to its latest version.
      ~Post(orderdesc: Timestamp, first: 1) @filter(anyoftext(Text, "${searchQuery}"))
    }
  }
  # Pick the first 25 such posts.
  posts(id: var(p), first: 25) {
    _uid_
    Type
    question: ~Has.Answer {  # If this post is an answer, retrieve the question by traversing the Has.Answer edge in reverse.
      _uid_
      Title {
        Text
      }
    }
    Title { # Title Version would be present if this Post if Question.
      Text
    }
    Body {  # Body Version would be present if this Post is Answer.
      Text
    }
    # The rest of the fields are as explained before, and are from the perspective of the question Post node.
    Owner {
      DisplayName
      Reputation
      _uid_
    }
    Tag {
      TagName: Tag.Text
    }
    Chosen.Answer {
      Owner {
        DisplayName
        Reputation
        _uid_
      }
      Timestamp
    }
    UpvoteCount: count(Upvote)
    DownvoteCount: count(Downvote)
    AnswerCount: count(Has.Answer)
    ViewCount
    Timestamp
  }
}

Summary

In this post, we showed you how a SQL schema compares against the equivalent schema in Dgraph, basing our analysis on Stack Overflow data. We then picked pages from Stack Overflow and showed you queries to retrieve all the data necessary to render those pages.

As you can see, Dgraph schema is significantly simpler than SQL, utilizing zero or little schema hacks, pre-computation or duplication of data. Also, Dgraph query language is way more powerful than SQL, handling most of the complexity of recursive data retrieval, ranking and sorting within the database, allowing the application to focus solely on the feature set and rendering of data.

Dgraph being a graph database provides sparsely populated fields and easy schema manipulation, which allows developers to modify data types as they iterate over their application, removing the need to do complex upgrades of the entire system.

Not only that, Dgraph makes it a lot faster for developers to iterate. Modifying Dgraph queries is a lot cheaper operation for a developer, than modifying the backend code; and Dgraph provides a nice UI to allow for this.

Finally, Dgraph supports regular expressions, term matching, full-text search and equality matching for string types. All Dgraph edges are unidirectional, but Dgraph allows you to specify creating reverses automatically, to aid in data retrieval. It also supports indexing various other data types, like int, float, datetime, etc.; very useful to build applications today.

Links

All the data generation, querying and rendering code is located at Dgraph’s Graph Overflow repository. Graph overflow site is still a work in progress, and we’ll put it up at dgraph.io soon.

Top image: Dragon is a next-generation spacecraft designed to take humans to Mars.


This is a companion discussion topic for the original entry at https://open.dgraph.io/post/sql-vs-dgraph/

(Ryanackley) #2

So I got to this point in your article:

SQL is flat. It does joins, but nobody I know runs them in production at scale.`

I’m not a SQL nut by any means and some of your article hits the mark but I immediately scrolled down to the comments section to respond to this without reading anything else. It’s a very bold statement because a relational database wouldn’t be relational without joins. It’s one of the things they are designed around. So, what you are essentially saying is that the entire premise behind relational databases has been a failure?

I think you need to go back and qualify or clarify that statement in your article (or maybe just remove it). What do you consider “at scale”? I can get 1000s req/sec against a mysql database running a 2 or 3 table join query where the db is not even breaking a sweat. This is running in a cloud VM on a pretty basic setup. It’s not Google or StackOverflow scale but I would imagine most of your potential customer aren’t either.


(Manish R Jain) #3

The reason I wrote that SQL joins aren’t being run in production at scale, was informed by how Facebook, Stack Overflow and Quora are doing things. But, the fact that many people have reacted so negatively to my assessment clearly represents that they run SQL joins in production, and they’re happy with how it works.

I’ve removed that section of the post entirely. The idea was to focus on the alternative (Dgraph), with an assessment of SQL that I felt people would agree with. That particular section was doing the opposite.

Dgraph has clear benefits compared to running an RDBMS. It provides all the features that an RDBMS provides, but also allows you to do more on top of it; and that the intended focus here.


(Dennis Walsh) #4

First off, really good post!

I’m a business owner, and my backend is Elixir/Absinthe (GraphQL)/Postgres. I don’t have experience with Dgraph, but I do have limited experience with Neo4J. Every time I read a graph database article, I feel a siren on the shores calling to me. “Look how easy it is. Look how intuitive it is. It’ll treat you right, just give it a chance.” But then I come across “why we went back to Postgres” and hear a litany of performance nightmares.

Correct me if I’m wrong, but it seems like you’re arguing to use Dgraph as your main data store. Even Neo4J doesn’t claim that it’ll replace your main data store. In fact, one of their demos shows MySQL and Mongo along side Neo4J running the recommendation engine. (But then even on recommendation engines, you’ll read another article that’s it’s harder to set up the engine in Postgres, but it’s far more performant when done. Ugh) Is there a claim that Dgraph is performant (at least in the same ball park) with, say, Postgres, for rudimentary reads?

If one could actually make a graph database performant with the likes of 25 year optimized MySQL/Postgres/etc, that would be stunning and get my vote. (Hell, it’d still get my vote for recommendation engines that were somewhat performant) But I don’t have the 1000 free hours to validate the claims.

Anything you’d like to type on the topic, I’d love to read.


(Manish R Jain) #5

Hi @LawJolla,

Thanks for bringing up a real issue. I can totally understand where you’re coming from. I’d compare Graph space today to electric car market when Tesla began. You had these failed attempts, nobody thought electric cars are going to see serious usage, there were mileage problems, infrastructure problems, and a general disbelief in the whole technology. When gas is so cheap, why bother with this whole electric stuff.

Graph space today is very alike electric car space then. We have some existing companies who’ve done their bits to extend graph market, but not in a convincing way. Their performance is bad, or their design has flaws. It’s not really built in a way, where someone can just pick it up and it would work for them.

We decided to build Dgraph to solve this problem, once and for all. It’s really meant to be a primary database, as such, durability, performance, scalability – they’re all very important to us. Performance, in particular, is something we look at very deeply all the time. We haven’t benchmarked Dgraph against Postgres, but we have queries on Stack Exchange sites and run them on the same data on Dgraph, and we’re at least as performance on those. One of our team members is going to write a blog post about it.

I’m inclined to just say, “challenge accepted!”. But, I’ll back it up. A better design can bring better performance. We saw that with Badger (our KV store) beating RocksDB (KV store in use by Google and Facebook) in random lookup latency; because we based it upon latest research. Similarly, Dgraph is based on a design that’s meant to work for web-scale – internally, it works very much like a search engine. During those 25 years, a lot of research has been done in this space. It’s hard to change existing designs to use all that research, but when you’re building something from scratch, you can effectively use it, to beat something older.

I don’t think of Graph DBs as a niche – I think Dgraph can run your applications. You can see our performance comparisons against Neo4j, in an older blog post. We’ll also be documenting loading the entire stack overflow on Dgraph, in an upcoming post, which might be of interest to you.


(Dennis Walsh) #6

I’m excited to see it. There’s no question that SQL databases accumulated unpurgable technical debt over the years, but they also accumulated decades of optimizations from some of the best and brightest in computer science. In a world where tech changes every 6 months, there’s a reason SQL databases remain – they got an incredible amount right and they’re optimized.

I’d love to see real world benchmarks that Dgraph is somewhat competitive with SQL in mundane database reads and writes – the majority of the CPU cycles for any primary database. As for Dgraph’s other advantages over SQL, you had me at hello! The graph mental model, joins as a truly first class citizen, simple relationship queries, and the ability to have a fully normalized database because you don’t need to fear the JOIN is exciting x100! (It drives me nuts that I need to store things like “post count” as a separate field rather than simply returning the count of stored posts!)

And a little off topic, but since we’re here, I’d also like more blog posts about Dgraph’s business model. About a year ago, I (and many others) invested serious resources behind RethinkDB. It was (and still is, really) the best document DB. But now that they’re out of business, and it’s too risky to back (until they get a Postgres like core team and following).

I really appreciate your willingness to engage and I look forward to trying out Dgraph in some of my projects!


(Manish R Jain) #7

I wonder if it is because MySQL / Postgres is so good, or is it because the others aren’t. I don’t think their success is due to their own goodness but just lack of alternatives for a long time. Something like MongoDB only started in 2009.

On a tangential point: Programming languages world saw a lot of competition for incumbents. Perl, Python, Go, Scala, JS, what not. Every other year there was a new language to a point where the absolute performance of C/C++ isn’t what’s driving people towards adoption. They want ease of use, fun of working with these newer languages. I think in the database world, that has started to happen now.

We’re trying to integrate with LDBC benchmarks. That should give us a good idea for the performance in comparison with RDBMS.

Exactly!

Dgraph follows MongoDB licensing system: AGPLv3 server, Apache 2.0 clients. On top of that, we plan to add a lot of enterprise features (more on that after v1.0), including a bunch of machine learning plugins to allow plug-and-play behavior.

Also, talking about RethinkDB, I read the postmortem. And the biggest thing which struck me was the perfectionist attitude, which caused them a 3-year delay to release 1.0. We want to avoid that. Already Dgraph is better than other graph databases, in performance, in functionality – so we’re now freezing the feature addition, and focusing on performance, stability, and reaching v1.0. We have a clear differentiation factor from other graph DBs, and as one of our users puts it, “Dgraph is at least 10x better than all others.”

To add more context to my positivity, companies are already paying us to help integrate Dgraph in their systems – and this is all pre 1.0; so there’s definitely a need here.


(Dennis Walsh) #8

Thanks for your responses and I look forward to the numbers!