Dgraph v21.12: Zion - The Last City Standing - Dgraph Blog

We are excited to announce the Dgraph v21.12 Zion release. Zion has MAJOR performance optimizations, some of which were possible only by making deep changes to the core of the system. Let's have a look.

Queries are >20x Faster

Dgraph query latency is many times lower in this release. This is possible due to two longstanding tasks: 1. the introduction of Roaring Bitmaps, and 2. using the posting list cache. Both of these have been in the works for a couple of years, with multiple iterations to get them right.

For 1: We wrote Sroar (Serialized Roaring Bitmaps), our take on the standard roaring bitmap implementation. Sroar replaced our dependence on two things: 1. group varint encoding for storing uint64s, and 2. uint64 arrays in protocol buffers to send UIDs over the wire.

All the UIDs are now encoded in Sroar and are operated upon and shipped in the same format over the wire. Based on our benchmarks, this improves Dgraph’s performance by 10x for 50% of the queries and decreases memory usage by 10x under heavy workloads. To learn more about Sroar, read the blog post.
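To make this concrete, here is a small sketch of working with Sroar directly, assuming the public sroar Go package's NewBitmap, Set, Contains, ToBuffer, FromBuffer, and GetCardinality functions. The serialized buffer is the same representation that can be operated on and shipped over the wire, so there is no separate encode/decode step.

```go
package main

import (
	"fmt"

	"github.com/dgraph-io/sroar"
)

func main() {
	// Build a bitmap of UIDs; sroar keeps this in its serialized form internally.
	bm := sroar.NewBitmap()
	for uid := uint64(1); uid <= 100_000; uid += 2 {
		bm.Set(uid)
	}

	// ToBuffer returns bytes that can be shipped over the wire as-is.
	buf := bm.ToBuffer()
	fmt.Println("serialized size (bytes):", len(buf))

	// The receiver reconstructs the bitmap directly from the buffer and can
	// immediately run membership checks and set operations on it.
	remote := sroar.FromBuffer(buf)
	fmt.Println("cardinality:", remote.GetCardinality())
	fmt.Println("contains 41:", remote.Contains(41))
}
```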

For 2: Applying the Ristretto cache to multi-version posting lists, with readers reading at different timestamps, has been hard. We have attempted to marry caching with transactional correctness multiple times in the past, unsuccessfully.

It all changes with #7995. In this PR, we were able to achieve a transactionally correct cache implementation for posting lists. We do this by registering the key on its first read and then keeping track of the latest version (L1) associated with the key. A reader reading at a timestamp lower than the latest (R1 < L1) would NOT read or write the value in the cache. Only a reader with a timestamp >= latest (R2 >= L1) would read or write the cached value. This value would then serve all future readers with timestamp >= latest (Ri >= L1) until a new write happens (at L2), updating the latest timestamp (L2 > L1), at which point the cycle repeats.
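As a rough illustration of this gating logic (not Dgraph's actual implementation; the types and names below are hypothetical), a cache keyed by posting-list key could track the latest commit timestamp and only serve or accept values from readers at or above it:

```go
package cache

import "sync"

// entry tracks the latest commit timestamp (L) for a key and, if available,
// a cached posting list built by a reader at or after that timestamp.
type entry struct {
	latest uint64
	value  []byte
	valid  bool
}

type plCache struct {
	mu sync.Mutex
	m  map[string]*entry
}

func newPLCache() *plCache { return &plCache{m: map[string]*entry{}} }

// Get returns the cached value only for readers at readTs >= latest.
func (c *plCache) Get(key string, readTs uint64) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok || !e.valid || readTs < e.latest {
		return nil, false // R < L: this reader must go to disk and must not fill the cache.
	}
	return e.value, true
}

// Set stores a value built by a reader at readTs >= latest; older readers are rejected.
func (c *plCache) Set(key string, readTs uint64, val []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok {
		e = &entry{}
		c.m[key] = e
	}
	if readTs < e.latest {
		return
	}
	e.value, e.valid = val, true
}

// Update is called on every commit: it bumps latest (L2 > L1) and invalidates the
// cached value, so the cycle restarts with the next reader at or above L2.
func (c *plCache) Update(key string, commitTs uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok {
		e = &entry{}
		c.m[key] = e
	}
	if commitTs > e.latest {
		e.latest = commitTs
		e.valid = false
	}
}
```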

These two changes combined improved query latency by >20x for the 50th percentile and >100x for the 25th percentile of queries, under the heaviest workloads we have seen in Dgraph Cloud.

Note that while previous versions of Dgraph had a posting list cache flag, it was turned off by default. In Zion, this flag is set to use 50% of the cache by default; the remaining 50% is used by Badger.

Writes are 30x faster

Improving write performance is another problem we have wanted to solve. In Dgraph, all writes are serialized via Raft and then applied in order by a single goroutine in applyCh. This is important to ensure that the state as seen by the leader and the followers is consistent. By applying the changes in the same order, they all reach the same state without ever having to synchronize their state with each other.

However, this serialization slows down writes. Dgraph generates the index writes at the same time as the original write, so a single expensive write touching many keys and many indices can slow down the entire sequence.

In this release, we reimagined how to make this write process concurrent while still preserving the serialized order. We still have a serialized list of writes (transactions), but each transaction is now processed concurrently, with its results stored in a local skiplist.

The applyCh goroutine still processes these transactions serially, but now it only has to verify that the results of the transaction's writes are still "valid". If they are, it can move on, which reduces its role from executor to overseer. Validating a result is a much faster operation than executing the write. In the cases where the results are invalid, it falls back to re-running the transaction's writes to regenerate the skiplist.
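A rough sketch of this execute-concurrently, validate-serially pattern might look like the following. The types and state here are hypothetical stand-ins, not Dgraph's actual code: each transaction records what it read, and the overseer only re-executes it when a dependency has changed since it ran.

```go
package txn

// Txn is a hypothetical transaction consisting of key-value writes.
type Txn struct {
	Writes []Write
}

type Write struct {
	Key, Value string
}

// result is what concurrent execution produces: the writes (standing in for the
// per-transaction skiplist) plus the state the transaction observed.
type result struct {
	txn  *Txn
	skl  map[string]string
	deps map[string]string
}

// execute runs a transaction against the current state and records its reads,
// so the overseer can later check whether the result is still valid.
func execute(txn *Txn, state map[string]string) *result {
	r := &result{txn: txn, skl: map[string]string{}, deps: map[string]string{}}
	for _, w := range txn.Writes {
		r.deps[w.Key] = state[w.Key]
		r.skl[w.Key] = w.Value
	}
	return r
}

// oversee applies results in their serialized (Raft) order. It only re-executes a
// transaction when one of its dependencies changed after it ran concurrently.
func oversee(results []*result, state map[string]string) {
	for _, r := range results {
		valid := true
		for k, seen := range r.deps {
			if state[k] != seen {
				valid = false
				break
			}
		}
		if !valid {
			r = execute(r.txn, state) // Fall back: re-run to regenerate the skiplist.
		}
		for k, v := range r.skl {
			state[k] = v // In Dgraph, this is where skiplists are merged and handed to Badger.
		}
	}
}
```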

Finally, when transactions are committed, these skiplists are merged and handed over to Badger via a new API called HandoverSkiplist (#1696). This function allows the caller to skip Badger's write-ahead log and directly hand over a skiplist (memtable) to Badger. The skiplist then becomes part of Badger's compaction system and is written to disk as an SSTable. Once the skiplist is written, a callback provided by the caller is run, notifying the caller that the data has been successfully flushed to disk.

With these changes, we’re able to run transactions concurrently, improving 75th percentile latency for writes from 30ms down to 1ms (see #7694 and #7777).

With this change, we also got rid of ludicrous mode, which was riddled with bugs. Concurrent transactions provide the same performance as ludicrous mode while also providing strict transactional guarantees.

Full Snapshot Transfers are 3x faster

Dgraph does synchronous replication via the Raft protocol. Sometimes this requires sending snapshots over to another Dgraph instance. This could be because a new follower joined the group, a follower fell far behind the others in the group, or a predicate moved from one shard to another.

Badger provides a nice way to send snapshots via its Stream framework. It does so by splitting up the universe of keys into non-overlapping ranges, concurrently iterating over them, and generating batches of data to be transmitted over the wire.
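For reference, this is roughly how the Stream framework is driven, assuming Badger v3's NewStream/Send/Orchestrate API (the database path and batch handling below are illustrative only). The Send callback is where a batch would be put on the wire.

```go
package main

import (
	"context"
	"log"

	badger "github.com/dgraph-io/badger/v3"
	"github.com/dgraph-io/ristretto/z"
)

// streamSnapshot iterates the whole key space concurrently and invokes send for
// each serialized batch of key-values.
func streamSnapshot(db *badger.DB, send func(*z.Buffer) error) error {
	stream := db.NewStream()
	stream.NumGo = 16
	stream.LogPrefix = "snapshot"
	stream.Send = send
	return stream.Orchestrate(context.Background())
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = streamSnapshot(db, func(buf *z.Buffer) error {
		// In Dgraph, this batch would be shipped over the wire to the follower.
		log.Printf("would send %d bytes", len(buf.Bytes()))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```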

While this process is fast, it is also expensive. Reading keys requires disk access, decryption, decompression, and parsing of the data into key-value pairs before transmitting them over the network. The receiver then has to reverse this process. This spends expensive CPU cycles on both ends.

We made a change in Badger (see commit here) to allow sending entire SSTable files directly during snapshots. Sending and writing whole files requires very little work on either end. The sender can just read the file off disk and send it across. The receiver can write the file directly to disk and make it part of its MANIFEST. Neither has to do any of the data parsing steps mentioned above. If the file is encrypted, the encryption key is also sent along, to be added to the receiver's key manager.

For a compressed and encrypted Dgraph instance, which is the most expensive to send a snapshot over, this change made snapshot transfers 3x faster with almost negligible CPU usage.

Restores at 150 MBps — 4x faster

Dgraph restores now use a Map-reduce style approach. Backups are read and processed to generate map files. Those are then read to generate the final p directory. This is similar to how the bulk loader works. See #7664, #7666, #8038.

With these changes, the map phase can run at 450 MBps, and the reduce phase can run at 130 MBps. This makes restore 4x faster compared to the last release, while also being more memory efficient.

Moreover, Dgraph can now also do incremental restores (see #7942), which are very useful for reducing downtime during upgrades. The majority of the restore can be done while the source cluster is still live and updating. A way to benefit from this is to:

  1. Take a backup from a live cluster A
  2. Restore it to a new cluster B
  3. Once done, mark A as read-only
  4. Take a much smaller incremental backup from A
  5. Incremental restore to cluster B
  6. Move live traffic to cluster B and discard A

Faster Lambda Function Execution

Dgraph now forks multiple NodeJS dgraph-lambda processes and directs custom user JavaScript lambda code to them in a round-robin fashion. This avoids sending data over the network, or even out of the machine where Dgraph is running. Moreover, this mechanism of query distribution performs far better than using the NodeJS cluster module, providing thousands of QPS. See #7973.
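A simplified sketch of that round-robin dispatch follows. The types are hypothetical stand-ins; in practice each worker is a forked NodeJS dgraph-lambda process reachable over a local port rather than a function call.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// worker stands in for a forked NodeJS dgraph-lambda process listening locally.
type worker struct {
	addr string
}

func (w *worker) invoke(body string) string {
	// In Dgraph this would be a local HTTP call to the forked process.
	return fmt.Sprintf("%s handled %q", w.addr, body)
}

type pool struct {
	workers []*worker
	next    uint64
}

// dispatch picks the next worker in round-robin order; the atomic counter keeps
// concurrent queries from racing on the index.
func (p *pool) dispatch(body string) string {
	i := atomic.AddUint64(&p.next, 1) % uint64(len(p.workers))
	return p.workers[i].invoke(body)
}

func main() {
	p := &pool{workers: []*worker{{addr: ":20001"}, {addr: ":20002"}}}
	for i := 0; i < 4; i++ {
		fmt.Println(p.dispatch(fmt.Sprintf("lambda-call-%d", i)))
	}
}
```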

Writes can happen during predicate moves

Historically, Dgraph has paused writes to a predicate that is being moved from Shard 1 to Shard 2. With this release, a predicate move first does a bulk move without blocking writes. Once that is done, it blocks writes, does a much smaller incremental move, and unblocks writes. This reduces the write downtime associated with a predicate move from hours to seconds. See #7703.
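A rough sketch of this two-phase move is below. The shard type and copy helper are hypothetical stand-ins, not Dgraph's actual code; the point is that writes are only blocked for the small delta accumulated during the bulk phase.

```go
package main

import "fmt"

// shard is a hypothetical stand-in for a Dgraph group holding a predicate.
type shard struct {
	name   string
	maxTs  uint64
	frozen bool
}

// copyPredicate copies updates to pred in the timestamp range (since, until]
// from src to dst. In Dgraph this streams data between groups over the network.
func copyPredicate(pred string, src, dst *shard, since, until uint64) {
	fmt.Printf("copy %q from %s to %s for ts (%d, %d]\n", pred, src.name, dst.name, since, until)
}

func movePredicate(pred string, src, dst *shard) {
	// Phase 1: bulk copy while writes keep flowing into src.
	ts1 := src.maxTs
	copyPredicate(pred, src, dst, 0, ts1)

	// Phase 2: block writes only for the small delta that accumulated during phase 1.
	src.frozen = true
	copyPredicate(pred, src, dst, ts1, src.maxTs)
	// ...switch ownership of the predicate to dst, then unblock writes.
	src.frozen = false
}

func main() {
	movePredicate("name", &shard{name: "group-1", maxTs: 1000}, &shard{name: "group-2", maxTs: 1000})
}
```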

Forbid Massive Fan-outs

Certain keys in the graph suffer from a massive fan-out problem. These keys are typically index keys. For example, a certain string value might be a default value set to all the nodes in the graph. A reverse index on this value could point to millions of nodes in the graph, hence creating huge posting lists. Dgraph would split such a posting list across multiple keys, so as not to exceed Badger’s value limits and also allow partial reads of this index key.

A typical key like this would have dozens of splits. We noticed, however, that some keys have thousands of splits, which is possible when the fan-out is in the billions of nodes. A query using such a key would be slow at best, and at worst would crash the system by causing massive memory consumption or a massive CPU spike.

In v21.12, we have added a flag to forbid any key that has more than 1,000 splits by marking it forbidden in Badger. Once a key is forbidden, new data written to it is dropped and it always returns an empty result.

Almost all of the backends we have seen in Dgraph Cloud are unaffected by this change. But in the rare case that a user is affected, rewriting the query to use another key for the root function would fix the issue. We think this small downside is worth the upside of keeping the system stable and performant.

GraphQL enhancements

Dgraph now provides more enhancements to GraphQL-powered applications with updates to directives, auth, and language tag support.

The new @default directive can automatically set default field values for newly created or updated nodes. This is especially useful for DateTime fields, where you can set created and updated timestamps without needing to send them from the client. This previously required a Dgraph Lambda function or separate business logic; now it can be defined directly in the schema.

Types with @id fields can now have IDs that are updatable and optionally nullable. This gives you the flexibility to change IDs after nodes have been created, while still providing uniqueness with upserts. If an update would result in a duplicate ID, an error is returned.

Auth rules defined in GraphQL now apply to custom DQL queries as well. This means custom DQL queries can be executed while still being limited by the authorization rules defined in the GraphQL schema.

The new language tag support in GraphQL allows different language strings for the same field. These are defined as separate fields in the GraphQL schema, while utilizing DQL's efficient storage for language-tagged strings.

Take the Red Pill

The Dgraph Zion release has been an incredible effort by the team. A lot of work has gone into this release to make it the most performant ever. The features and optimizations mentioned above only cover a portion of the updates. For a full list, check out the release page on GitHub. Do try it out. Hope you like this release!


This is a companion discussion topic for the original entry at https://dgraph.io/blog/post/v2112-release/

Do we have an idea of when we will see the cloud version of this?

Thanks,

J


48 CPUs and 268 GiB of memory

Can you provide the DQL benchmark data for these tests? Because in my LDBC-based test, the situation is not so optimistic. The biggest improvement is 2x, but many queries are even slower.

If you need this data, I can upload it.

Thank you, but it’s still a very big improvement

I upgraded my very write-heavy production system and I would say from our observations, writes are not faster for our sustained input. Seems like bursts of input may be where the gains lie?

How about memory usage?

Forbid Massive Fan-outs
In v21.12, we have added a flag to forbid any key which has greater than 1000 splits by marking it forbidden in Badger. Once a key is forbidden, that key would continue to drop data and would always return an empty result.

We want to know how to handle the situation where 10,000,000 nodes have the same label. It's hard to avoid storing the label as a value without special treatment. In v21.12, how can we count the total number of nodes belonging to the same label?

Sincerely hope to get your answer!