Dgraph can't horizontally scale write speeds

From what I understand, for any write (txn or otherwise),

  • alpha leader needs to get new UID from zero (if new data is added and not just an update)
  • change gets proposed to the alpha group and consensus needs to be reached i.e majority of the alphas need to acknowledge the write
  • then it needs to go to zero to mark the write/txn as complete
  • zero makes sure changes are consistent and issues timestamps, marking the change as complete
  • each of the alpha move txn from pre-write state to committed state.
  • Also, every write to same predicate HAS to go through same alpha group.

What all this means is that write speeds can’t be horizontally scaled.
For example, if I increase number of alphas in a group, (or increase number of zeroes in the cluster) it will not improve write speeds.

So is the below statement correct?
in dgraph, given a alpha group for predicateX, write speeds can’t be horizontally scaled for predicateX, by adding more alphas to the group it belongs to. (or by adding more zeroes to the cluster)

(Read speeds can be horizontally scaled by using best-effort txns)

@vikram

I don’t know the exact internals, but I made the following observation:

A single node instance is kinda okay speed wise. Relative to other graph systems queries a bit faster, but nothing terribly impressive. Mixed read/write workload does a bit better, but I wouldn’t trade in my existing system for that.

A six node instance, however, with a load balancer in front of it is just insanely fast that, no matter what I was throwing at it, I got always nano-second response time. I don’t even know how to explain that, but for some reason, it is insanely fast and substantially faster than other setups. A different graph DB delivered around ~50 - 100 ms write latency, yet another graph DB with a very awkward chosen storage engine delivered 150 ms - 300 ms write latency, but the Dgraph 6 node cluster was consistently below 1 ms with an average of about 0.3 ms on write op’s. Read queries were substantially faster. Memory footprint is about 5 - 10x lower, tho.

Can write operations scale horizontally?

I don’t know the answer to that question, but I would certainly welcome some insights from the core team because that is an issue very close to my heart as well. Also, I would actually love to know how it is even possible to keep latency consistently below 1 ms because that’s just completely nuts. No other graph DB that I have evaluated came remotely close to that ballpark.

I had and still have a fair amount of trouble integrating DGraph into our existing systems and there are still a number of open issues, but the performance I have seen so far warrants a very closer look. Scaling writes op’s relative to the number of nodes, however, is something I have to test next.

What have you measured?

In my setup,
Server => I had one zero and one alpha running locally on my macbook pro. lru was 2048 MB
Client => dgraph’s dgo client
Txn => 88,000 upserts performed sequentially.
=> each tnx searched for V1 and V2, then added edge between the two V1 -> V2

This took ~22 minutes!

I ran same tests on Janugraph, backed by single node cassandra, tests finished in ~140s!!

I plan to setup dgraph cluster with multiple zeroes and alpha, and run load test on it using gatling tool, i.e http client and dgo client.

In your setup

  • How many zeroes did you have?
  • What was the replica size?
  • lru size?
  • you had spinning disk or SSD?

@vikram

My single node setup was about the same as yours, one alpha, one zero with default config on a HP Spectre.

1 TB SSD, 16GB memory, i7 quad-core. I think LRU was set to 2GB as well. Pretty much standard I would say.

Cluster setup was the docker-compose with 3 alphas, 3 zeroes + ratel + ngnix load balancer dispatching each request to one of the three alphas.
The low latency was measured on many single transaction of mixed read & write, just as your typical web application would do so no batching nor sequential upserts.

Sequential inserts seem to cause some problems in DGraph, one way or the other, I think there are a number of reports very similar to yours. Also, you have to be mindful of memory configuration and some really weirdo stuff on sustained writes. If this issue is correct, you can expect about 10k online transactions per sec so technically your insert should take no more than 10 - 20 seconds, plus some overhead for adding edges, which raises some really, really hard questions.

=> each tnx searched for V1 and V2, then added edge between the two V1 -> V2
This took ~22 minutes!

On just 88k nodes? This is unreal, really, and I don’t think I have ever seen that before.

A few days ago, I did an evaluation of ArrangoDB, and given your node is sufficiently powerful, it runs relatively fast and scales nicely and their cluster setup is among the fastest I have done in quite a while. The enterprise version is free, but you need to claim your key and add it to the deployment.yaml. They claim some substantial performance gains over Janus and can handle 1Gb/sec write load, given a big enough cluster size, so I would do a comparison.

That said for my cluster deployment, I ran into some problems, reported a number of issued, asked in the forum a few times, got told I should solve the missing security myself, emailed DGraph twice, never got a reply, and you know, meanwhile I have given up on it. They have some good ideas, and I tried hard to make it work, but for us, it just doesn’t work in practice.

Can you share what Dgraph settings you used to reach that conclusion?

Can you share this too? so we can evaluate the contexts.

Also, you can use https://docs.dgraph.io/deploy/#tls-configuration and force auth by cert
--tls_client_auth= REQUIREANDVERIFY Also, certificates can be created to accept only certain IPs or URLs.

But the recommendation that I gave to you still up. It is safer to isolate it from users behind an API. Certs are an additional layer of protection, which fits into your previous context.

Here’s the my machine spec,

  • Macbook pro 13" 2017 model
  • 2.3 GHz Intel core i5
  • 16GB 2333 MHz LPDDR3 RAM

Data set

How I started dgraph

$docker run -d --rm -p 5080:5080 -p 6080:6080 -p 8080:8080 -p 9080:9080 -p 8000:8000 -v ~/dgraph:/dgraph --name dgraph dgraph/dgraph dgraph zero 
$docker exec -d dgraph dgraph alpha --lru_mb 2048 --zero localhost:5080 
$docker exec -d dgraph dgraph-ratel 

Here’s how I call the load code

func loadAllData(client *dgo.Dgraph, ctx context.Context){
        // skipped few lines of code here, but below the important bit
        start := time.Now()
	for i,d := range data{
		loadData(client, ctx, d[0], d[1])
		if i%100 == 0 {
			log.Printf("Added %d records\n", i)
		}
	}
	log.Printf("Loading took %s", time.Since(start))
}

Here’s the source code

package main

import (
	"context"
	"fmt"
	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"log"
)

func createSchema(client *dgo.Dgraph, ctx context.Context) {
	op := getSchema()
	err := client.Alter(ctx, op)
	if err != nil {
		log.Fatal(err)
	}
}

func getSchema() *api.Operation {
	op := &api.Operation{}
	op.Schema = `
# Types
type Node {
	id: int
	friend: [Node]
}

#index
id: int @index(int) .
friend: [uid] @count .
`
	return op
}

func loadData(client *dgo.Dgraph, ctx context.Context, id1 string, id2 string) {
	mu := &api.Mutation{
		SetNquads: []byte(prepareMutationQuery(id1, id2)),
	}
	req := &api.Request{
		Query:     prepareReadQuery(id1, id2),
		Mutations: []*api.Mutation{mu},
		CommitNow: true,
	}

	if _, err := client.NewTxn().Do(ctx, req); err != nil {
		log.Fatal(err)
	}
}

func prepareReadQuery(id1 string, id2 string) string {
	return fmt.Sprintf(`
  query {
    var(func: eq(id, %s)) {
      id1 as uid
    }
    var(func: eq(id, %s)) {
      id2 as uid
    }
  }`, id1, id2)
}

func prepareMutationQuery(id1 string, id2 string) string {
	return fmt.Sprintf(`
      uid(id1) <id> "%s" .
      uid(id2) <id> "%s" .
      uid(id1) <friend> uid(id2) .`, id1, id2)
}

When I run docker stat,
I see that CPU usage going upto 140%, and constantly staying above 118%

@MichelDiz

I sincerely apologize.

It turned out, the actual root cause of virtually all the integration troubles I had came from a strange issue in the default GKE ingress controller. Once I replaced the default with a customized Kong instance, things worked out of the box.

This one was absolutely my mistake because I didn’t drilled down enough into the internals of the standard GKE ingress controller until very recently just to figure out that my configuration wasn’t exactly well covered, so I replaced it with Kong and the troubles were gone.

With all that settled, I’m still positively surprised by the low latency DGraph shows in the HA config.

@MichelDiz You had time to look into this?

Over the weekend I couldn’t take a look. I’ll take a look this week.

But here are a few points before I dig deeper.

A single Alpha (+ zero) is not horizontal scaling. If you follow what @marvin-hansen said (regarding load balancing) you will be surprised.

About Janugraph (or Janusgraph?), I don’t know how it works. But the last time users compared other platforms with Dgraph. I noticed that the approach of other competitors is fundamentally different from Dgraph design. It would not be possible to compare directly, only abstractly. Maybe it’s a reality with Janugraph too. However, we analyze these facts and information so it helps the team to understand what needs to be improved. So keep doing benchmarks and comparisons.

4k vertices are not that much. 88k of operations should also not be a problem. Have you looked if your Docker is limited? processing, amount of RAM, disk access, etc.

Anyway, if the topic is about horizontal scalability. You need to test it accordingly in this direction. A single instance within a docker container (which may suffer limitations) is not a good benchmark.

Cheers.

@MichelDiz,

My comments about horizontal scaling still hold I think, I will do tests over HA setup to test my theory.
The fact that a given predicate always goes to same alpha group, and that every write needs majority consensus means, writes for same predicate can’t be scaled.

The reason for me to open this discussion thread was to understand, in theory, how dgraph can horizontally scale writes to same predicate. Could you please explain about, how and if it can be scaled?

About testing with single alpha + zero, I think it went off topic. I will open new discussion post if needed to further discuss this. I shared these numbers as @marvin-hansen was interested in what numbers I had. The janusgraph setup was same, with single janusgraph+cassandra +es setup. And when I ran dgraph, there were no limits set on docker containers. But yeah I will do these tests on HA setup, as HA is what gets deployed in production and not single alpha+zero.

That’s right. The write speeds don’t get better by adding more Alphas to the same group. They are replicas, so the same data has to be copied (done via Raft proposals).

Instead, you should try to:

  1. Do more upserts concurrently.
  2. Do more writes per upsert (if possible).
  3. If V1 and V2 dont’ already exist, other DBs make you create them first, then connect them. Dgraph could do all of that in one mutation (create and connect) and avoid doing upserts. That might or might not be applicable in your case.

@mrjn, Thanks for the clarification

I understand point 2 and 3,
I have few clarifications about 1,

Let’s say I have predicate xid on person.
If I want to add lot of person vertices, concurrent upserts will help because

  1. There is no global lock. So all txns proceed without blocking each other.
  2. Txns are resolved by zero, to either commit or abort (abort if txns are conflicting, proceed otherwise)

Is this the reason as to why concurrent upserts would scale well?

Yeah, no global lock. Queries / mutations can all work concurrently. 2 is correct as well.

@vikram

Just adding more alphas isn’t doing terrible much, and I guess you already figured that out. You have to add a load balancer to dispatch between alphas and then you measure better performance whenever you increase concurrent workload. At least that is what I see in my cluster.

Txn => 88,000 upserts performed sequentially.
=> each tnx searched for V1 and V2, then added edge between the two V1 -> V2

Are V1 & V2 indexed?
Are you using the has operator or do you use of equality operations when adding edges?

It might be possible that the search operation does something equivalent to a full table scan on non-indexed data and that would explain a lot and is most likely unrelated to scaling.

When I tested above scenario, I didn’t do concurrent upserts.
Which means neither adding alpha nor LB would have helped.
I did some more tests with sequential upserts vs concurrent upersts

  • janusgraph (backed by cassandra) has better sequential performance
  • dgraph easily beats janusgraph with concurrent upserts.

dgraph can handle really high (?) concurrent write qps, once we reach these limits, adding more alphas, even with LB infront will NOT increase the write qps.

However once predicate splitting is supported (Splitting predicates into multiple groups), we might be able to scale write speeds by splitting same predicate into multiple groups. But this could introduce new bottleneck by increasing network calls between alpha groups to serve same predicate.

I said might because it depends on how predicate splitting feature is done and increased network calls. @mrjn

Would you please elaborate your reasoning for your conclusion that writing isn’t actually scaling with nodes?

I haven’t had time to do pressure testing to hit whatever limits might exists, but DGraphs concurrent write throughput, from my observation, is pretty good. At least 100X better compared to what we had before, so in this department, I really have nothing to complaint about.

However, I would love to know why the load-balanced solution works so much faster and better than then un-balanced HA deployment?

TLDR:

  • writes need to go through raft consensus process
  • given alpha in a group can NOT act/decide/do anything independently or on its own

Here’s the detailed explanation

For the purpose of this discussion we are assuming we will not make any change to the hardware, i.e we are not vertically scaling.

Let’s assume the following

  • alpha group1 has N alphas, and stores predicate P, on a given hardware
  • and we max out at W writes per second with this setup.

With this, let’s say we add a n alphas to the same group, i.e we end up with N+n alphas in group1
After this change is done, writes per second will not increase, i.e W will not increase.

How writes happen

  • each alphas group is a raft group with a 1 leader and (N+n-1) followers
  • each of the alphas store copy of the same predicate P, i.e they are all replicas
  • to complete a given write w, a proposer needs to propose it to the rest of the alphas in the same group
  • this proposal needs to be replicated on majority of the alphas, (this is the raft consensus algorithm)
  • i.e (N+n)/2 + 1 alphas need to ack the writes

What does all this mean?

  • any/all writes needs to through raft process to complete the writes
  • simply adding new alpha will only increase number of followers, who need to ack writes

@vikram

Where do you get the (N+n)/2 + 1 from?

By definition, RAFT only requires 2n + 1 to cope with n failing nodes so how do y

For the scaling, after having read the somewhat updated design / concept docs, I noticed that:

  1. Lockless transactions don’t need distributed lock management, and I believe, this explains the very high throughput I see when load balancing.

  2. Sequential Consistency preservers total order, as opposed to the weaker partial order, and thus gives much stronger concurrency guarantees than, say, causal consistency. That seems to be a direct consequence from the chosen timestamp isolation level.

  3. Snapshot Isolation with write conflict resolution either aborts or commits a write transaction, which again, ensures cluster-wide consistency.

Based on these few bits, I can’t really determine why, by design or theory, write op’s can’t scale write speeds with the number of nodes assuming a fast network is given to handle the write/ops and raft overhead.

There aren’t terribly many alternatives for RAFT. The most useful I have found is called MultiRaft and somewhat brings order by partitioning RAFT groups to reduce cluster internal overhead requried to scale further. And it seems to work because MultiRaft has eventually merged into the storage engine of another DB

However, what realy bugs me though, the underlying badger can do anywhere between 70k to 135k Put, ops/sec and up to 200k Get, ops/sec, but we don’t see anything like this in DGraph.

According to RAFT, majority need to ack writes, and one of them must be leader

If you see my example, I have assumed
Alpha group has N nodes. and n new nodes are added.

Let’s say N = 3
(N / 2) + 1 = 2 (assuming integer division) i.e writes needs to be acked by at least 2, including the leader.

Lets say I add 2 nodes to group 1, where n = 2.
Majority would be (N + n) /2 + 1 = (3 + 2) / 2 + 1 = 3. i.e writes needs to be acked by one extra alpha, including leader

If total number of nodes in a group is (N+n), then majority is (N+n)/2 + 1

So irrespective of number nodes that get added to alpha group, majority need to ack the writes, including leader, which will become bottleneck for horizontal scaling. i.e leader will need to be vertically scaled, since every writes must go through the leader