Improving performance and scaling for a key-value workload



I’m load-testing a project that is expected to be relatively high-volume and high-throughput, and we’re seeing pretty severe performance degradation in Dgraph as the data size grows moderately large.

Our usage of Dgraph is primarily as a key-value document store, where each node is assigned a unique (indexed) key and has a blob of data associated with it. The data isn’t huge (a few KB on average), and based on some research [1] I’ve been under the impression that this should be no problem. Each document also potentially has a list of references to other documents, as well as some other auxiliary node types, which is where the graph comes in – but this isn’t immediately relevant, since as part of my investigation I’ve stripped out all these elements and performance hasn’t dramatically changed.

We are running a three-replica setup in Kubernetes based on (and almost identical to) the example [2] config. Below are the results of loading roughly 1.1m documents and 1.3m other nodes (around 10GB of source data). The trends are not very encouraging, with commits in particular regularly taking several seconds.

I’m trying to determine whether the cause is a) poor design/query structure; b) resource constraints (i.e. we just need more/better hardware); or c) limitations of Dgraph itself. I was hoping to get some guidance, especially on the last point, as I’m not really sure what I should expect as a baseline for this use case.

As part of investigating this, I extracted a reproducer [3] with an extremely simplified version of our project.

The test program does the following. Given the following schema:

docKey: string @index(hash) @upsert .
body: string .
type Doc {
  docKey: string
  body: string
}
  1. Start a transaction
  2. Try to fetch a random node:
query Doc($key: string) {
  q(func: eq(docKey, $key)) {
    uid
    body
  }
}
  3. Mutate the body of the node (creating it if necessary)
  4. Commit the transaction
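
For reference, the four steps above amount to roughly the following loop. This is a minimal Python sketch, not the actual reproducer: `InMemoryTxn` is a hypothetical in-memory stand-in for a Dgraph transaction (it is not a real client API), and the names `timed`, `upsert_once`, and `run` are made up for illustration. It times the query, mutate, and commit phases separately, which is one way to see which phase accounts for the slowdown.

```python
import random
import time
from collections import defaultdict


class InMemoryTxn:
    """Hypothetical stand-in for a Dgraph transaction (NOT a real client API)."""

    def __init__(self, store):
        self._store = store   # shared dict: docKey -> body
        self._pending = {}    # writes buffered until commit

    def query(self, key):
        # Plays the role of eq(docKey, $key): the body, or None if absent.
        return self._store.get(key)

    def mutate(self, key, body):
        self._pending[key] = body

    def commit(self):
        self._store.update(self._pending)
        self._pending.clear()


def timed(timings, phase, fn, *args):
    """Call fn(*args) and record its wall-clock duration under `phase`."""
    start = time.perf_counter()
    result = fn(*args)
    timings[phase].append(time.perf_counter() - start)
    return result


def upsert_once(new_txn, key, body, timings):
    """One read-modify-write cycle; returns True if the node was created."""
    txn = new_txn()                                     # start a transaction
    existing = timed(timings, "query", txn.query, key)  # try to fetch the node
    timed(timings, "mutate", txn.mutate, key, body)     # set the body
    timed(timings, "commit", txn.commit)                # commit
    return existing is None


def run(iterations, key_space, seed=0):
    """Upsert random keys; return the store, per-phase timings, and create count."""
    rng = random.Random(seed)
    store = {}
    timings = defaultdict(list)
    created = 0
    for i in range(iterations):
        key = f"doc-{rng.randrange(key_space)}"
        if upsert_once(lambda: InMemoryTxn(store), key, f"body-{i}", timings):
            created += 1
    return store, timings, created
```

Swapping `InMemoryTxn` for a wrapper around a real client transaction would leave the timing loop unchanged.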

Running a three-node cluster locally, I see similar degradation – from about 150ms per transaction at the start to over 650ms when approaching 1m nodes. Are these numbers in line with expectations for how this sort of primary-key-indexed read/write scales? Are there things we can try to improve it? Or is Dgraph not suited for this style of operation?
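
To put a number on that growth, one option is to group the per-transaction latencies by how many nodes existed at the time and compare medians. A minimal sketch, assuming the samples are recorded as `(node_count, latency_ms)` pairs; the function name `median_by_bucket` and that input shape are my own choices for illustration, not something the reproducer defines.

```python
from statistics import median


def median_by_bucket(samples, bucket_size):
    """Group (node_count, latency_ms) pairs into node-count buckets.

    Returns a dict mapping each bucket's starting node count to the
    median latency observed in that bucket, ordered by bucket start.
    """
    buckets = {}
    for node_count, latency_ms in samples:
        start = (node_count // bucket_size) * bucket_size
        buckets.setdefault(start, []).append(latency_ms)
    return {start: median(vals) for start, vals in sorted(buckets.items())}
```

Medians are less sensitive than means to the occasional multi-second commit, so the underlying trend is easier to see.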

Thanks for any assistance!

I tried to include links, but the forum software won’t let me and has defeated my attempts to work around it, so you will have to search for these:

[1]: the forum post ‘Storing blobs in Dgraph’
[2]: the ‘dgraph-ha’ YAML file in the dgraph repo
[3]: the GitHub project ‘jfancher/dgraph-perf’