Benchmarking GraphQL performance

We want to

  1. Measure GraphQL query/mutation performance and see if there are bits of code that we can improve. We would also like to see how much overhead query rewriting and processing add on top of the time Dgraph takes to process the query.
  2. Load test and get P50, P95, P99 latencies for query/mutation performance.

Measure individual query/mutation performance

  1. As a first step, we come up with 10-20 queries/mutations that would form the most common use cases. We can use the suggestions by Harshil. These are the kinds of queries we can look at:
    a) Queries

    • getType/queryType - single level
    • getType/queryType - multi level
    • getType/queryType - single and multi-level for interfaces
    • getType/queryType with auth - single and multi-level
    • getType/queryType along with custom fields

    b) Mutations

    • add/update mutations - single level with id/xid
    • add/update mutations - deep with id/xid
  2. These queries/mutations can be part of a benchmarking suite (we can use Go benchmarks for this) where each query/mutation would be run for 10-60 seconds. A minimal sketch of such a benchmark is shown after this list.

    • We can look at the memory/CPU profiles of Dgraph during these runs and figure out which bits are taking the most memory/time. Ideally these should be bits within Dgraph that are required to process the GraphQL+- query. If not, we should look at the slow parts and optimize them.
    • These benchmarking tests should also measure and tell us the ratio of time taken to pre and post-process the query against the total time to process the query. This would give us an idea of the overhead added to the query. We should aim to minimise this.
    • For auth queries, we should also measure the overhead added by auth rules compared to the same queries without auth.
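
As an illustration of the Go benchmark suite mentioned above, here is a minimal sketch of one such benchmark. The endpoint, type name, and query are assumptions for illustration; -benchtime controls how long each benchmark runs.

```go
package graphqlbench

import (
	"bytes"
	"net/http"
	"testing"
)

// Assumed address of the Alpha's GraphQL endpoint.
const graphqlEndpoint = "http://localhost:8080/graphql"

// BenchmarkQueryAuthorSingleLevel repeatedly runs a single-level queryType
// query. Run with e.g. `go test -bench=. -benchtime=30s` for a 30-second run.
func BenchmarkQueryAuthorSingleLevel(b *testing.B) {
	// Hypothetical Author type; replace with queries from the agreed list.
	body := []byte(`{"query": "query { queryAuthor { name } }"}`)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resp, err := http.Post(graphqlEndpoint, "application/json", bytes.NewReader(body))
		if err != nil {
			b.Fatal(err)
		}
		resp.Body.Close()
	}
}
```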

Load testing

vegeta (https://github.com/tsenart/vegeta), an HTTP load testing tool and library, can be used for doing this part of the testing.

  1. We should do load testing using concurrent clients to measure P50, P95, P99 latencies of the queries mentioned above. For this, we can choose 10, 100, 1000 concurrent clients and see how the latency changes.
  2. Along with the queries above, we should also test subscriptions to see how many concurrent subscriptions are supported and how they depend on the number of open file descriptors.
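
As a rough sketch, vegeta can also be driven as a Go library, which would let us collect the P50/P95/P99 latencies programmatically. The endpoint, rate, and query below are assumptions:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// Assumed Alpha endpoint and an example query payload.
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "POST",
		URL:    "http://localhost:8080/graphql",
		Body:   []byte(`{"query": "query { queryAuthor { name } }"}`),
		Header: http.Header{"Content-Type": []string{"application/json"}},
	})
	rate := vegeta.Rate{Freq: 100, Per: time.Second} // 100 requests per second
	attacker := vegeta.NewAttacker()

	var metrics vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, 30*time.Second, "graphql-load") {
		metrics.Add(res)
	}
	metrics.Close()

	fmt.Printf("p50=%s p95=%s p99=%s\n",
		metrics.Latencies.P50, metrics.Latencies.P95, metrics.Latencies.P99)
}
```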

Once we get some initial latency/throughput numbers from our load testing, then we should find a way to store them and compare them across releases.

Hi Pawan
Great post! I think the list covers the different queries and mutations we would want to test. I have a few things we could probably add:

  1. Test running a query and a mutation at the same time and monitor performance
  2. Running multiple threads of random queries/mutations from different users in parallel

Randomness in queries would help us replicate a near-production scenario.
The dataset used should also be a factor, so maybe we could split the tests to run on two different datasets, one with medium and one with heavy data.
We should also run a soak test for 24-48 hours to see if we hit issues like memory leaks.

I would prefer the load testing tool to be Locust because it supports scripting in Python and also provides an easy UI for running the tests. It would be easy to integrate Locust into the CI and run it against the different datasets we have. We could use distributed testing with cloud resources and have a database that saves the reports for each run. We could then easily compare them across releases. This would require our load testing framework to be deployed on the same cluster as our GraphQL server.

Let me know what you think!

I want to make these two points:

For me, the ‘final result’ should pretty much be something that we can run (say prior to every release, whenever we like, or automated on every PR that mentions ‘graphql’) and keep track of that performance. …where that data gets stored pretty much has to be Slash GraphQL for me.

When I say ‘final result’ I think I pretty much mean ‘first result’. I’d hate for our effort to get lost, or not be visible. Otherwise, we spend effort on this, but over time that effort is degraded by future fixes, and/or doesn’t give us insight into what to do next. I also expect that the load testing (now that we have some instrumentation to measure internally, thanks to what @JatinDevDG just built) will give us a guide of where to direct efforts.

It also looks great if we can say: this is what it looked like at the start, then after the first sprint it looks like this, then like this, and now we have these priorities going forward, etc. That’s way better than ‘we did some work and it’s faster now, and we’ll now build something that’ll show you what that looks like for a customer or across releases’.

So I think I’d rather see the long term visibility first, then we pick off the high priority items in turn.

Thoughts?


We could include a test which runs query and mutations at the same time and use that for load testing. I am not so sure about random queries/mutations because we can’t benchmark that across releases as the latency would depend on the queries/mutations and their ratios.

I think these tests are better run against Dgraph directly using GraphQL+- over gRPC. Our aim is to validate that GraphQL itself does not add a lot of overhead on top of the performance that Dgraph already gives. They can be run using GraphQL but I’ll keep them at a lower priority for now.

Sure, we can look into using Locust.

I agree the final result should be something that we can run and compare the performance of across releases or PRs.

The part I am not sure of is how often we need to do it and whether we should store it in a database. In the past, we have followed the approach of storing benchmark results in a README file and comparing them over releases, something like Query benchmarks. This is a straightforward and simple approach. We also have https://godoc.org/golang.org/x/tools/cmd/benchcmp which can take two benchmark outputs and tell us the percentage difference between them. These Go benchmarks can be run for a release or for PRs which make significant changes to the code, like say the PR which fixed the OOM issues.

Alternatively, say we go down the path of storing the results of these benchmarks in Slash GraphQL: the storing part is easy. Retrieving those results, charting them over time and making sense of the raw data would still have to be done separately and would take some time to build. I believe our work would give us more value if we spent our time doing more testing and benchmarking over the next 3 weeks.

Fair enough. Let’s have a look at a work plan for that.

Listing the individual tasks which would be converted to JIRA tickets:

  • getType/queryType single and multi-level queries

Also to be tested with interfaces. These would be written as Go benchmarks which would make an HTTP call to an Alpha running on the same instance. We should be measuring the following things after coming up with an appropriate set of queries.

a) CPU and memory profile to see which are the most CPU/memory intensive bits. They should not be the code that we have in GraphQL.

b) Measuring the ratio of the time pre- and post-processing takes vs the time taken to execute the GraphQL+- query.

c) Doing load testing using Locust or vegeta to measure the p50, p95 and p99 latencies for these queries with an increasing number of clients (10, 50, 100, 500, 1000). This might require bringing up more machines.
@JatinDevDG would be a good candidate for this and can learn a lot around doing benchmarking in Go as well.
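
For point (a), assuming the Alpha exposes Go's standard net/http/pprof endpoints on its HTTP port (the address below is an assumption), a small helper in the benchmark harness could capture profiles while the benchmarks run:

```go
package graphqlbench

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// captureCPUProfile fetches a CPU profile of the given duration from the
// Alpha's pprof endpoint and writes it to a file that can later be inspected
// with `go tool pprof`.
func captureCPUProfile(seconds int, outFile string) error {
	url := fmt.Sprintf("http://localhost:8080/debug/pprof/profile?seconds=%d", seconds)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(outFile)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}
```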

  • Auth queries - single and multi-level queries.

We would measure things similar to what we have above (points a, b and c), except that we should also see how much time the evaluation of auth rules takes. So we should also measure and compare the time taken to execute a query with auth vs the same query without an auth rule. Here, we want to be able to see if we can rewrite the auth rules to be more performant. @arijit can probably take this up. Feel free to add any more ideas that you might have.

  • custom fields

The aim here is to see if we can parallelize our execution of custom fields or make it more efficient by computing some bits at schema update time instead of query execution time. We’ll again be measuring similar things, taking the CPU/memory profiles and checking which bits are taking the most time. We would also like to do load testing and verify that we are not creating too many HTTP clients for different requests. @abhimanyusinghgaur would be a good candidate for this.
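
On the last point, the kind of pattern the profiles should confirm is a single shared HTTP client rather than one client per request. A minimal sketch of that idea (all names are hypothetical, not the actual resolver code):

```go
package resolve

import (
	"net/http"
	"time"
)

// customFieldClient is shared by all custom-field resolutions so that
// connections are pooled instead of a new client being created per request.
var customFieldClient = &http.Client{
	Timeout: 10 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	},
}

// resolveCustomField calls the remote endpoint for one custom field using
// the shared client.
func resolveCustomField(url string) (*http.Response, error) {
	return customFieldClient.Get(url)
}
```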

  • Mutations

a) add/update mutations - single level with id/xid
b) add/update mutations - deep with id/xid

We would again check the CPU/memory profile and the percentage of time taken for pre- and post-processing vs the actual mutation. We should also do load testing to measure the latencies with an increasing number of clients. I (@pawan or @vardhanapoorv ) can take this up.
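
For illustration, with a hypothetical Author/Post schema, the single-level and deep add mutations the benchmarks would send could look like this (kept as Go constants in the benchmark suite):

```go
package graphqlbench

// Single level: add an author linking to an existing post by its ID.
const addAuthorSingleLevel = `mutation {
  addAuthor(input: [{ name: "Alice", posts: [{ id: "0x1" }] }]) {
    author { id name }
  }
}`

// Deep: add an author and create the nested posts in the same mutation.
const addAuthorDeep = `mutation {
  addAuthor(input: [{ name: "Bob", posts: [{ title: "T1" }, { title: "T2" }] }]) {
    author { id posts { id } }
  }
}`
```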


Also, for mutations and queries we should measure JSON serialization and deserialization time for larger payloads. Some libraries support parallel encoding, which is 1.5x to 2x faster than the encoding/json package.
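
To have a baseline to compare such libraries against, a small benchmark of encoding/json on a larger payload could look like this (the payload shape is made up):

```go
package graphqlbench

import (
	"encoding/json"
	"testing"
)

type post struct {
	ID    string   `json:"id"`
	Title string   `json:"title"`
	Tags  []string `json:"tags"`
}

// BenchmarkJSONMarshalLargePayload measures encoding a ~10k element response
// with the standard library, as a reference point for alternative encoders.
func BenchmarkJSONMarshalLargePayload(b *testing.B) {
	posts := make([]post, 10000)
	for i := range posts {
		posts[i] = post{ID: "0x1", Title: "some title", Tags: []string{"a", "b", "c"}}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(posts); err != nil {
			b.Fatal(err)
		}
	}
}
```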


Since we process requests in parallel, we should run the tests with Go’s race detector (e.g. go test -race).


I agree, we’d not be able to benchmark that across releases. I think randomness would facilitate mimicking production usage. Do you see value in running it once in a while without benchmarking?

Makes sense.

I’m not sure how much time I can get outside Slash but would be great if I could shadow/help-out @JatinDevDG wherever possible.


Maybe we can also include schema-related things here, like what you suggested today. That might also help us answer questions like: what are the limits for a GraphQL schema?


You are missing the mutation/delete use case as well. I know bulk deletes are problematic in the current GraphQL implementation. It would be good to benchmark those so we can document improvement over time.

Other queries I would recommend adding:

  • query using geolocation filtering for distance and within bounds
  • query using the different string hashes
  • query Lambdas to go along with your custom fields
  • mutation Lambdas for insert, update, and delete

The tests should be done against DGraphCloud to handle all scaling since that is your flagship offering now for auto-scaling.