CPU unutilised by dgraph alpha

What I did

I am running Dgraph alpha on a c5ad.8xlarge AWS instance with 16 CPUs / 32 vCPUs, 64 GB RAM and a 20 Gbit network. I am hammering it with concurrent queries to see how many concurrent clients the node can handle before saturating. I expected CPU usage to hit 100% as I increase the number of concurrent clients, but I have found that I can only ever get it to use 22 CPUs.

The same happens on a c5ad.16xlarge instance with 32 CPUs / 64 vCPUs and 128 GB RAM: only 22 of its CPUs are ever used.

Question: Can you explain what prevents Dgraph alpha from using all available CPU cores?

The network is not the bottleneck, as it has an order of magnitude more capacity than what is transferred over the wire. Response gRPC messages are in the MBs, and I have perf-tested the link: it can transfer 20 Gbit per second.

The local SSD that stores the dataset is not the bottleneck either, since the entire dataset fits into the OS file cache; I have measured no disk I/O.

Dgraph metadata

dgraph version

Dgraph version : v21.03.0
Dgraph codename : rocket
Dgraph SHA-256 : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1 : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph, visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud, visit https://dgraph.io/cloud.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2021 Dgraph Labs, Inc.

Not sure; try adding more Alphas on the same machine. If that doesn’t help, maybe there is a limit somewhere in the core code.

Right, more alphas on a node would be a workaround to utilize all CPUs.

Still, might be interesting for core code devs to look into.

Just a tip: badgerdb (the underlying embedded database) has a good number of tunables, accessible via the badger superflag of dgraph. These include compactor and goroutine counts, and may help you use all of your CPUs more effectively.


That sounds like a promising route to go, I’ll give that a try.

Yep, those tunables deserve deeper docs and tests of real scenarios.


@Sagar_Choudhary this might be relevant to your High memory utilization on alpha node (use of memory cache) post, since your nodes have 32 vCPUs. My experience with v21.03 is that at most 22 of those will be used by Dgraph. Have you made similar observations?

@EnricoMi You might want to try increasing the value of GOMAXPROCS here. We’re explicitly setting GOMAXPROCS to 128, which means Go will use at most 128 cores. The default value of this variable is the number of cores available.

See the `runtime` package documentation on pkg.go.dev.

Thanks for the hint. But how does the 128 explain that only 22 CPUs are used? Shouldn’t we see a limit at 128 CPUs instead?

@EnricoMi I am not really sure but you could unset that value and see if that changes the CPU usage.

How do you set this? I don’t see a reference in the CLI Docs: Dgraph CLI Reference - Deploy

GOMAXPROCS is an environment variable. So it depends on your deployment technique.

It looks hard-coded at 128 for dgraph. Any way to confirm this? I’ve got 80 cores assigned to a dgraph container and it’s only using about 15–30% CPU.

Right, just checked, and indeed GOMAXPROCS is hard-coded to 128, which overrides any environment variable. That line has been in place since 2017; my guess is that Manish (the author) thought 128 cores would be plenty (headroom for additional threading to be implemented in the future).

My “educated guess” is that for an alpha (with no badger tuning defined) there are only ever ~20 threads that get created to handle various tasks that it has to perform.

My suggestion: try changing some `--badger` superflag options (e.g. `dgraph alpha --badger "numgoroutines=16"`), particularly this one:

numgoroutines=8; The number of goroutines to use in badger.Stream.

You may find that increasing this (and perhaps others) has the effect of higher cpu utilization when the alpha is under load.

I already increased numgoroutines to 32. No major throughput difference. I’m submitting a lot of mutations, which may be part of the issue; I posted about that here: Upper Transaction Mutation Limit - Concurrent Mutation Processing?

Have you tried running dgraph/dgraph:v21.12.0 to see if there’s any difference? Not saying that’s a solution, but it would bring more insight into the issue. The changes you referred to in the linked post, along with a new bitmap scheme (“roaring bitmaps”), definitely brought some performance improvements in v21.12, but…

…As you probably are aware, ‘Zion’ (the v21 release & branch) was abandoned for reasons too numerous to go into here.

I have not tried that version to tackle the problem, but may be worthwhile for insights to the community.

I recall there being some incompatibility issues between v21.12 and the current baseline. Would I have to ingest data from scratch on the v21.12 line, or do you think I could import an export from the current 23.x line into v21.12?

For any perf issue it is best to first figure out what the limit actually is. Changing settings can work or be informative, but it is somewhat shooting in the dark.

Typically, low CPU means that any DB server is limited some other way:

  • IO is maxed out (check Linux iostat output)
  • there is lock or mutex contention (check golang block profiles)
  • the load driver is not sending enough load to the server (client logging, go trace, Dgraph’s pending-query metric, etc. may help confirm there are actually as many concurrent requests as you think)
  • queries are blocked on large or slow mutations; the applyCh will back up, and OpenTelemetry (“Jaeger”) traces will show query “processTask” spans blocked between “waiting for startTS” and “Done waiting for MaxAssigned”, which means the transaction timestamp is not yet fully applied and queryable, causing CPU-idle waits

(Note: “applyCh backing up” means the Prometheus metric dgraph_raft_applych_size is higher than low single digits. This metric reports how many entries are sitting in the Go apply channel, a buffered channel used to hand off tasks between goroutines internally. It holds transactions that have been safely committed but not yet fully processed to the point where new queries can see them.)

I’m working through these:

  1. IO is not maxed out; iostat, metricbeat, etc. confirm this.
  2. Lock or mutex contention (check golang block profiles) — still need to do this.
  3. Load driver not sending enough — tracking dgraph_pending_queries_total shows pending queries are always 1, very seldom 2.
  4. Queries are blocked on large or slow mutations — details below.

We’re pulling Prometheus stats into Elastic/Kibana. I have charts tracking dgraph_num_queries_total filtered on Server.Mutate, dgraph_num_queries_total filtered on Server.Query, dgraph_txn_commits_total, go_goroutines, and dgraph_raft_applych_size for each Alpha.

When dgraph_txn_commits_total goes to zero, dgraph_raft_applych_size peaks at 1000. Then dgraph_raft_applych_size drops to zero and dgraph_txn_commits_total goes high, ~6,000/minute. The two seem to alternate. During this whole time dgraph_num_queries_total (both Mutate and Query) is zero.

What would that indicate?

dgraph_txn_commits_total is the total number of transactions committed since the server started, so you’re probably looking at a Prometheus/Kibana rate (e.g. `rate(dgraph_txn_commits_total[1m])`). If it goes to zero, transactions are not completing (or one or more huge txns are in progress and won’t increment the counter until they finish).

So it seems one or more transactions are stuck, or transactions are blocked for another reason. The simplest explanation would be a huge transaction (a rabbit in the python) bogging down some critical section that holds a lock internal to the server. Until that transaction is fully processed (as indicated by the delay between startTS and MaxAssigned in Jaeger), queries will block too.

This is consistent with the big rush of txns after the stuck phase completes: pending or queued transactions (perhaps smaller ones) then run through the system quickly.

You might also look at edges created per second (`rate(dgraph_num_edges_total[1m])`), because a massive transaction will cause a spike there.

Long story short, the most likely thing is a massive update transaction that clogs the system for a while.