I am running Dgraph Alpha on a c5ad.8xlarge AWS instance with 16 physical cores / 32 vCPUs, 64 GB RAM and a 20 Gbit network. I am hammering it with concurrent queries to see how many concurrent clients that node can handle before it saturates. I would have expected CPU usage to hit 100% as I increase the number of concurrent clients, but I can only ever get it to use 22 CPUs.
The same happens on a c5ad.16xlarge instance with 32 physical cores / 64 vCPUs and 128 GB RAM: only 22 of those CPUs are ever used.
Question: Can you explain what prevents Dgraph alpha from using all available CPU cores?
The network is not the bottleneck, as it has an order of magnitude more capacity than what is transferred over the wire. Response gRPC messages are in the MBs. I have perf-tested the network link and it can sustain 20 Gbit per second.
The local SSD that stores the dataset is also not the bottleneck as the entire dataset fits into the OS file cache. I have measured no disk IO.
Dgraph metadata
dgraph version
Dgraph version : v21.03.0
Dgraph codename : rocket
Dgraph SHA-256 : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1 : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true
Just a tip: BadgerDB (the underlying embedded database) has a good number of tunables, accessible via the badger superflag of dgraph. These include compactor and goroutine counts, and tuning them may help you use all of your CPUs more effectively.
@EnricoMi You might want to try increasing the value of GOMAXPROCS here. We're explicitly setting GOMAXPROCS to 128, which means Go will use at most 128 cores. The default value of this variable is equal to the number of cores available.
It looks hard-coded to 128 for dgraph. Any way to confirm this? I've got 80 cores assigned to a dgraph container and it's only making use of about 15-30% CPU.
Right, just checked, and indeed GOMAXPROCS is hardcoded to 128, which would override any environment variable. That line has been in place since 2017; my guess is that Manish (the author) thought 128 cores would be plenty (headroom for additional threading to be implemented in the future).
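For reference, a minimal Go sketch of the behaviour being described here: the runtime defaults GOMAXPROCS to the number of logical CPUs, the GOMAXPROCS environment variable only changes that initial value, and an explicit runtime.GOMAXPROCS(n) call in the program takes precedence over both.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Passing 0 reports the current setting without changing it.
	// At startup it equals runtime.NumCPU(), unless the GOMAXPROCS
	// environment variable was set.
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS at startup:", runtime.GOMAXPROCS(0))

	// An explicit call like this (what the hardcoded 128 in Dgraph does)
	// overrides whatever the environment variable said.
	runtime.GOMAXPROCS(128)
	fmt.Println("GOMAXPROCS after explicit call:", runtime.GOMAXPROCS(0))
}
```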
My "educated guess" is that for an alpha (with no badger tuning defined) there are only ever ~20 threads that get created to handle the various tasks it has to perform.
My suggestion: try changing some --badger superflag options, particularly this one:
numgoroutines=8; The number of goroutines to use in badger.Stream.
You may find that increasing this (and perhaps others) results in higher CPU utilization when the alpha is under load.
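For context, a rough sketch of the knob this option maps to inside Badger, assuming the v3 Stream API that recent Dgraph releases bundle; the directory path and worker count below are illustrative only.

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Illustrative only: open a Badger directory directly, outside of Dgraph.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// badger.Stream scans the key space with a pool of worker goroutines;
	// Stream.NumGo is the size of that pool. The dgraph
	// `--badger "numgoroutines=N"` superflag is the knob for this kind
	// of parallelism, so raising it lets Badger do more work concurrently.
	stream := db.NewStream()
	stream.NumGo = 16 // default is 8, per the flag description above
	fmt.Println("stream worker goroutines:", stream.NumGo)

	// In a real program you would set stream.Send to consume the
	// key-value batches and then call stream.Orchestrate(ctx).
}
```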
Have you tried running dgraph/dgraph:v21.12.0 to see if there's any difference? Not saying that's a solution, but it would bring more insight to the issue. Those changes you referred to in the referenced post, along with a new bitmap scheme ("roaring bitmaps"), definitely brought some performance improvements with v21, but…
…As you are probably aware, "Zion" (the v21 release & branch) was abandoned for reasons too numerous to go into here.
I have not tried that version to tackle the problem, but it may be worthwhile for the insights it would give the community.
I recall there being some incompatibility issues between v21.12 and the current baseline. Would I have to ingest the data from scratch on the v21.12 line, or do you think I could import an export from the current 23.x line into v21.12?
For any perf issue it is best to first try to figure out what the limit is. Changing settings can work or be informative, but is sort of shooting in the dark.
Typically, low CPU usage means that the DB server is limited in some other way:
IO is maxed out (check Linux iostat output)
there is lock or mutex contention (check golang block profiles)
the load driver is not sending enough load to the server (client logging, go trace, Dgraph's pending-query metric, etc. may help confirm there are actually as many concurrent requests as you think; see the client sketch after this list)
queries are blocked on large or slow mutations. The applyCh will back up, and OpenTelemetry ("Jaeger") traces will show processTask spans in a query blocked on "waiting for startTS / Done waiting for MaxAssigned", which means the transaction timestamp is not yet fully applied and queryable, causing CPU-idle waits.
(Note: "applyCh backing up" means the Prometheus applyCh metric (dgraph_raft_applych_size) is higher than low single digits. This metric reports how many items are sitting in the golang apply channel, a queue used to coordinate the asynchronous handoff of tasks between goroutines. The channel holds transaction info that has been safely committed but not yet fully processed to the point where it can be read by new queries.)
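On the load-driver point, a minimal sketch of a Go client that keeps a fixed number of queries in flight against an Alpha and reports the true in-flight count, using the dgo client library; the endpoint, query, concurrency and duration are placeholders. Comparing the reported in-flight number with dgraph_pending_queries_total on the server shows whether the driver really delivers the intended concurrency.

```go
package main

import (
	"context"
	"log"
	"sync"
	"sync/atomic"
	"time"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

func main() {
	const concurrency = 64 // placeholder: how many queries to keep in flight

	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Placeholder query; use something representative of the real workload.
	const q = `{ q(func: has(name), first: 100) { uid name } }`

	var inflight, done int64
	var wg sync.WaitGroup
	deadline := time.Now().Add(30 * time.Second)

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				atomic.AddInt64(&inflight, 1)
				_, err := dg.NewReadOnlyTxn().Query(context.Background(), q)
				atomic.AddInt64(&inflight, -1)
				if err != nil {
					log.Println("query error:", err)
					return
				}
				atomic.AddInt64(&done, 1)
			}
		}()
	}

	// Report the true in-flight count; it should sit at `concurrency`
	// if the load driver really is saturating the server. Compare this
	// with the server-side dgraph_pending_queries_total metric.
	go func() {
		for range time.Tick(5 * time.Second) {
			log.Printf("in-flight=%d completed=%d",
				atomic.LoadInt64(&inflight), atomic.LoadInt64(&done))
		}
	}()

	wg.Wait()
}
```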
IO is not maxed out; iostat, Metricbeat, etc. confirm this.
Lock or mutex contention (check golang block profiles): still need to do this (see the pprof sketch after this list).
Load driver is not sending enough load: tracking dgraph_pending_queries_total shows pending queries to always be 1, very seldom 2.
Queries are blocked on large or slow mutations: details below.
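For the contention check that is still open, a small sketch that pulls a full goroutine stack dump from an Alpha; Dgraph registers the standard Go pprof handlers on the Alpha's HTTP port (8080 is assumed here), and a dump taken while the server is under load shows where goroutines are parked on locks, channels or IO. Host, port and output file are placeholders.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns a plain-text dump of every goroutine's stack,
	// including what each one is currently blocked on.
	// Placeholder host/port: the Alpha's HTTP endpoint.
	const url = "http://localhost:8080/debug/pprof/goroutine?debug=2"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote goroutines.txt; grep for 'semacquire' to spot mutex waits")
	// The block and mutex profiles at /debug/pprof/block and
	// /debug/pprof/mutex can be examined the same way with
	// `go tool pprof`, when their sampling rates are enabled.
}
```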
We're pulling Prometheus stats into Elastic/Kibana. I have charts tracking: dgraph_num_queries_total filtered on Server.Mutate, dgraph_num_queries_total filtered on Server.Query, dgraph_txn_commits_total, go_goroutines, and dgraph_raft_applych_size for each Alpha.
When dgraph_txn_commits_total goes to zero, dgraph_raft_applych_size peaks at around 1,000.
Then dgraph_raft_applych_size goes to zero and dgraph_txn_commits_total goes high, ~6,000/minute. The two seem to alternate. During this whole time dgraph_num_queries_total (both Mutate and Query) is zero.
dgraph_txn_commits_total is the total number of transactions committed since the server started, so that's probably a Prometheus/Kibana rate you're looking at. If it goes to zero, transactions are not happening (or one or more huge txns are in progress and won't increment the counter until they are done).
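To make the counter-versus-rate point concrete, a small sketch that samples the counter twice and computes a commits-per-minute rate; it assumes the Alpha exposes its Prometheus metrics at /debug/prometheus_metrics on the HTTP port, and the host and interval are placeholders. A flat counter (rate 0) means no transaction finished committing in that window, not that the counter was reset.

```go
package main

import (
	"bufio"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// scrape fetches one cumulative counter value from the Alpha's metrics
// endpoint (placeholder host; endpoint path is an assumption).
func scrape(metric string) float64 {
	resp, err := http.Get("http://localhost:8080/debug/prometheus_metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Match either "name value" or "name{labels} value" lines.
		if strings.HasPrefix(line, metric+" ") || strings.HasPrefix(line, metric+"{") {
			fields := strings.Fields(line)
			if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
				return v
			}
		}
	}
	log.Fatalf("metric %s not found", metric)
	return 0
}

func main() {
	// Two samples one minute apart turn the cumulative counter into the
	// commits-per-minute rate that the Kibana chart is showing.
	before := scrape("dgraph_txn_commits_total")
	time.Sleep(time.Minute)
	after := scrape("dgraph_txn_commits_total")
	log.Printf("dgraph_txn_commits_total rate: %.0f commits/minute", after-before)
}
```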
So it seems like one or more transactions are stuck, or transactions are blocked for another reason. The simplest explanation would be a huge transaction (the rabbit in the python) that is bogging down some critical section that holds a lock internal to the server. Until that transaction is fully processed (as indicated by the delay between startTS and MaxAssigned in Jaeger), queries will block too.
This is consistent with the big rush of txns after the stuck phase completes: transactions (perhaps smaller txns) that were pending or queued up then run through the system quickly.
You might also look at edges created per second (rate of dgraph_num_edges_total) because a massive transaction will cause a spike there.
Long story short, the most likely thing is a massive update transaction that clogs the system for a while.