I am running Dgraph alpha on a c5ad.8xlarge AWS instance with 16 CPUs / 32 vCPUs, 64 GB RAM and a 20 Gbit/s network. I am hammering it with concurrent queries to see how many concurrent clients that node can handle before it saturates. I expected CPU usage to reach 100% as I increase the number of concurrent clients, but I can only get it to use 22 CPUs.
The same happens for a c5ad.16xlarge instance with 32 CPUs / 64 vCPUs and 128 GB RAM. Only 22 of these CPUs are ever used.
Question: Can you explain what prevents Dgraph alpha from using all available CPU cores?
The network is not the bottleneck, as it has an order of magnitude more capacity than what is transferred over the wire. The gRPC response messages are in the megabyte range. I have perf tested the network link and it could transfer 20 Gbit per second.
The local SSD that stores the dataset is also not the bottleneck as the entire dataset fits into the OS file cache. I have measured no disk IO.
Dgraph version : v21.03.0
Dgraph codename : rocket
Dgraph SHA-256 : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1 : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true
Just a tip: BadgerDB (the underlying embedded database) has a good number of tunables, accessible via the badger flag of dgraph. These include the number of compactors and the goroutine counts, and they may help you use all of your CPUs more effectively.
@EnricoMi You might want to try increasing the value of GOMAXPROCS here. We’re explicitly setting GOMAXPROCS to 128, which means Go will use at most 128 cores. The default value of this variable is equal to the number of cores available.
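
To see what the Go runtime itself reports on the instance, a minimal sketch like the one below could be run on the same machine. It is a standalone program and not part of Dgraph; the helper name `effectiveParallelism` is made up for illustration. It only reads the current values through `runtime.GOMAXPROCS(0)` and `runtime.NumCPU()` and changes nothing.

```go
package main

import (
	"fmt"
	"runtime"
)

// effectiveParallelism is a hypothetical helper (not a Dgraph API).
// It returns the current GOMAXPROCS setting and the number of logical
// CPUs that the Go runtime considers usable by this process.
func effectiveParallelism() (maxProcs, numCPU int) {
	// Passing 0 to runtime.GOMAXPROCS queries the value without changing it.
	return runtime.GOMAXPROCS(0), runtime.NumCPU()
}

func main() {
	maxProcs, numCPU := effectiveParallelism()
	fmt.Printf("GOMAXPROCS=%d NumCPU=%d\n", maxProcs, numCPU)

	// GOMAXPROCS caps how many OS threads execute Go code simultaneously.
	if maxProcs < numCPU {
		fmt.Println("Go code is capped below the number of logical CPUs")
	} else {
		fmt.Println("GOMAXPROCS does not limit Go code below the logical CPU count")
	}
}
```

If this prints a GOMAXPROCS value at or above the vCPU count, the cap is probably not what holds usage at 22 CPUs, and the limit would have to be looked for elsewhere.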