Our team upgraded Dgraph from v23.1.0 to v24.0.2 yesterday.
Unfortunately, this morning we had to revert the changes, as we saw very bad performance (compared to v23.1.0). To give an overview of our cluster configuration:
- We are running on a single Hetzner AX162-R machine (alpha and zero share this machine)
- CPU AMD EPYC™ 9454P 48 Core
- Memory: 1.12 TB
- We run the alpha with the following cache configuration:
--cache "size-mb=580197; percentage=10,65,25;"
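For reference, here is a minimal sketch of how this cache budget would split. It assumes the three percentages map, in order, to the posting list cache, the Badger block cache, and the Badger index cache; that ordering is my reading of the `--cache` flag help, so please correct me if it is wrong. The names in `CACHE_NAMES` are just labels for this example.

```python
# Assumed order of the three percentages (my reading of the --cache flag help):
# posting list cache, Badger block cache, Badger index cache.
CACHE_NAMES = ("posting_list", "block", "index")


def split_cache(size_mb, percentages):
    """Return the MB budget per cache for a given size-mb and percentage triple."""
    if len(percentages) != len(CACHE_NAMES):
        raise ValueError("expected exactly three percentages")
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: size_mb * pct / 100 for name, pct in zip(CACHE_NAMES, percentages)}


# Values taken from our alpha configuration above.
budget = split_cache(580197, (10, 65, 25))
for name, mb in budget.items():
    print(f"{name}: {mb:,.0f} MB")
```

Under that assumption, the split is roughly 58,020 MB, 377,128 MB, and 145,049 MB respectively.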
Do you have any hints on how we can improve our configuration in order to run v24.0.2? Is there a known regression that could limit our performance? Below you can find performance metrics of the system. If necessary, we can share log output of the alpha; however, I couldn’t find anything unusual in it. Apart from being very slow, the system operated as usual.
Interestingly, memory usage is much lower with v24, which suggests to me that the system is potentially not caching correctly. This led to the hypothesis that the system is becoming CPU bound, as we also observe much higher CPU usage. If you can think of any additional metrics that could be helpful, I am happy to share them.
Below we graph metrics recorded during operation of v24 and compare them with v23 over the same period one day earlier. The x-axis covers the night period, which has very low traffic (usually from users outside our timezone). It spans Sep 18, 11:30 pm to Sep 19, 9:30 am, and data points are aggregated over 2 minutes.
Latency
This shows p75 latency over the gRPC channel to Dgraph. Note that v24.0.2 uses the left y-axis and v23.1.0 uses the right y-axis, capped at 20s and 150ms respectively. Also, our latency generally maxes out at 20s, due to a hard 20s deadline that we enforce.
Memory usage
Here we see v24 consuming approx. 65G of memory vs. 120G while operating with v23. We double-checked that size-mb=580197 was correctly set in both cases.
CPU usage
Here, we observe that CPU usage is much higher and less predictable with v24. Although the graph does not show CPU reaching 100%, since the data is aggregated over 2 minutes, drilling into hot areas of the graph reveals plateaus where the Alpha process maxes out the CPU.
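To illustrate why the 2-minute aggregation hides these plateaus, here is a small sketch. The samples are synthetic and made up for this example, not measured data: one 2-minute window of per-second readings in which the CPU is saturated for 30 seconds and mostly idle for the rest.

```python
from statistics import mean


def summarize(samples):
    """Return the mean and the peak of a window of CPU utilization samples (percent)."""
    return {"mean": mean(samples), "max": max(samples)}


# Synthetic per-second readings for one 2-minute window (illustrative only):
# 30 seconds fully saturated, followed by 90 seconds at 20%.
window = [100.0] * 30 + [20.0] * 90

stats = summarize(window)
print(f"mean over window: {stats['mean']:.0f}%, peak: {stats['max']:.0f}%")
```

An averaged data point for such a window sits far below 100% even though the process was pinned for a quarter of it, which is why I looked at the finer-grained data as well.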
This situation is a bit scary, as we rely heavily on Dgraph as our primary data source. We had already been observing other problems on v23, which motivated the upgrade to v24. The above results give us the feeling that we are at a dead end.