Performance issues with v24.0.2

Our team upgraded Dgraph from v23.1.0 to v24.0.2 yesterday.
Unfortunately, this morning we had to revert the change, as we saw very poor performance compared to v23.1.0. To give an overview of our cluster configuration:

  • We are running on a single Hetzner AX162-R machine (alpha and zero share this machine)
  • CPU: AMD EPYC™ 9454P, 48 cores
  • Memory: 1.12 TB
  • We run the alpha with the following cache configuration (a rough breakdown follows below): --cache "size-mb=580197; percentage=10,65,25;"
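To make the split concrete, here is a quick sketch of how we read that flag: the three percentages divide the size-mb budget between Dgraph's internal caches. The slot labels used below are our assumption (the first slot being the posting list cache is confirmed later in this thread), not something we verified in the source.

// Sketch: how we interpret --cache "size-mb=580197; percentage=10,65,25;".
// The slot names are our assumption about which cache each percentage feeds.
package main

import "fmt"

func main() {
	const totalMB = 580197

	slots := []struct {
		name string
		pct  int
	}{
		{"posting list cache", 10},
		{"pstore block cache", 65},
		{"pstore index cache", 25},
	}

	for _, s := range slots {
		mb := totalMB * s.pct / 100
		fmt.Printf("%-20s %7d MB (~%d GB)\n", s.name, mb, mb/1024)
	}
}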

Do you have any hints on how we can improve our configuration in order to run v24.0.2? Is there a known regression which could limit our performance? Below you can find performance metrics of the system. If necessary, we can share the alpha log output, but I couldn't find anything unusual there. Apart from being extremely slow, the system operated as usual.

What is interesting to observe is that memory usage is much lower with v24, which suggests to me that the system is potentially not caching correctly. This led to the hypothesis that the system is getting CPU bound, as we also observe much higher CPU usage. If you can think of any additional metrics that could be helpful, I am happy to share them.

Below we graph metrics recorded during operation of v24 and compare them with operation of v23 over the same period one day earlier. The x-axis covers the night period, which shows very low traffic (usually from users outside our timezone). It spans Sep 18, 11:30 pm – Sep 19, 9:30 am, and data points are aggregated over 2 minutes.

Latency

This shows p75 latency over the gRPC channel to Dgraph. It's important to observe that v24.0.2 uses the left y-axis and v23.1.0 uses the right y-axis. Note that these axes are capped at 20 s and 150 ms respectively. Also, in general our latency maxes out at 20 s, due to a hard 20 s deadline that we enforce (see the sketch below).
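For context, this is roughly how that deadline and latency measurement look on the client side. This is a sketch using the dgo client; the package name, dgo import version, and function shape are assumptions, not our exact code.

// Sketch: run a query with the hard 20s deadline and report the elapsed
// time, which feeds the p75 latency metric shown above.
package metrics // hypothetical package name for this sketch

import (
	"context"
	"time"

	"github.com/dgraph-io/dgo/v230" // dgo version is an assumption
	"github.com/dgraph-io/dgo/v230/protos/api"
)

func timedQuery(dg *dgo.Dgraph, q string) (*api.Response, time.Duration, error) {
	// Enforce the hard 20s deadline on every query.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	start := time.Now()
	resp, err := dg.NewReadOnlyTxn().Query(ctx, q)
	return resp, time.Since(start), err
}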

Memory usage

Here we see v24 consuming approx. 65 GB of memory vs. 120 GB while operating with v23. We double-checked that size-mb=580197 was correctly set in both cases.

CPU usage

Here, we observe that CPU usage is much higher and less predictable. Additionally, although you don't see CPU reaching 100% since the data is aggregated over 2 minutes, when drilling into hot areas of the graph we can see a plateau with the alpha process utilizing the CPU to the max.

This situation is a bit scary: we rely heavily on Dgraph as our primary data source. We had already been observing other problems on v23, which motivated the upgrade to v24. The above results give us the feeling that we are at a dead end.


Hey @Joschka - I have had similar issues with older Dgraph binaries. The underlying problem could be in Dgraph itself (my guess is Badger).

I would be curious to learn more about your data patterns and the queries you are executing. I would be interested in helping you out (if you are open to it).

I spent months optimizing workloads in the past, when I learnt that Dgraph couldn't deliver on certain performance constraints I had.

Hi @honeybadger

Thanks for your reply. Our issue is not with an older binary though, but with the newest version of Dgraph (release grabbed from the GitHub release channel). We have been using Dgraph since v1.x (I just checked, the earliest commit in our codebase that mentions Dgraph is from March 2018) and have never seen performance get worse when upgrading to a newer Dgraph version.

We have seen bad performance with Dgraph in the past, but could always trace the problem back to a bad mutation or query on our side. In this case, however, the only changing parameter is the Dgraph version, which leaves us a bit clueless about how to proceed.

Hi Joschka, we have recently made some changes to the cache; maybe this is happening because of that. Can you let me know what queries you are running and whether you have any way to reproduce this, so that I can test it?
Also, can you try downgrading right now (it should be a simple binary switch) to confirm that the issue is the upgrade itself? We can easily fix this in the next release if the upgrade causes performance problems.

Hi @harshil_goel

  • We essentially run a very wide set of queries across the system, and I am not sure that I can attribute the problem to any specific query. Diagnosing the system, it seems that every query performed equally poorly.

  • We downgraded on Sep 19 at 9:30 am, after running v24 from Sep 18, 11:30 pm – Sep 19, 9:30 am. After downgrading, the system continued to operate normally.

We usually test Dgraph versions in development before releasing to production, to make sure that nothing breaks. Running Dgraph v24 for some weeks, we couldn't see any major issues during development, which is why we concluded that it was safe to upgrade production.
The difference between dev and prod is obviously the load on the system. In development, memory usage typically never exceeds 16 GB. During normal prod operation, the alpha process consumes between 90 and 120 GB of memory, sometimes even more. That's why it's probably tricky to reproduce the issue.
I should perhaps emphasize again that this also constitutes a major difference in behavior between v23 and v24. During v24 operation, we saw much lower memory usage than usual, as you can see in the graph above.

Perhaps we can discuss the changes that you made to the caching system? If I understood them better, I could map them to properties of our system that could break because of the changes, and we could more easily try to create a reproducible scenario.

Hey @Joschka, so you were using cache percentage=10,65,25. The first 10% that you provided went to the Posting List Cache. This is the cache that we are modifying. The problem was that this cache seemed to break consistency at points, which was hard to debug. So we came up with a new cache mechanism that is well tested by Jepsen. This mechanism is not yet tested under heavy load and could cause OOM issues, so we have limited the cache to only dgraph.type for now. If you are okay with a custom build, we could remove this limitation and let you try the binary? It should be significantly faster if you are seeing performance issues with v24.

OK, now things get interesting.

  • I am not particularly worried about OOM. Our prod system has 1.12 TB of memory available, of which we normally never use more than 200 GB.
  • When you say 'This is the cache that we are modifying', do you mean that this is the cache system you have recently been working on?
  • The primary reason for us to initiate the upgrade to v24 was actually a consistency issue. However, since we also couldn't reproduce it, we wanted to try upgrading to v24 first, before contacting anyone. Let me describe the inconsistency that we saw (a minimal reproduction sketch follows the steps); maybe it's similar to issues you have been observing:
  1. We have defined a predicate as part of our schema: predA: uid @reverse .
  2. We create a new node B, referencing an existing node A through that predicate: {"uid": "_:B", "predA": {"uid": "A"}}. The mutation is successful.
  3. We query B through the reverse relationship: q(func: uid(A)) { ~predA { uid } }. Here we identify the inconsistency: B does not show up in the result of the query.
  4. We identified 2 ways to fix the inconsistency:
    a. create a new node C that uses the same predicate predA to reference A
    b. run Dgraph /alter to reapply the schema.

After running either (a) or (b), the relationship A→B through predA could be queried correctly.
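For completeness, here is a minimal sketch of the reproduction using the dgo client. The dgo import version, the alpha address (localhost:9080), and the uid used for node A are placeholders/assumptions, not our production setup.

// Reproduction sketch for the @reverse inconsistency described above.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v230" // dgo version is an assumption
	"github.com/dgraph-io/dgo/v230/protos/api"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	ctx := context.Background()

	// Step 1: the schema with the reverse index.
	if err := dg.Alter(ctx, &api.Operation{Schema: `predA: uid @reverse .`}); err != nil {
		log.Fatal(err)
	}

	uidOfA := "0x1234" // placeholder uid of the existing node A

	// Step 2: create B referencing A through predA.
	mu := &api.Mutation{
		CommitNow: true,
		SetJson:   []byte(fmt.Sprintf(`{"uid": "_:B", "predA": {"uid": "%s"}}`, uidOfA)),
	}
	if _, err := dg.NewTxn().Mutate(ctx, mu); err != nil {
		log.Fatal(err)
	}

	// Step 3: query B through the reverse edge. On the affected version,
	// B was missing from this result until fix (a) or (b) was applied.
	q := fmt.Sprintf(`{ q(func: uid(%s)) { ~predA { uid } } }`, uidOfA)
	resp, err := dg.NewReadOnlyTxn().Query(ctx, q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Json))
}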

I would be happy to try a custom build. Maybe you would even be open to scheduling a call to test the build together? You might have a better eye for the alpha log output.

Yeah, this could definitely have been caused by the older cache that we used (the posting list cache). For now, if you are able to build Dgraph yourself, you can just modify this file (sorry, we will make it a flag soon):

posting/mvcc.go: in the ShouldGoInCache function, remove the strings.HasSuffix(pk.Attr, "dgraph.type") part

This would make it so that all the items in your database get cached.
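For clarity, a sketch of what the edited function would look like after removing that suffix check:

// posting/mvcc.go (sketch): with the dgraph.type suffix check removed,
// the new cache is no longer limited to dgraph.type keys.
func ShouldGoInCache(pk x.ParsedKey) bool {
	return !pk.IsData()
}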

I would love to get on a call with you; I will figure out how to get it scheduled.


Hi @harshil_goel

After we got the hint that the PostingListCache could have been the source of our issues with @reverse, and following your suggestion regarding ShouldGoInCache, we did 2 experiments:

  1. Using a v23 binary, we removed --cache="percentage=10,65,25", which in our understanding reverts to the default of 0% for the PostingListCache.
  2. We built a modified v24 binary with your suggested change:
 func ShouldGoInCache(pk x.ParsedKey) bool {
-       return (!pk.IsData() && strings.HasSuffix(pk.Attr, "dgraph.type"))
+       return (!pk.IsData())
 }

Here are our findings:

Between 4:50 pm and 5:00 pm, our experience running without the PostingListCache was terrible. Latencies reached extremely high numbers. At 5:00 pm we reverted.
Between 5:21 pm and 5:41 pm, we were running the modified v24 binary. Latencies were much more manageable compared to our previous experience with v24 (graphs from the original post), but still high in comparison to v23 + PostingListCache (as can be observed from 5:00 pm till 5:20 pm).
In this graph, for the period of the second experiment (modified v24), note the red dashed curve. It represents the CPU usage of the alpha process and follows the right y-axis, in percent.

This is the CPU usage of the alpha process during the second experiment.

This is the memory usage of the alpha process during the second experiment.
What we observe is that the CPU increase follows memory allocation. Additionally, we have noticed that memory gets deallocated much more eagerly than in our v23 experience. In v23, as can be seen in the graphs of the original post, memory usage is more stable.

I am still interested in discussing this in a call. I am sure you have questions about our usage of Dgraph that could be clarified much more easily in a synchronous conversation. Looking forward to hearing more from you.

This is in line with what we expected to see. This new cache is still being developed, so it's a bit slower right now. We are working on improving it. Can you also try once with ShouldGoInCache() returning true directly? This would put more items into the cache.

I am about to end my day; I can follow up on this on Monday. Just to be sure, you are talking about this patch?

modified   posting/mvcc.go
@@ -535,6 +535,7 @@ func (c *CachePL) Set(l *List, readTs uint64) {
 }
 
 func ShouldGoInCache(pk x.ParsedKey) bool {
+	return true
 	return (!pk.IsData() && strings.HasSuffix(pk.Attr, "dgraph.type"))
 }

Yeah that’s correct. Could you also record CPU profiles next time you run it? That would help us figure out where to focus for improvements.
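If it helps, here is a minimal sketch of one way to grab a CPU profile from a running alpha over its pprof HTTP endpoint. The default HTTP port 8080 and the profile duration are assumptions; adjust them to your setup.

// Sketch: fetch a 30-second CPU profile from the alpha's pprof endpoint
// and write it to disk for inspection with `go tool pprof`.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:8080/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("alpha_cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote alpha_cpu.pprof; inspect with: go tool pprof alpha_cpu.pprof")
}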

Sure, will do that :+1:


I just ran another experiment with the patched ShouldGoInCache function. Unfortunately, the results don't look too promising.

This shows Dgraph client latency within the 5 minutes that I ran the patched binary. v23 is on the left y-axis; the modified v24 version is on the right y-axis. As you can see, where latency for v23 is measured in milliseconds, the latency for the modified v24 is measured in seconds. The average for v24 is 500 ms; the average for v23 is 22 ms.

As discussed, this is the output of dgraph debuginfo with the modified version of v24 running in production. It doesn't include the zero files, nor the dgraph trace file, since the latter is bigger than the max file size that I can upload here (222 MB). If you are interested in this file as well, I can try to provide it through different means.

alpha_block.gz (718 Bytes)
alpha_jemalloc.gz (193 Bytes)
alpha_metrics.gz (174.1 KB)
alpha_mutex.gz (719 Bytes)
alpha_state.gz (305.5 KB)
alpha_threadcreate.gz (890 Bytes)
alpha_cpu.gz (90.7 KB)
alpha_goroutine.gz (1.4 MB)
alpha_health.gz (185 Bytes)
alpha_heap.gz (82.2 KB)

Would you have time now for a call?
I am not able to tell from the charts, but was the last version you tried better than this or worse? The one where the "!pk.IsData()" check was still there? From the profile it looks like the cache is overburdened. Right now the cache is implemented very cautiously, and memory management is still not done. We are going to keep improving the efficiency of the new cache while maintaining ACID compliance. Meanwhile, if this doesn't suffice for you, we could re-introduce the posting list cache under a flag. However, we don't recommend this setting.

hey guys, was there any resolution to these issues from your call that you'd be willing to share? I'm curious how it may relate to my own deployments.


@raddi, we have decided to bring back the posting list cache behind a flag, so that you can continue to use it. However, it could have consistency-related issues. Meanwhile, we will try to improve the performance of the new cache and maybe merge it with Ristretto.


Are you releasing a v24.0.3 to include the flag?

Yes, we are releasing v24.0.3. It will include the old flag as it was, until we phase it out again.
Meanwhile, we have been testing an upgrade to the new cache layer. It has been passing our CI and Jepsen tests for some time now. If you could test out this branch, it would give us more confidence in releasing it. Right now we are going to refactor the code a bit to enhance readability.

Just ran your branch with our CI/CD tests, and everything in our application seems functionally OK. Performance was roughly the same for us, but we are typically bottlenecked by Dgraph write speed, which is probably unaffected by this cache change? One small note: there are still some print statements left behind; I get a lot of these:

2024-10-07 14:27:45 Deleting old items
2024-10-07 14:27:45 Done deleting old items
