Thanks for the build! I restarted the nodes and paused all ingestion load on them, and while it looks like the replays are going faster, the cluster is unhealthy and not responding to any requests (and eventually OOM-killing as it tries to come back online). Here's a 10-minute log from one of the alphas before it OOM'd. It's been in a pattern like this for ~30 mins.
Looks like it reached file id: 531 before crashing, starting from 270. If it starts higher than 270 and closer to 500 on the next restart, then after a few OOMs, it should stabilize.
If you could share the p directory with us, that’d also help us figure out if we can optimize something here.
Hmm, it looks like we're restarting at file id 270 each time we OOM. Here are logs from two adjacent restarts of the same alpha; it stays unhealthy through these.
The last compaction in alphafirst was a level 0 compaction with a score of 138.2. That compaction started at 23:36 and still hadn't completed at 23:49 (13 minutes) when the alpha crashed. It's also interesting that there were so many tables at level 0: a score of 138.2 would mean level 0 has 691 tables (138.2 * 5 = 691; see badger/levels.go at commit aadda9aa in dgraph-io/badger). My guess is that such a huge compaction caused the OOM. Also, decompression is known to take a lot of memory. We reduced the memory used by decompression in badger (https://github.com/dgraph-io/badger/pull/1308), but that's not yet in dgraph.
@seanlaff Starting dgraph with the FileIO table loading mode (dgraph --badger.tables=disk) just so that the replay can catch up might help. Once the replay succeeds, you can kill the alpha and restart it in memory-map mode.
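A sketch of that two-step restart (the `--badger.tables` flag values are from this thread; any other flags your deployment needs stay as they are):

```shell
# Step 1: start the alpha with SSTs read via FileIO so the replay
# doesn't mmap every table into memory. Wait for the replay to finish.
dgraph alpha --badger.tables=disk

# Step 2: stop the alpha, then restart it in the default memory-map
# mode for normal read performance.
dgraph alpha --badger.tables=mmap
```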
Thanks for the deep dive. I gave the badger disk-tables option a shot, but unfortunately I'm not seeing a difference. Here are logs from two adjacent restarts. It appears it's still struggling to break out of that file id 270-to-500 range.
I'm at around 26 GB of memory right now, but throwing more hardware at this is an option. Although, my concern is that even with more hardware, the duration of the replay doesn't seem like it will decrease. Is a node considered unhealthy until the replays finish? (That seems to be my experience, as it's not responding to my API calls.) If so, is there a good HA pattern? A 10 min+ boot time is a tough pill to swallow.
@seanlaff can you get a memory profile around the crash? I can make some changes so that compaction doesn’t eat up all the ram and allows alpha to start. I believe the table building in compaction is causing the crash and a memory profile would confirm my suspicion. I can create a test binary for you which would reduce the memory usage and start the alpha.
Yes, the node is considered unhealthy until the replay completes. The replay is done by badger (the storage layer) and the node cannot serve any requests until the storage layer is ready.
Yes. The WAL replay is causing the issue. Badger is using too much memory.
You can control the size of the vlog (badger WAL) by setting valueLogMaxSize or ValueLogMaxEntries, but these won't have any effect until the replay completes. Only new vlog (badger WAL) files will be affected by these options, not the existing ones.
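For reference, here is roughly how those knobs look when opening badger directly with the v2 options API (a sketch, assuming badger v2's `With*` builders; dgraph normally sets these for you through its own flags, and the path and values here are placeholders):

```go
package main

import badger "github.com/dgraph-io/badger/v2"

func main() {
	// Cap the size and entry count of *new* vlog files; files written
	// before these options took effect keep their original size.
	opts := badger.DefaultOptions("/tmp/badger").
		WithValueLogFileSize(256 << 20). // 256 MB per new vlog file
		WithValueLogMaxEntries(100000)   // at most 100k entries per file
	db, err := badger.Open(opts)
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```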
What is the size of the p directory and the w directory and how many sst and vlog files does each one of them have?
> ls w/*.sst | wc -l
263
> ls w/*.vlog | wc -l
3
> ls p/*.sst | wc -l
4239
> ls p/*.vlog | wc -l
315
Last night this dir was 60 GB, and now it's up to 100 GB. I haven't ingested any new data since then; maybe things are getting left around on disk after each restart, since it doesn't seem to be getting through the replay?
@seanlaff The data from vlog files (badger WAL) is being replayed and added into the LSM tree which is supposed to create more .sst files. The increase in on-disk data is expected.
@seanlaff Would you have some time for a call? I would like to make some changes to badger and run it against your data. If you can’t have a call, I can create a docker image for you with some modifications.
I rebuilt the cluster from scratch (no data) with all the aforementioned changes from the get-go (using the dgraph/dgraph:v20.03.1-l0notinmemory image, --badger.tables=disk, --ludicrous_mode).
On my previous cluster, the logs showed the replay never advancing (it would OOM and restart at the same file id), which is why I wanted to raise the suspicion that the data dir was growing even though the replay was working over the same files again and again (maybe this is intended and later cleaned up by a compaction step).
In this new cluster, given enough time, I was able to get myself into a roughly similar scenario. However, it does seem like the badger replays are making advances, although extremely slowly since it OOM-kills. It's gotten through 1000 file ids in 12 hours.
Since the nodes are a little happier than they were before, I was able to get a heap snapshot, although it only shows a few gigs while the process itself has over 24 GB.
I'm trapped in the same problem: the WAL replay is causing it. Badger uses too much memory and dgraph OOMs after restart.
After trying KeepL0InMemory=false, --badger.tables=disk, and --badger.vlog=disk, the alpha still cannot recover. Do you have any idea how to solve this? It's urgent!
Hey @seanlaff, apologies for the late reply. I got occupied with something else.
I see this in the logs
I0519 12:11:03.853438 1 log.go:34] 455 tables out of 654 opened in 3.012s
I0519 12:11:05.119154 1 log.go:34] All 654 tables opened in 4.277s
I0519 12:11:06.410063 1 log.go:34] Got compaction priority: {level:1 score:1.0842859745025635 dropPrefix:[]}
I0519 12:11:06.410377 1 log.go:34] Running for level: 1
I0519 12:11:06.911194 1 log.go:34] Got compaction priority: {level:1 score:1.014364082366228 dropPrefix:[]}
I0519 12:11:06.911364 1 log.go:34] Running for level: 1
I0519 12:11:06.960332 1 log.go:34] Replaying file id: 2165 at offset: 51302477
I0519 12:11:08.064753 1 log.go:34] Replay took: 1.104368974s
I0519 12:11:08.065946 1 log.go:34] Replaying file id: 2166 at offset: 0
I0519 12:11:09.199184 1 log.go:34] Replay took: 1.133202179s
I0519 12:11:09.243771 1 worker.go:96] Worker listening at address: [::]:7080
I0519 12:11:09.244201 1 groups.go:104] Current Raft Id: 0x1
I0519 12:11:09.244899 1 run.go:477] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
The logs show that opening badger took around 6 seconds, which includes the time taken to open the tables and replay the value log; that is fast enough. So it looks like badger was ready at 12:11:09.199184, and then alpha started its processing.
> In this new cluster, given enough time I was able to get myself in a roughly similar scenario, however it does seem like the badger replays are making advances, although extremely slowly since it oom kills. It's gotten through 1000 file ids in 12 hours.
I don't see anything that ran for 12 hours. Did I miss something? The replay in badger finished within 6 seconds.
I also see the following in your logs
I0519 12:09:31.503582 18 draft.go:788] Blocked pushing to applyCh for 23.919s
I0519 12:09:56.112630 18 draft.go:788] Blocked pushing to applyCh for 24.068s
I0519 12:10:20.234184 18 draft.go:788] Blocked pushing to applyCh for 23.598s
I0519 12:10:47.091160 18 draft.go:788] Blocked pushing to applyCh for 25.838s
I0519 12:12:16.218026 1 draft.go:788] Blocked pushing to applyCh for 22.466s
I0519 12:12:39.619851 1 draft.go:788] Blocked pushing to applyCh for 22.695s
I0519 12:13:06.377018 1 draft.go:788] Blocked pushing to applyCh for 25.768s
I believe @ashishgoswami is already working on fixing this and it should be resolved soon.
slafferty-dgraph-alpha 2020/05/19 02:34:34 read p/017024.sst: bad address
This error seems to be errno 14 (EFAULT) on Linux systems, which happens when there's an issue accessing the disk. Since you're using badger.tables=disk, there should be heavy activity on your disk. Can you monitor the disk usage as well? Maybe this is some intermittent error that gets resolved automatically?
Hey @JimWen, can you please create an issue on the dgraph-io/dgraph GitHub repo with all the information you have? We would need the memory profile and alpha logs. If you can share your p directory, that would be perfect; we need the p directory to figure out what's eating all the RAM. You can tag jarifibrahim on GitHub and I'll find someone to look at the issue right away.
Ah sorry @ibrahim, is the WAL in --badger.tables=disk just a single file id? I was accustomed to it iterating over lots of file ids before I had that flag. So yes, by using badger tables on disk + L0OnDisk, it was only that single 6-second read (as opposed to what I was seeing before in the earlier logs in this thread: 100+ serial replays, each taking over 3s).
I attached the logs from one restart over the 12-hour span; I haven't kept many of the previous runs since the kube logs are ephemeral. The good news is that the WAL replay no longer seems to be the slow part (as you've pointed out). Given that, perhaps my bottleneck is now at the dgraph alpha level and not the badger replays. Unfortunately, all nodes are still at max memory after 2 days of no load on the system.
Regarding disk pressure, here's the alpha over time:
@seanlaff Badger has two kinds of files: vlog and sst files. The vlog is the write-ahead log (which we replay), and it also stores values (if they're bigger than a threshold). The flag --badger.tables=disk means that all sst files are opened in FileIO mode, which can be translated as read the data from the disk. The default mode for sst files is memory-map, which brings the file into memory and swaps it out when needed. (There's a video on YouTube if you want to learn more about badger.)
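In badger v2 terms, that flag maps onto the table loading mode option; a sketch of the two modes (assuming badger v2's options package, which is what dgraph v20.03 uses; the path is a placeholder):

```go
package main

import (
	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

func main() {
	// FileIO: read sst blocks from disk on demand
	// (what --badger.tables=disk selects).
	diskOpts := badger.DefaultOptions("/tmp/badger").
		WithTableLoadingMode(options.FileIO)

	// MemoryMap: mmap the sst files and let the OS page them in and out
	// (the default, what --badger.tables=mmap selects).
	mmapOpts := badger.DefaultOptions("/tmp/badger").
		WithTableLoadingMode(options.MemoryMap)

	_, _ = diskOpts, mmapOpts
}
```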
The interesting bit is how Go manages memory. Even if dgraph were to release the memory, the Go runtime would hold it back so that the next allocation is cheaper. You should look at the metrics dgraph exposes (documented under Self-managed cluster - Deploy). How much memory does go_memstats_heap_idle_bytes show? This is memory the runtime is holding that could be released to the operating system.
So yes, it looks like for alpha1 and alpha2, Go has reserved all the memory but is actively using half. These two alphas stay steady in this state and rarely (if ever) OOM-kill.
The alpha-0, as you can see, is OOM-killed often and has big spikes to max memory, preventing the cluster from recovering.
So I think the question is: what causes the big spikes on alpha-0 that result in OOM kills but don't seem to affect the other two alphas, and can it be mitigated?
The top of the heap is as follows at the beginning of the restart:
(pprof) top
Showing nodes accounting for 14.13GB, 99.68% of 14.18GB total
Dropped 83 nodes (cum <= 0.07GB)
Showing top 10 nodes out of 64
flat flat% sum% cum cum%
7.57GB 53.38% 53.38% 10.88GB 76.76% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
3.31GB 23.38% 76.76% 3.31GB 23.38% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
1.71GB 12.08% 88.84% 1.71GB 12.08% go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
0.86GB 6.08% 94.92% 0.86GB 6.08% github.com/DataDog/zstd.Decompress
0.25GB 1.77% 96.69% 0.25GB 1.77% github.com/dgraph-io/ristretto.newCmRow
0.16GB 1.15% 97.84% 0.16GB 1.15% github.com/dgraph-io/badger/v2/skl.newArena
0.13GB 0.89% 98.73% 0.13GB 0.89% github.com/dgraph-io/ristretto/z.(*Bloom).Size
0.08GB 0.53% 99.26% 0.08GB 0.53% github.com/dgraph-io/badger/v2/y.(*Slice).Resize
0.04GB 0.28% 99.54% 1.84GB 12.95% github.com/dgraph-io/dgraph/raftwal.(*DiskStorage).allEntries.func1
0.02GB 0.14% 99.68% 0.88GB 6.23% github.com/dgraph-io/badger/v2/table.(*Table).block
Then memory usage starts to grow continuously, and the top of the heap is:
(pprof) top
Showing nodes accounting for 24883.52MB, 99.51% of 25006.19MB total
Dropped 89 nodes (cum <= 125.03MB)
Showing top 10 nodes out of 58
flat flat% sum% cum cum%
12240.50MB 48.95% 48.95% 12240.50MB 48.95% go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
7751.06MB 31.00% 79.95% 11145.16MB 44.57% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
3394.10MB 13.57% 93.52% 3394.10MB 13.57% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
883.45MB 3.53% 97.05% 883.45MB 3.53% github.com/DataDog/zstd.Decompress
257.58MB 1.03% 98.08% 257.58MB 1.03% github.com/dgraph-io/ristretto.newCmRow
166.41MB 0.67% 98.75% 166.41MB 0.67% github.com/dgraph-io/badger/v2/skl.newArena
128.79MB 0.52% 99.26% 128.79MB 0.52% github.com/dgraph-io/ristretto/z.(*Bloom).Size
40.63MB 0.16% 99.43% 12367.03MB 49.46% github.com/dgraph-io/dgraph/raftwal.(*DiskStorage).allEntries.func1
20.50MB 0.082% 99.51% 903.95MB 3.61% github.com/dgraph-io/badger/v2/table.(*Table).block
0.50MB 0.002% 99.51% 11146.66MB 44.58% github.com/dgraph-io/badger/v2/table.OpenTable
So I guess the OOM is caused by raftpb.(*Entry).Unmarshal. I added logging to track the process, and it shows that the OOM is caused by loading all entries into memory at once here.
To solve this problem, we should fix the issue in raft here. Luckily, we can reference a similar earlier commit, dgraph-io/dgraph@69d9403, which has worked fine until now.
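The shape of that fix: instead of unmarshalling every raft entry in one pass, deliver them to the caller in size-bounded batches, so memory stays near the batch cap rather than the size of the whole log. A hedged sketch with hypothetical types (the real change lives in dgraph's raftwal layer, along the lines of dgraph-io/dgraph@69d9403):

```go
package main

import "fmt"

// entry is a stand-in for raftpb.Entry; only its payload size matters here.
type entry struct{ data []byte }

// forEachBatch delivers entries to process() in chunks whose total payload
// stays under maxBatchBytes, so the caller never holds the whole log at once.
func forEachBatch(entries []entry, maxBatchBytes int, process func([]entry)) {
	var batch []entry
	size := 0
	for _, e := range entries {
		if size+len(e.data) > maxBatchBytes && len(batch) > 0 {
			process(batch)
			batch, size = nil, 0
		}
		batch = append(batch, e)
		size += len(e.data)
	}
	if len(batch) > 0 {
		process(batch) // flush the final partial batch
	}
}

func main() {
	// Ten 4-byte entries with a 10-byte cap: delivered as five batches of two.
	entries := make([]entry, 10)
	for i := range entries {
		entries[i] = entry{data: make([]byte, 4)}
	}
	calls := 0
	forEachBatch(entries, 10, func([]entry) { calls++ })
	fmt.Println("batches:", calls) // prints: batches: 5
}
```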