Dgraph can't idle without being OOM-killed after large data ingestion

Thanks for the build! I restarted the nodes and paused all ingestion load on them, and while it looks like the replays are going faster, the cluster is unhealthy and not responding to any requests (and eventually OOM-killing as it tries to come back online). Here's a 10-minute log from one of the alphas before it OOM'd. It's been in a pattern like this for ~30 minutes.

dgalphaL0 (111.2 KB)

If there are any traces or higher log levels that would be useful, I can grab those too.

Looks like it reached file id 531 before crashing, starting from 270. If it starts higher than 270 and closer to 500 on the next restart, then after a few OOMs it should stabilize.

If you could share the p directory with us, that’d also help us figure out if we can optimize something here.

CC: @martinmr @Paras

Hrmm, it looks like we're restarting at file id 270 each time we OOM. Here are logs from two adjacent restarts of the same alpha; it stays unhealthy through both.

dgalphafirst (116.6 KB)
dgalphasecond (16.6 KB)

The p dir is fairly large (60 GB), but I think I might be able to recreate this same behavior on a smaller dataset. I'll look into that.

grep -B 1 Running dgalpha*
dgalphafirst-I0514 23:35:37.475028      15 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
dgalphafirst:I0514 23:35:37.475121      15 log.go:34] Running for level: 0
--
dgalphafirst-I0514 23:35:44.475297      15 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
dgalphafirst:I0514 23:35:44.475408      15 log.go:34] Running for level: 0
--
dgalphafirst-I0514 23:36:52.854922      15 log.go:34] Got compaction priority: {level:0 score:138.2 dropPrefix:[]}
dgalphafirst:I0514 23:36:52.855418      15 log.go:34] Running for level: 0
--
dgalphasecond-I0514 23:52:06.686791      17 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
dgalphasecond:I0514 23:52:06.686883      17 log.go:34] Running for level: 0
--
dgalphasecond-I0514 23:53:54.932609      17 log.go:34] Got compaction priority: {level:0 score:187.4 dropPrefix:[]}
dgalphasecond:I0514 23:53:54.933253      17 log.go:34] Running for level: 0
grep DONE dgalpha*
dgalphafirst:I0514 23:35:41.786281      15 log.go:34] Compaction for level: 0 DONE
dgalphafirst:I0514 23:35:47.885393      15 log.go:34] Compaction for level: 0 DONE
dgalphasecond:I0514 23:52:10.579006      17 log.go:34] Compaction for level: 0 DONE
tail -n 1 dgalpha*                    
==> dgalphafirst <==
I0514 23:49:11.886245      15 log.go:34] Got compaction priority: {level:1 score:11.04490564391017 dropPrefix:[]}

==> dgalphasecond <==
I0514 23:54:43.787863      17 log.go:34] Replaying file id: 301 at offset: 0

The last compaction in alphafirst was a level 0 compaction with a score of 138.2. That compaction started at 23:36 and still hadn't completed at 23:49 (13 minutes later) when the alpha crashed. It's also interesting that there were so many tables at level 0: a score of 138.2 means level 0 had about 691 tables (138.2 * 5 = 691; see levels.go in dgraph-io/badger at commit aadda9aa). My guess is that such a huge compaction caused the OOM. Also, decompression is known to take a lot of memory. We reduced the memory used by decompression in badger (https://github.com/dgraph-io/badger/pull/1308), but that change is not yet in dgraph.
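For reference, a rough sketch of how that score maps to a table count, assuming badger's default threshold of 5 level-0 tables before compaction kicks in (the helper names here are mine, not badger's API):

// Illustrative sketch of how badger derives the level-0 compaction score
// (score = numL0Tables / NumLevelZeroTables, with NumLevelZeroTables = 5 by
// default), so a score of 138.2 implies roughly 691 tables sitting on L0.
package main

import (
	"fmt"
	"math"
)

const numLevelZeroTables = 5 // badger's default L0 compaction threshold

func l0Score(numL0Tables int) float64 {
	return float64(numL0Tables) / float64(numLevelZeroTables)
}

func tablesForScore(score float64) int {
	return int(math.Round(score * numLevelZeroTables))
}

func main() {
	fmt.Printf("score for 691 L0 tables: %.1f\n", l0Score(691))          // ~138.2
	fmt.Printf("tables behind score 138.2: %d\n", tablesForScore(138.2)) // ~691
	fmt.Printf("tables behind score 187.4: %d\n", tablesForScore(187.4)) // ~937 in the second restart
}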

I believe the high number of tables on L0 is a side effect of a change we made some time ago: https://github.com/dgraph-io/badger/commit/3747be59ebbbf34382ab8d84207151c7f7014da2

@seanlaff Starting dgraph with the FileIO table loading mode (dgraph ... --badger.tables=disk) just so that the replay can catch up might help. Once the replay succeeds, you can kill the alpha and restart it in memory-map mode.

Thanks for the deep dive. I gave the badger disk tables option a shot, but unfortunately I'm not seeing a difference. Here are logs from two adjacent restarts. It appears it's still struggling to break out of that file id 270-to-500 range.

alphadiskout (121.6 KB)
alphadiskout2 (147.2 KB)

I'm around 26 GB of memory right now, but throwing more hardware at this is an option. Although, my concern is that even with more hardware, it doesn't seem like the duration of the replay will decrease. Is a node considered unhealthy until the replays finish? (That seems to be my experience, as it's not responding to my API calls.) If so, is there a good HA pattern, as a 10-minute-plus boot time is a tough pill to swallow :slight_smile:

I0515 12:19:46.469558 ...Got compaction priority: {level:0 score:276 dropPrefix:[]}
I0515 12:05:03.448493 ...Got compaction priority: {level:0 score:218.2 dropPrefix:[]}

It looks like more and more data is accumulating on level 0.

Is it possible there's a decently sized WAL from the original crash that is competing with the replay?

@seanlaff Can you get a memory profile around the crash? I can make some changes so that compaction doesn't eat up all the RAM and allows the alpha to start. I believe the table building during compaction is causing the crash, and a memory profile would confirm my suspicion. I can create a test binary for you which would reduce the memory usage and let the alpha start.
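If curl or the browser times out against a loaded alpha, something like the sketch below may help. It assumes the alpha's HTTP port is 8080 and that the standard pprof handler is exposed at /debug/pprof/heap; treat it as illustrative and adjust for your setup:

// Hypothetical helper to pull a heap profile from an alpha's pprof endpoint,
// using a generous timeout since a heavily loaded alpha may respond slowly.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Minute}
	resp, err := client.Get("http://localhost:8080/debug/pprof/heap")
	if err != nil {
		log.Fatalf("fetching heap profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("alpha-heap.pb.gz")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing profile: %v", err)
	}
	log.Println("wrote alpha-heap.pb.gz; inspect with: go tool pprof alpha-heap.pb.gz")
}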

Yes, the node is considered unhealthy until the replay completes. The replay is done by badger (the storage layer) and the node cannot serve any requests until the storage layer is ready.

Yes, the WAL replay is causing the issue. Badger is using too much memory.

Ah cool, makes sense. Is there any tuning exposed for controlling the size of the WAL?

I'm having trouble grabbing a heap snapshot; it seems the alpha is so loaded that it won't respond to my pprof requests.

You can control the size of the vlog (badger's WAL) by setting ValueLogFileSize or ValueLogMaxEntries, but these won't have any effect until the replay completes. Only new vlog (badger WAL) files will be affected by these options, not the existing ones.
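For illustration, here's roughly how those knobs look when opening badger v2 directly. Dgraph sets these internally from its own flags, so this is only a sketch, not how dgraph itself is configured:

// Illustrative only: the value-log knobs when opening badger v2 directly.
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	opts := badger.DefaultOptions("/data/p").
		WithValueLogFileSize(256 << 20). // cap each vlog file at 256 MB
		WithValueLogMaxEntries(500000)   // or cap by entry count per vlog file

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// As noted above, these only shape *new* vlog files; existing ones
	// still have to be replayed in full on the next start.
}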

What is the size of the p directory and the w directory and how many sst and vlog files does each one of them have?

> du -h
16K	./lost+found
3.7G	./w
95G	./p
98G	.
> ls w/*.sst | wc -l
263
> ls w/*.vlog | wc -l
3
> ls p/*.sst | wc -l
4239
> ls p/*.vlog | wc -l
315

Last night this dir was 60 GB, and now it's up to 100 GB. I haven't ingested any new data since then; maybe things are getting left around on disk after each restart, since it doesn't seem to be getting through the replay?

@seanlaff The data from the vlog files (badger's WAL) is being replayed and added into the LSM tree, which creates more .sst files. The increase in on-disk data is expected.

@seanlaff Would you have some time for a call? I would like to make some changes to badger and run it against your data. If you can't do a call, I can create a docker image for you with some modifications.

Hey @ibrahim, sorry for the delay. Sure, that'd be great; I'm free anytime.

I rebuilt the cluster from scratch (no data) with all the aforementioned changes from the get-go (using the dgraph/dgraph:v20.03.1-l0notinmemory image, --badger.tables=disk, --ludicrous_mode).

On my previous cluster, the logs showed the replay never advancing (it would OOM and restart at the same file id), which is why I raised the suspicion that the data dir was growing even though the replay was working through the same files over and over again (maybe this is intended and later cleaned up by a compaction step).

In this new cluster, given enough time I was able to get myself into a roughly similar scenario. However, it does seem like the badger replays are making progress, although extremely slowly because of the OOM kills: it's gotten through 1000 file ids in 12 hours.

Since the nodes are a little happier than they were before, I was able to get a heap snapshot, although it only shows a few gigs while the process itself is using over 24 GB.

pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.050.pb.gz (32.4 KB)

Here are some logs from one of the alphas; you can see the slowly advancing file ids.

slowalpha (498.1 KB)

It's quite rare, but occasionally I'll see a crash like this:

slafferty-dgraph-alpha I0519 02:34:27.504530       1 draft.go:788] Blocked pushing to applyCh for 24.084s
slafferty-dgraph-alpha I0519 02:34:27.965769       1 log.go:34] LOG Compact 1->2, del 3 tables, add 2 tables, took 1.658174484s
slafferty-dgraph-alpha I0519 02:34:27.965876       1 log.go:34] Compaction for level: 1 DONE
slafferty-dgraph-alpha I0519 02:34:27.965924       1 log.go:34] Got compaction priority: {level:1 score:1.0181028097867966 dropPrefix:[]}
slafferty-dgraph-alpha I0519 02:34:27.966017       1 log.go:34] Running for level: 1
slafferty-dgraph-alpha 2020/05/19 02:34:34 read p/017024.sst: bad address
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2/y.Wrap
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/y/error.go:71
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2/y.Check
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/y/error.go:43
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2/table.(*Table).readNoFail
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/table/table.go:331
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2/table.(*Table).readIndex
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/table/table.go:362
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2/table.(*Table).initBiggestAndSmallest
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/table/table.go:284
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2/table.OpenTable
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/table/table.go:250
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2.(*DB).handleFlushTask
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/db.go:1014
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2.(*DB).flushMemtable
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/db.go:1036
slafferty-dgraph-alpha github.com/dgraph-io/badger/v2.Open.func4
slafferty-dgraph-alpha     /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200316175624-91c31ebe8c22/db.go:361
slafferty-dgraph-alpha runtime.goexit
slafferty-dgraph-alpha     /usr/local/go/src/runtime/asm_amd64.s:1373
slafferty-dgraph-alpha stream closed

I'm trapped in the same problem: the WAL replay causes badger to use too much memory, and dgraph OOMs after a restart.
After trying KeepL0InMemory=false, --badger.tables=disk, and --badger.vlog=disk, the alpha still cannot recover. Do you have any idea how to solve this? It's urgent!

Hey @seanlaff, apologies for the late reply. I got occupied with something else.

I see this in the logs

I0519 12:11:03.853438       1 log.go:34] 455 tables out of 654 opened in 3.012s
I0519 12:11:05.119154       1 log.go:34] All 654 tables opened in 4.277s
I0519 12:11:06.410063       1 log.go:34] Got compaction priority: {level:1 score:1.0842859745025635 dropPrefix:[]}
I0519 12:11:06.410377       1 log.go:34] Running for level: 1
I0519 12:11:06.911194       1 log.go:34] Got compaction priority: {level:1 score:1.014364082366228 dropPrefix:[]}
I0519 12:11:06.911364       1 log.go:34] Running for level: 1
I0519 12:11:06.960332       1 log.go:34] Replaying file id: 2165 at offset: 51302477
I0519 12:11:08.064753       1 log.go:34] Replay took: 1.104368974s
I0519 12:11:08.065946       1 log.go:34] Replaying file id: 2166 at offset: 0
I0519 12:11:09.199184       1 log.go:34] Replay took: 1.133202179s
I0519 12:11:09.243771       1 worker.go:96] Worker listening at address: [::]:7080
I0519 12:11:09.244201       1 groups.go:104] Current Raft Id: 0x1
I0519 12:11:09.244899       1 run.go:477] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql

The logs show that opening badger took around 6 seconds, which includes the time taken to open the tables and replay the value log; that's fast enough. So it looks like badger was ready at 12:11:09.199184, and then the alpha started its processing.

In this new cluster, given enough time I was able to get myself into a roughly similar scenario. However, it does seem like the badger replays are making progress, although extremely slowly because of the OOM kills: it's gotten through 1000 file ids in 12 hours.

I don't see anything that ran for 12 hours. Did I miss something? The replay in badger finished within 6 seconds.

I also see the following in your logs

I0519 12:09:31.503582      18 draft.go:788] Blocked pushing to applyCh for 23.919s
I0519 12:09:56.112630      18 draft.go:788] Blocked pushing to applyCh for 24.068s
I0519 12:10:20.234184      18 draft.go:788] Blocked pushing to applyCh for 23.598s
I0519 12:10:47.091160      18 draft.go:788] Blocked pushing to applyCh for 25.838s
I0519 12:12:16.218026       1 draft.go:788] Blocked pushing to applyCh for 22.466s
I0519 12:12:39.619851       1 draft.go:788] Blocked pushing to applyCh for 22.695s
I0519 12:13:06.377018       1 draft.go:788] Blocked pushing to applyCh for 25.768s

I believe @ashishgoswami is already working on fixing this and it should be resolved soon.

From the heap profile you shared, I see that decompression was taking a lot of memory, which has been fixed in badger and should soon be in dgraph (https://github.com/dgraph-io/badger/pull/1308). A significant chunk of memory was used by the table index, and we have an issue open on badger to fix this: Optimize memory usage of table.TableIndex (dgraph-io/badger#1335).

slafferty-dgraph-alpha 2020/05/19 02:34:34 read p/017024.sst: bad address                                                                                                      

This error seems to be error 14 (EFAULT, "bad address") on Linux systems, which can happen when there's an issue accessing the disk. Since you're using --badger.tables=disk, there will be heavy activity on your disk. Can you monitor disk usage as well? Maybe this is an intermittent error that resolves itself.

Hey @JimWen, can you please create an issue on the dgraph-io/dgraph GitHub repository with all the information you have? We would need the memory profile and alpha logs. If you can share your p directory, that would be perfect; we need the p directory to figure out what's eating all the RAM. You can tag jarifibrahim on GitHub and I'll find someone to look at the issue right away.

Ah sorry @ibrahim, is the WAL in --badger.tables=disk just a single file id? I was accustomed to it iterating over lots of file ids before I had that flag. So yes, by using badger tables on disk + L0OnDisk, it was only that single 6-second read (as opposed to the 100+ serial replays, each taking over 3s, that I was seeing earlier in this thread).

I attached the logs from one restart over the 12-hour span; I haven't kept many of the previous runs since the kube logs are ephemeral. The good news is that the WAL replay no longer seems to be the slow part (as you've pointed out). Given that, perhaps my bottleneck is now at the dgraph alpha level rather than the badger replays. Unfortunately, all nodes are still at max memory after 2 days of no load on the system.

Regarding disk pressure, here's the alpha over time:

It's a 1 TB SSD with estimated performance of:

Sustained random IOPS limit	30,720.00	30,000.00
Sustained throughput limit (MB/s)	491.52	491.52

I see that bad-address panic occasionally, but it may be a transient disk thing.

@seanlaff Badger has two kinds of files: vlog and sst files. The vlog is the write-ahead log (which we replay), and it also stores values if they're bigger than a threshold. The flag --badger.tables=disk means all sst files are opened in FileIO mode, which translates to reading the data from disk. The default mode for sst files is memory-map, which maps the files into memory and swaps them out when needed. (There's a video on YouTube if you want to learn more about badger.)
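Roughly, that flag corresponds to badger's table loading mode option. Here's a sketch of the two modes at the badger level (illustrative only, since dgraph wires this up from its own flags):

// Rough sketch of what --badger.tables=disk maps to in badger v2: opening
// sst files in FileIO mode instead of the default memory-map mode.
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

func main() {
	opts := badger.DefaultOptions("/data/p")

	// --badger.tables=disk  ~>  read sst data from disk on demand.
	opts = opts.WithTableLoadingMode(options.FileIO)

	// The default (~ --badger.tables=mmap) maps sst files into memory and
	// lets the OS page them in and out as needed:
	// opts = opts.WithTableLoadingMode(options.MemoryMap)

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}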

The interesting bit is how Go manages memory. Even if dgraph were to release the memory, the Go runtime would hold onto it so that the next allocation is cheaper. You should look at the metrics exposed by dgraph (documented under Self-managed cluster - Deploy). How much memory does go_memstats_heap_idle_bytes show? This is memory the runtime is holding that could be released to the operating system.
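For reference, the runtime counters behind that metric can also be read directly inside any Go process; a small illustrative snippet:

// Illustrative: the runtime counters behind go_memstats_heap_idle_bytes.
// HeapIdle minus HeapReleased is memory the Go runtime holds but could
// return to the OS; HeapInuse is what the heap is actively using.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	mb := func(b uint64) float64 { return float64(b) / (1 << 20) }

	fmt.Printf("HeapInuse:    %.1f MB\n", mb(m.HeapInuse))
	fmt.Printf("HeapIdle:     %.1f MB\n", mb(m.HeapIdle))
	fmt.Printf("HeapReleased: %.1f MB\n", mb(m.HeapReleased))
	fmt.Printf("Retained but reclaimable: %.1f MB\n", mb(m.HeapIdle-m.HeapReleased))
}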

Ah, gotcha @ibrahim, thanks for the clarification.

Here’s the go memstats profile.

So yes, it looks like for alpha1 and alpha2, Go has reserved all the memory but is actively using only half. These two alphas stay steady in this state and rarely (if ever) get OOM-killed.

Alpha-0, as you can see, is OOM-killed often and has big spikes to max memory, preventing the cluster from recovering.

So I think the question is: what causes the big spikes on alpha-0 that result in OOM kills but don't seem to affect the other two alphas, and can it be mitigated?

@ibrahim I think this is a Raft problem.

The top of the heap profile at the beginning of the restart is as follows:

(pprof) top
Showing nodes accounting for 14.13GB, 99.68% of 14.18GB total
Dropped 83 nodes (cum <= 0.07GB)
Showing top 10 nodes out of 64
      flat  flat%   sum%        cum   cum%
    7.57GB 53.38% 53.38%    10.88GB 76.76%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
    3.31GB 23.38% 76.76%     3.31GB 23.38%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
    1.71GB 12.08% 88.84%     1.71GB 12.08%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
    0.86GB  6.08% 94.92%     0.86GB  6.08%  github.com/DataDog/zstd.Decompress
    0.25GB  1.77% 96.69%     0.25GB  1.77%  github.com/dgraph-io/ristretto.newCmRow
    0.16GB  1.15% 97.84%     0.16GB  1.15%  github.com/dgraph-io/badger/v2/skl.newArena
    0.13GB  0.89% 98.73%     0.13GB  0.89%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
    0.08GB  0.53% 99.26%     0.08GB  0.53%  github.com/dgraph-io/badger/v2/y.(*Slice).Resize
    0.04GB  0.28% 99.54%     1.84GB 12.95%  github.com/dgraph-io/dgraph/raftwal.(*DiskStorage).allEntries.func1
    0.02GB  0.14% 99.68%     0.88GB  6.23%  github.com/dgraph-io/badger/v2/table.(*Table).block

The memory usage then starts to grow continuously, and the top of the heap becomes:

(pprof) top
Showing nodes accounting for 24883.52MB, 99.51% of 25006.19MB total
Dropped 89 nodes (cum <= 125.03MB)
Showing top 10 nodes out of 58
      flat  flat%   sum%        cum   cum%
12240.50MB 48.95% 48.95% 12240.50MB 48.95%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
 7751.06MB 31.00% 79.95% 11145.16MB 44.57%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
 3394.10MB 13.57% 93.52%  3394.10MB 13.57%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
  883.45MB  3.53% 97.05%   883.45MB  3.53%  github.com/DataDog/zstd.Decompress
  257.58MB  1.03% 98.08%   257.58MB  1.03%  github.com/dgraph-io/ristretto.newCmRow
  166.41MB  0.67% 98.75%   166.41MB  0.67%  github.com/dgraph-io/badger/v2/skl.newArena
  128.79MB  0.52% 99.26%   128.79MB  0.52%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
   40.63MB  0.16% 99.43% 12367.03MB 49.46%  github.com/dgraph-io/dgraph/raftwal.(*DiskStorage).allEntries.func1
   20.50MB 0.082% 99.51%   903.95MB  3.61%  github.com/dgraph-io/badger/v2/table.(*Table).block
    0.50MB 0.002% 99.51% 11146.66MB 44.58%  github.com/dgraph-io/badger/v2/table.OpenTable

So I guess the OOM is caused by raftpb.(*Entry).Unmarshal. I added logging to track the process, and it shows that the OOM is caused by loading all entries into memory at once here:

https://github.com/etcd-io/etcd/blob/a4ada8cb1f1cd7e6504a82e5b6bdf15f4bfd90c1/raft/raft.go#L915

ents, err := r.raftLog.slice(r.raftLog.applied+1, r.raftLog.committed+1, noLimit)
if err != nil {
	r.logger.Panicf("unexpected error getting unapplied entries (%v)", err)
}

To solve this, we need to fix the problem in raft itself. Luckily, we can reference a similar earlier commit in dgraph (dgraph-io/dgraph@69d9403), which has worked fine until now.

I have submitted a PR to raft: retrieve all entries in batches to prevent OOM (etcd-io/etcd#11929). To solve this OOM problem, just modify the code to use the raft from that PR. It has been working fine in my environment.
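The idea, roughly (this is a sketch of the batching approach with illustrative names, not the actual code from the PR): walk the committed-but-unapplied range in size-bounded chunks instead of one noLimit slice, so the peak allocation stays bounded.

// Sketch of the batching idea: process committed-but-unapplied entries in
// size-bounded chunks so only maxBatchBytes of entries are unmarshalled and
// resident at a time. EntrySlicer stands in for raftLog's slice(lo, hi, maxSize).
package raftbatch

import pb "go.etcd.io/etcd/raft/raftpb"

type EntrySlicer interface {
	Slice(lo, hi, maxSize uint64) ([]pb.Entry, error)
}

func numPendingConfInBatches(l EntrySlicer, lo, hi, maxBatchBytes uint64) (int, error) {
	pending := 0
	for lo < hi {
		// Each call returns at most ~maxBatchBytes of entries, keeping the
		// peak memory flat instead of loading the whole range at once.
		ents, err := l.Slice(lo, hi, maxBatchBytes)
		if err != nil {
			return 0, err
		}
		if len(ents) == 0 {
			break
		}
		for _, e := range ents {
			if e.Type == pb.EntryConfChange {
				pending++
			}
		}
		lo = ents[len(ents)-1].Index + 1
	}
	return pending, nil
}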

You can give it a try, @seanlaff. Good luck.
