@ibrahim I think this is a raft problem.
The top of the heap profile at the beginning of the restart looks like this:
```
(pprof) top
Showing nodes accounting for 14.13GB, 99.68% of 14.18GB total
Dropped 83 nodes (cum <= 0.07GB)
Showing top 10 nodes out of 64
      flat  flat%   sum%        cum   cum%
    7.57GB 53.38% 53.38%    10.88GB 76.76%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
    3.31GB 23.38% 76.76%     3.31GB 23.38%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
    1.71GB 12.08% 88.84%     1.71GB 12.08%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
    0.86GB  6.08% 94.92%     0.86GB  6.08%  github.com/DataDog/zstd.Decompress
    0.25GB  1.77% 96.69%     0.25GB  1.77%  github.com/dgraph-io/ristretto.newCmRow
    0.16GB  1.15% 97.84%     0.16GB  1.15%  github.com/dgraph-io/badger/v2/skl.newArena
    0.13GB  0.89% 98.73%     0.13GB  0.89%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
    0.08GB  0.53% 99.26%     0.08GB  0.53%  github.com/dgraph-io/badger/v2/y.(*Slice).Resize
    0.04GB  0.28% 99.54%     1.84GB 12.95%  github.com/dgraph-io/dgraph/raftwal.(*DiskStorage).allEntries.func1
    0.02GB  0.14% 99.68%     0.88GB  6.23%  github.com/dgraph-io/badger/v2/table.(*Table).block
```
Then memory usage starts to grow continuously, and the top of the heap becomes:
```
(pprof) top
Showing nodes accounting for 24883.52MB, 99.51% of 25006.19MB total
Dropped 89 nodes (cum <= 125.03MB)
Showing top 10 nodes out of 58
      flat  flat%   sum%        cum   cum%
12240.50MB 48.95% 48.95% 12240.50MB 48.95%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
 7751.06MB 31.00% 79.95% 11145.16MB 44.57%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
 3394.10MB 13.57% 93.52%  3394.10MB 13.57%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
  883.45MB  3.53% 97.05%   883.45MB  3.53%  github.com/DataDog/zstd.Decompress
  257.58MB  1.03% 98.08%   257.58MB  1.03%  github.com/dgraph-io/ristretto.newCmRow
  166.41MB  0.67% 98.75%   166.41MB  0.67%  github.com/dgraph-io/badger/v2/skl.newArena
  128.79MB  0.52% 99.26%   128.79MB  0.52%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
   40.63MB  0.16% 99.43% 12367.03MB 49.46%  github.com/dgraph-io/dgraph/raftwal.(*DiskStorage).allEntries.func1
   20.50MB 0.082% 99.51%   903.95MB  3.61%  github.com/dgraph-io/badger/v2/table.(*Table).block
    0.50MB 0.002% 99.51% 11146.66MB 44.58%  github.com/dgraph-io/badger/v2/table.OpenTable
```
So I guess the OOM is caused by raftpb.(*Entry).Unmarshal. I added logs to track the process, and they show that the OOM is caused by loading all entries into memory at once here:
https://github.com/etcd-io/etcd/blob/a4ada8cb1f1cd7e6504a82e5b6bdf15f4bfd90c1/raft/raft.go#L915
```go
// noLimit means every unapplied entry is pulled into memory in one call.
ents, err := r.raftLog.slice(r.raftLog.applied+1, r.raftLog.committed+1, noLimit)
if err != nil {
	r.logger.Panicf("unexpected error getting unapplied entries (%v)", err)
}
```
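The idea behind the fix is to pull those entries in bounded batches instead of all at once. Below is a minimal, self-contained sketch of that batching pattern; the `entry` type, `entryStore` interface, and batch-size parameter are assumptions for illustration only, not the actual etcd/raft internals or the code in the PR.

```go
package raftbatch

import "fmt"

// entry and entryStore are hypothetical stand-ins for raftpb.Entry and the
// raft log/storage. Entries is assumed to return entries in [lo, hi) whose
// total size stays under maxSize bytes.
type entry struct {
	Index uint64
	Data  []byte
}

type entryStore interface {
	Entries(lo, hi, maxSize uint64) ([]entry, error)
}

// forEachCommitted walks applied+1 .. committed in batches capped at
// maxBatchBytes, handing each batch to apply() before fetching the next,
// so only one batch is resident in memory at a time.
func forEachCommitted(s entryStore, applied, committed, maxBatchBytes uint64,
	apply func([]entry) error) error {
	next := applied + 1
	for next <= committed {
		ents, err := s.Entries(next, committed+1, maxBatchBytes)
		if err != nil {
			return fmt.Errorf("getting entries [%d, %d): %v", next, committed+1, err)
		}
		if len(ents) == 0 {
			return fmt.Errorf("storage returned no entries at index %d", next)
		}
		if err := apply(ents); err != nil {
			return err
		}
		next = ents[len(ents)-1].Index + 1
	}
	return nil
}
```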
To solve this problem, the fix belongs in raft itself, and luckily there is a similar earlier commit to reference, dgraph-io/dgraph@69d9403, which has worked fine until now.
I have submitted a PR to raft (retrieve all entries in batches to prevent OOM by JimWen · Pull Request #11929 · etcd-io/etcd · GitHub). To solve this OOM problem, just modify the code to use the raft from that PR. It works fine in my environment now.
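If you want to try it before the PR is merged, one way to point Dgraph at a patched raft is a `replace` directive in go.mod; the fork path and version below are placeholders, not the actual location of the PR branch:

```
// go.mod sketch: build against an etcd fork that contains the PR.
// The right-hand side is a placeholder; substitute wherever the branch lives.
replace go.etcd.io/etcd => github.com/yourfork/etcd v0.0.0-00010101000000-000000000000
```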
You can give it a try @seanlaff, good luck.