Bulk load OOM when loading a big dataset

Moved from GitHub dgraph/5574

Posted by JimWen:

What version of Dgraph are you using?

Dgraph version : v20.03.3
Dgraph SHA-256 : 08424035910be6b6720570427948bab8352a0b5a6d59a0d20c3ec5ed29533121
Commit SHA-1 : fa3c19120
Commit timestamp : 2020-06-02 16:47:25 -0700
Branch : HEAD
Go version : go1.14.1

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

128 GB RAM & 1.8 TB SSD

Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019

Steps to reproduce the issue (command/config used to run Dgraph).

Bulk load a big dataset.
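A typical invocation looks roughly like this (paths and flag values are illustrative, not the exact command used; flag names can differ by release, so check dgraph bulk --help):

dgraph bulk -f /data/rdf -s /data/schema.txt --map_shards=4 --reduce_shards=1 --zero=localhost:5080 --out=/data/out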

Expected behaviour and actual result.

The log is as follows:

[14:50:58+0800] REDUCE 02h56m55s 28.96% edge_count:2.924G edge_speed:879.5k/sec plist_count:1.837G plist_speed:552.4k/sec. Num Encoding: 0
[14:50:59+0800] REDUCE 02h56m56s 28.96% edge_count:2.924G edge_speed:879.3k/sec plist_count:1.837G plist_speed:552.3k/sec. Num Encoding: 0
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x195a44b, 0x16)
        /usr/local/go/src/runtime/panic.go:1114 +0x72
runtime.sysMap(0xddd4000000, 0xc8000000, 0x2834d78)
        /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
runtime.(*mheap).sysAlloc(0x281f840, 0xc5800000, 0x281f848, 0x62b4a)
        /usr/local/go/src/runtime/malloc.go:715 +0x1cd
runtime.(*mheap).grow(0x281f840, 0x62b4a, 0x0)
        /usr/local/go/src/runtime/mheap.go:1286 +0x11c
runtime.(*mheap).allocSpan(0x281f840, 0x62b4a, 0x0, 0x2834d88, 0xfffffffffffffade)
        /usr/local/go/src/runtime/mheap.go:1124 +0x6a0
runtime.(*mheap).alloc.func1()
        /usr/local/go/src/runtime/mheap.go:871 +0x64
runtime.(*mheap).alloc(0x281f840, 0x62b4a, 0xc000000100, 0xc6a8164180)
        /usr/local/go/src/runtime/mheap.go:865 +0x81
runtime.largeAlloc(0xc5694000, 0xc6bd610001, 0x4)
        /usr/local/go/src/runtime/malloc.go:1152 +0x92
runtime.mallocgc.func1()
        /usr/local/go/src/runtime/malloc.go:1047 +0x46
runtime.systemstack(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1041

goroutine 6534 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0006e8bf8 sp=0xc0006e8bf0 pc=0x9b16b0
runtime.mallocgc(0xc5694000, 0x172afa0, 0xc6a8bdfe01, 0x9ad6f0)
        /usr/local/go/src/runtime/malloc.go:1046 +0x895 fp=0xc0006e8c98 sp=0xc0006e8bf8 pc=0x957fb5
runtime.growslice(0x172afa0, 0xdc9429a000, 0x68abeb8, 0x69492aa, 0x697152d, 0xae1a8, 0xc6555, 0xae1a8)
        /usr/local/go/src/runtime/slice.go:181 +0x1e2 fp=0xc0006e8d00 sp=0xc0006e8c98 pc=0x998c52
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).reduce(0xcbc1d060f0, 0xc6e13a8000, 0x4ede, 0x5000, 0xc6e0b50000, 0x7e3, 0x900, 0xc6ba306040)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:364 +0x48f fp=0xc0006e8e50 sp=0xc0006e8d00 pc=0x156394f
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).run.func1(0xc00039e700, 0xc6a54e4000, 0x3, 0x64, 0xcbc1d060f0, 0x0, 0xd6a0c87400)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:91 +0x53e fp=0xc0006e8fa8 sp=0xc0006e8e50 pc=0x1568d4e
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc0006e8fb0 sp=0xc0006e8fa8 pc=0x9b37c1
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).run
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:67 +0x240

goroutine 1 [select, 55 minutes]:
github.com/dgraph-io/badger/v2/y.(*Throttle).Do(0xc00039e700, 0x19b72c8, 0xc00039e700)
        /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/y/y.go:267 +0xb9
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).run(0xcbc1d060f0, 0xcbc1d060f0, 0x19456b7)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:64 +0x273
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).reduceStage(0xc0007401b0)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/loader.go:299 +0x8e
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.run()
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/run.go:241 +0xd99
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.init.0.func1(0xc00018e780, 0xc00001e0b0, 0x0, 0xb)
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/run.go:49 +0x62
github.com/spf13/cobra.(*Command).execute(0xc00018e780, 0xc00001e000, 0xb, 0xb, 0xc00018e780, 0xc00001e000)
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0x2616600, 0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute()
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:70 +0x80
main.main()
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:78 +0xb0

It seems to be caused by loading all the predicate keys into memory during the reduce phase.

JimWen commented :

Here is the heap profile:

Showing top 10 nodes out of 41
      flat  flat%   sum%        cum   cum%
 2776.80MB 27.05% 27.05%  5058.40MB 49.27%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapIterator).startBatching
 2279.60MB 22.20% 49.25%  2281.61MB 22.22%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapIterator).startBatching.func1
 1963.88MB 19.13% 68.38%  2852.92MB 27.79%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).toList.func2
  573.61MB  5.59% 73.96%   837.12MB  8.15%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
  533.14MB  5.19% 79.16%   533.14MB  5.19%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
  531.53MB  5.18% 84.33%   531.53MB  5.18%  github.com/dgraph-io/dgraph/protos/pb.(*PostingList).Marshal
  430.54MB  4.19% 88.53%   430.54MB  4.19%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).reduce
     297MB  2.89% 91.42%      297MB  2.89%  github.com/dgraph-io/badger/v2/y.Copy (inline)
  263.51MB  2.57% 93.99%   263.51MB  2.57%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
     138MB  1.34% 95.33%      138MB  1.34%  github.com/dgraph-io/badger/v2/table.NewTableBuilder
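(That listing is in go tool pprof's top format. For anyone reproducing this, the bulk loader serves pprof over HTTP while it runs, so a similar snapshot can be taken with something like the line below; localhost:8080 is the assumed default for the loader's --http address and may need adjusting.)

go tool pprof -top http://localhost:8080/debug/pprof/heap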

ashish-goswami commented :

Might be related to Inconsistent bulk loader failures (dgraph-io/dgraph#5361).

ashish-goswami commented :

Hi @JimWen, as mentioned in #5361, can you please repeat bulk loading with v1.1.1, if possible. Let us know if bulk load is successful or not.
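One way to do that without touching the existing install is to run the older image directly, e.g. (volume paths and the zero address are placeholders, and some flag names changed between v1.1.1 and v20.03, so double-check dgraph bulk --help inside the container):

docker run --rm -v /data:/data dgraph/dgraph:v1.1.1 dgraph bulk -f /data/rdf -s /data/schema.txt --map_shards=4 --reduce_shards=1 --zero=zero:5080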

JimWen commented :

Hi @JimWen, as mentioned in #5361, can you please repeat bulk loading with v1.1.1, if possible. Let us know if bulk load is successful or not.

ok i will try it later

JimWen commented :

@ashish-goswami v1.1.1 still OOMs, but it OOMs later than this version: the reduce phase gets much further before the crash.

harshil-goel commented :

Can you try increasing the number of shards (both map and reduce) and check again? It would be really helpful.
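For example, something along these lines (the values are only a starting point, not tuned recommendations; verify the exact flag names against dgraph bulk --help for your version):

dgraph bulk -f /data/rdf -s /data/schema.txt --map_shards=8 --reduce_shards=2 --num_go_routines=4 --mapoutput_mb=1024 --zero=localhost:5080

Roughly speaking, more map shards produce smaller and more evenly sized reduce shards, and lowering --num_go_routines or --mapoutput_mb trades throughput for a lower peak memory footprint.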