Alpha node running OOM

Dgraph version: v21.03.1

I am trying to load Wikidata. The dataset is about 700 GB and roughly 6_000_000_000 records.
I tried to load it using the bulk loader but ran into an OOM issue; I've created a separate topic for that, 'OOM during bulk loading'.

Example records:

<https://www.wikidata.org/wiki/Special:EntityData/L25> <http://schema.org/about> <http://www.wikidata.org/entity/L25> .
<https://www.wikidata.org/wiki/Special:EntityData/L25> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Dataset> .
<http://www.wikidata.org/entity/L314-F1> <http://wikiba.se/ontology#grammaticalFeature> <http://www.wikidata.org/entity/Q146786> .
<http://www.wikidata.org/entity/L314-F1> <http://www.w3.org/ns/lemon/ontolex#representation> "pis"@ca .
<http://www.wikidata.org/entity/L314-F1> <http://www.w3.org/2000/01/rdf-schema#label> "pis"@ca .

Example schema:

<http://www.w3.org/ns/lemon/ontolex#representation>: string @lang .
<http://www.w3.org/2000/01/rdf-schema#label>: string @lang .
<http://www.wikidata.org/prop/direct/P5831>: string @lang .
<http://wikiba.se/ontology#lemma>: string @lang .
<http://www.wikidata.org/prop/statement/P5831>: string @lang .
<http://www.w3.org/2004/02/skos/core#definition>: string @lang .
<http://www.wikidata.org/prop/direct/P5187>: string @lang .
<http://www.wikidata.org/prop/statement/P5187>: string @lang .

Cluster setup:

The machine is configured with 750 GB of RAM.

About 1_200_000_000 records were loaded successfully, but then an OOM occurred:

I1003 17:09:18.329158      22 log.go:34] 2 [term: 904] received a MsgVote message with higher term from 1 [term: 905]
I1003 17:09:18.329199      22 log.go:34] 2 became follower at term 905
I1003 17:09:18.329215      22 log.go:34] 2 [logterm: 904, index: 2512553, vote: 0] cast MsgVote for 1 [logterm: 904, index: 2512553] at term 905
I1003 17:09:18.329875      22 log.go:34] raft.node: 2 elected leader 1 at term 905
I1003 17:09:18.517119      22 draft.go:211] Operation started with id: opSnapshot
I1003 17:09:18.517165      22 draft.go:136] Operation completed with id: opRollup
I1003 17:09:18.517203      22 snapshot.go:295] Got StreamSnapshot request: context:<id:3 group:1 addr:"alpha2:7081" > index:2501003 read_ts:2909272 since_ts:2784958
I1003 17:09:18.517295      22 snapshot.go:204] Waiting to reach timestamp: 2909272
I1003 17:09:18.519573      22 log.go:34] Sending Snapshot Streaming about 88 GiB of uncompressed data (43 GiB on disk)
I1003 17:09:18.644671      22 log.go:34] Number of ranges found: 17
I1003 17:09:18.644712      22 log.go:34] Sent range 0 for iteration: [000000000000000000002d687474703a2f2f7777772e77332e6f72672f323030342f30322f736b6f732f636f726523707265664c6162656c00000000001d72f138ffffffffffe4f3fc, 0000000000000000000033687474703a2f2f7777772e77696b69646174612e6f72672f70726f702f6469726563742d6e6f726d616c697a65642f5033353600000000001f6037ffffffffffffdb0c86) of size: 18 GiB
I1003 17:09:18.644728      22 log.go:34] Sent range 1 for iteration: [0000000000000000000029687474703a2f2f7777772e77696b69646174612e6f72672f70726f702f6469726563742f503238363000000000001ca34b8bffffffffffe5bf9e, 000000000000000000002a687474703a2f2f7777772e77332e6f72672f323030302f30312f7264662d736368656d61236c6162656c0000000000179ee3aeffffffffffe89a91) of size: 18 GiB
I1003 17:09:18.644738      22 log.go:34] Sent range 2 for iteration: [000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e000000000017260b96ffffffffffea84c5, 000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e00000000001d931c8affffffffffe2ac97) of size: 18 GiB
I1003 17:09:18.644747      22 log.go:34] Sent range 3 for iteration: [0000000000000000000017687474703a2f2f736368656d612e6f72672f61626f757400000000001f9c4469ffffffffffd6ce5d, 000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e0000000000138a3276fffffffffff5c37d) of size: 18 GiB
I1003 17:09:18.644755      22 log.go:34] Sent range 4 for iteration: [000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e000000000015bbc889ffffffffffeeabda, 000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e000000000017260b96ffffffffffea84c5) of size: 18 GiB
I1003 17:09:18.644767      22 log.go:34] Sent range 5 for iteration: [000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e00000000001d931c8affffffffffe2ac97, 000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e00000000001ead66efffffffffffdede8b) of size: 18 GiB
I1003 17:09:18.644774      22 log.go:34] Sent range 6 for iteration: [000000000000000000002a687474703a2f2f7777772e77332e6f72672f323030302f30312f7264662d736368656d61236c6162656c0000000000179ee3aeffffffffffe89a91, 000000000000000000002c687474703a2f2f7777772e77332e6f72672f323030342f30322f736b6f732f636f726523616c744c6162656c00000000001e3d4516ffffffffffe08a6c) of size: 18 GiB
I1003 17:09:18.644780      22 log.go:34] Sent range 7 for iteration: [000000000000000000002c687474703a2f2f7777772e77332e6f72672f323030342f30322f736b6f732f636f726523616c744c6162656c00000000001e3d4516ffffffffffe08a6c, 000000000000000000002d687474703a2f2f7777772e77332e6f72672f323030342f30322f736b6f732f636f726523707265664c6162656c00000000001d72f138ffffffffffe4f3fc) of size: 18 GiB
I1003 17:09:18.644789      22 log.go:34] Sent range 8 for iteration: [000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e00000000001f01b105ffffffffffddc1c5, 000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e00000000001f632fc2ffffffffffdb2e3e) of size: 18 GiB
I1003 17:09:18.644796      22 log.go:34] Sent range 9 for iteration: [0000000000000000000027687474703a2f2f7777772e77696b69646174612e6f72672f70726f702f6469726563742f50323100000000001ef54607ffffffffffd71c3d, 0000000000000000000028687474703a2f2f7777772e77696b69646174612e6f72672f70726f702f6469726563742f5036323500000000001e40f38affffffffffe1bb82) of size: 18 GiB
I1003 17:09:18.644805      22 log.go:34] Sent range 10 for iteration: [, 0000000000000000000017687474703a2f2f736368656d612e6f72672f61626f757400000000001f9c4469ffffffffffd6ce5d) of size: 18 GiB
I1003 17:09:23.054402      22 admin.go:868] namespace: 0. No GraphQL schema in Dgraph; serving empty GraphQL API
I1003 17:09:23.519663      22 log.go:34] Sending Snapshot [05s] Scan (8): ~880.0 MiB/88 GiB at 0 B/sec. Sent: 144.0 MiB at 0 B/sec. jemalloc: 1.1 GiB
I1003 17:09:29.253545      22 log.go:34] Sending Snapshot [10s] Scan (8): ~1.3 GiB/88 GiB at 42 MiB/sec. Sent: 192.0 MiB at 4.2 MiB/sec. jemalloc: 1.0 GiB
I1003 17:09:33.519747      22 log.go:34] Sending Snapshot [15s] Scan (8): ~4.7 GiB/88 GiB at 299 MiB/sec. Sent: 352.0 MiB at 15 MiB/sec. jemalloc: 1.0 GiB
I1003 17:09:34.194883      22 log.go:34] Sent range 11 for iteration: [0000000000000000000028687474703a2f2f7777772e77696b69646174612e6f72672f70726f702f6469726563742f5036323500000000001e40f38affffffffffe1bb82, 0000000000000000000029687474703a2f2f7777772e77696b69646174612e6f72672f70726f702f6469726563742f503238363000000000001ca34b8bffffffffffe5bf9e) of size: 18 GiB
I1003 17:09:35.326957      22 log.go:34] Sent range 12 for iteration: [000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e000000000014461d9cfffffffffff1e03d, 000000000000000000001d687474703a2f2f736368656d612e6f72672f6465736372697074696f6e000000000015bbc889ffffffffffeeabda) of size: 18 GiB
fatal error: out of memory

goroutine 29553 [running]:
runtime.throw(0x20e9612, 0xd)
        /usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc002599468 sp=0xc002599438 pc=0xa6fa32
github.com/dgraph-io/ristretto/z.Calloc(0x3739, 0x20ed2eb, 0x10, 0x3739, 0x2, 0x0)
        /go/pkg/mod/github.com/dgraph-io/ristretto@v0.0.4-0.20210504190834-0bf2acd73aa3/z/calloc_jemalloc.go:78 +0x1c7 fp=0xc0025994a8 sp=0xc002599468 pc=0xec7ae7
github.com/dgraph-io/badger/v3/table.(*Table).decompress(0xc006545680, 0xc01663f420, 0x698, 0x1)
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/table/table.go:810 +0x545 fp=0xc002599580 sp=0xc0025994a8 pc=0xfbaa45
github.com/dgraph-io/badger/v3/table.(*Table).block(0xc006545680, 0x698, 0x1, 0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/table/table.go:580 +0x37a fp=0xc002599768 sp=0xc002599580 pc=0xfb8a5a
github.com/dgraph-io/badger/v3/table.(*Iterator).next(0xc0001749a0)
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/table/iterator.go:331 +0xd0 fp=0xc0025997b8 sp=0xc002599768 pc=0xfb42d0
github.com/dgraph-io/badger/v3/table.(*Iterator).next(0xc0001749a0)
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/table/iterator.go:348 +0x1ed fp=0xc002599808 sp=0xc0025997b8 pc=0xfb43ed
github.com/dgraph-io/badger/v3/table.(*Iterator).Next(0xc0001749a0)
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/table/iterator.go:406 +0x3e fp=0xc002599820 sp=0xc002599808 pc=0xfb485e
github.com/dgraph-io/badger/v3/table.(*ConcatIterator).Next(0xc0032d1db0)
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/table/iterator.go:529 +0x33 fp=0xc002599880 sp=0xc002599820 pc=0xfb5133
github.com/dgraph-io/badger/v3/table.(*node).next(0xc0024e30c0)
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/table/merge_iterator.go:81 +0x3d fp=0xc002599898 sp=0xc002599880 pc=0xfb579d
github.com/dgraph-io/badger/v3/table.(*MergeIterator).Next(0xc0024e3080)
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/table/merge_iterator.go:157 +0x38 fp=0xc002599900 sp=0xc002599898 pc=0xfb5b98

I decided to increase the Badger memory configuration and added the following to each Alpha:

--badger="memtablesize=67108864000;basetablesize=209715200;baselevelsize=10485760;valuethreshold=1048576;blocksize=4960;blockcachesize=6979321850;indexcachesize=3758096380"
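For reference, those values decode to roughly the following sizes (Badger keeps up to NumMemtables=5 memtables, per the options dump below):

memtablesize   = 67108864000 bytes  ≈ 62.5 GiB per memtable
basetablesize  = 209715200 bytes    = 200 MiB
baselevelsize  = 10485760 bytes     = 10 MiB
valuethreshold = 1048576 bytes      = 1 MiB
blocksize      = 4960 bytes         ≈ 4.8 KiB
blockcachesize = 6979321850 bytes   ≈ 6.5 GiB
indexcachesize = 3758096380 bytes   ≈ 3.5 GiB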

After that, all Alphas started to consume a lot of memory one by one (approximately 100 GB of RAM, as reported by docker stats) and then failed with another error, but unfortunately I lost the logs.

It was something like: slice bounds out of range

The next day I tried to reproduce it, but the Alpha started to run OOM again even though docker stats showed the Alpha consuming at most about 10 GB of RAM.
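To rule out a container-level cap, I believe the memory limit can be checked with something like this (the container name is a placeholder; 0 means no limit is set):

docker inspect -f '{{.HostConfig.Memory}}' alpha1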

I1005 12:24:14.720143      24 log.go:295] Found file: 113 First Index: 2498793
I1005 12:24:14.720194      24 log.go:295] Found file: 114 First Index: 2525100
I1005 12:24:14.720234      24 log.go:295] Found file: 115 First Index: 2544449
I1005 12:24:14.720280      24 log.go:295] Found file: 116 First Index: 2563455
I1005 12:24:14.720379      24 storage.go:125] Init Raft Storage with snap: 2501003, first: 2501004, last: 2582169
I1005 12:24:14.720412      24 server_state.go:140] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x33e3080 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864000 BaseTableSize:2097152000 BaseLevelSize:1048576000 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4960 BloomFalsePositive:0.01 BlockCacheSize:697932185000 IndexCacheSize:375809638000 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false NamespaceOffset:1 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0}
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x20f88bd, 0x16)
        /usr/local/go/src/runtime/panic.go:1117 +0x72
runtime.sysMap(0xd6a0000000, 0x1450000000, 0x33ea730)
        /usr/local/go/src/runtime/mem_linux.go:169 +0xc6
runtime.(*mheap).sysAlloc(0x31a7280, 0x1450000000, 0xa637d7, 0x31a7288)
        /usr/local/go/src/runtime/malloc.go:729 +0x1e5
runtime.(*mheap).grow(0x31a7280, 0xa28000, 0x0)
        /usr/local/go/src/runtime/mheap.go:1346 +0x85
runtime.(*mheap).allocSpan(0x31a7280, 0xa28000, 0x100, 0x7f6a92019558)
        /usr/local/go/src/runtime/mheap.go:1173 +0x609
runtime.(*mheap).alloc.func1()
        /usr/local/go/src/runtime/mheap.go:910 +0x59
runtime.systemstack(0x7f6a2c000020)
        /usr/local/go/src/runtime/asm_amd64.s:379 +0x66
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1246

I also tried to check the 'p' directory using dgraph debug, but ran into an error:

root@8ce6a4aee9af:/dgraph# dgraph debug --postings p
Opening DB: p
Listening for /debug HTTP requests at port: 8080
2021/10/05 12:34:01 Arena too small, toWrite:27 newTotal:87241488 limit:87241465
github.com/dgraph-io/badger/v3/y.AssertTruef
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/y/error.go:62
github.com/dgraph-io/badger/v3/skl.(*Arena).putNode
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/skl/arena.go:67
github.com/dgraph-io/badger/v3/skl.newNode
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/skl/skl.go:109
github.com/dgraph-io/badger/v3/skl.(*Skiplist).Put
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/skl/skl.go:306
github.com/dgraph-io/badger/v3.(*memTable).replayFunction.func1
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/memtable.go:258
github.com/dgraph-io/badger/v3.(*logFile).iterate
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/memtable.go:539
github.com/dgraph-io/badger/v3.(*memTable).UpdateSkipList
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/memtable.go:219
github.com/dgraph-io/badger/v3.(*DB).openMemTable
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/memtable.go:146
github.com/dgraph-io/badger/v3.(*DB).openMemTables
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/memtable.go:88
github.com/dgraph-io/badger/v3.Open
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/db.go:338
github.com/dgraph-io/badger/v3.OpenManaged
        /go/pkg/mod/github.com/dgraph-io/badger/v3@v3.0.0-20210405181011-d918b9904b2a/managed_db.go:26
github.com/dgraph-io/dgraph/dgraph/cmd/debug.run
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/debug/run.go:917
github.com/dgraph-io/dgraph/dgraph/cmd/debug.init.0.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/debug/run.go:83
github.com/spf13/cobra.(*Command).execute
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:78
main.main
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:99
runtime.main
        /usr/local/go/src/runtime/proc.go:225
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1371

Files in the p directory, sorted by size (largest first):

total 48G
-rw-r--r-- 1 root  root   3.7G Oct  4 18:18 00002.mem
-rw-r--r-- 1 root  root   2.0G Oct  4 17:56 002140.vlog
-rw-r--r-- 1 root  root   298M Oct  4 17:57 101352.sst
-rw-r--r-- 1 root  root   296M Oct  4 17:57 101393.sst
-rw-r--r-- 1 root  root   249M Oct  4 17:57 101353.sst
-rw-r--r-- 1 root  root   190M Oct  4 17:56 101287.sst
-rw-r--r-- 1 root  root   190M Oct  4 17:56 101294.sst
-rw-r--r-- 1 root  root   157M Oct  4 17:57 101371.sst
-rw-r--r-- 1 root  root   143M Oct  4 17:57 101378.sst
-rw-r--r-- 1 root  root   129M Oct  4 17:56 101288.sst
-rw-r--r-- 1 root  root   118M Oct  4 17:57 101416.sst
-rw-r--r-- 1 root  root   111M Oct  4 17:57 101415.sst
-rw-r--r-- 1 root  root   104M Oct  4 17:57 101417.sst
-rw-r--r-- 1 root  root   104M Oct  4 17:57 101397.sst
-rw-r--r-- 1 root  root   103M Oct  4 17:57 101336.sst
-rw-r--r-- 1 root  root   102M Oct  4 17:57 101391.sst
-rw-r--r-- 1 root  root   102M Oct  4 17:57 101421.sst
-rw-r--r-- 1 root  root    99M Oct  4 17:57 101420.sst
-rw-r--r-- 1 root  root    94M Oct  4 17:57 101402.sst
-rw-r--r-- 1 root  root    94M Oct  4 17:57 101401.sst
-rw-r--r-- 1 root  root    93M Oct  4 17:57 101400.sst
-rw-r--r-- 1 root  root    93M Oct  4 17:56 101296.sst
-rw-r--r-- 1 root  root    92M Oct  4 17:57 101389.sst
-rw-r--r-- 1 root  root    92M Oct  4 17:57 101390.sst
-rw-r--r-- 1 root  root    85M Oct  4 17:57 101399.sst
-rw-r--r-- 1 root  root    84M Oct  4 17:57 101351.sst
-rw-r--r-- 1 root  root    83M Oct  4 17:57 101383.sst
-rw-r--r-- 1 root  root    82M Oct  4 17:56 101298.sst
-rw-r--r-- 1 root  root    82M Oct  4 17:57 101362.sst
-rw-r--r-- 1 root  root    78M Oct  4 17:57 101343.sst
-rw-r--r-- 1 root  root    76M Oct  4 17:57 101382.sst
-rw-r--r-- 1 root  root    76M Oct  4 17:57 101396.sst
-rw-r--r-- 1 root  root    72M Oct  4 17:56 101324.sst
-rw-r--r-- 1 root  root    71M Oct  4 17:57 101341.sst
-rw-r--r-- 1 root  root    71M Oct  4 17:57 101385.sst
-rw-r--r-- 1 root  root    70M Oct  4 17:57 101414.sst
-rw-r--r-- 1 root  root    67M Oct  4 17:57 101388.sst
-rw-r--r-- 1 root  root    67M Oct  4 17:56 101300.sst
-rw-r--r-- 1 root  root    64M Oct  4 17:57 101413.sst
-rw-r--r-- 1 root  root    63M Oct  4 17:57 101395.sst
-rw-r--r-- 1 root  root    59M Oct  4 17:56 101310.sst
-rw-r--r-- 1 root  root    58M Oct  4 17:57 101418.sst
....

I tried to take a memory profile but didn't understand how to retrieve it after execution. Where is it saved?

Can someone please help me figure out what's going on?
Thanks

As mentioned here, please collect the profiles.
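For example, assuming the Alpha's default HTTP port 8080 is reachable (the hostname below is a placeholder), a heap profile can be collected roughly like this:

# dump the heap profile from the Alpha's pprof endpoint
curl http://localhost:8080/debug/pprof/heap > alpha.heap.pprof

# or fetch it interactively; go tool pprof saves a copy of the downloaded profile under $HOME/pprof/
go tool pprof http://localhost:8080/debug/pprof/heap

When fetched through go tool pprof, the exact path of the saved profile is printed when the tool starts.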