Bulk loader still OOM during reduce phase

What I want to do

Import an existing data set that consists of around 350M RDF quads. There are a lot of uids but I don’t have precise statistics on those.

There are also a large number of unique predicates… I don’t have an exact count, but when I tried to create a schema file that referenced them all (including types), it was about 30GiB uncompressed.

What I did

dgraph bulk --schema bulk-atl/out.schema --files bulk-atl --num_go_routines 1 --mapoutput_mb 1024 --zero dgraph-zero-2.dgraph-zero:5080 --tmp tmp --partition_mb 2

and various alternatives of the above with different sizes for the map files (down to 64MB) and the default --partition_mb value.

I have found earlier reports that were addressed in the latest builds with the addition of jemalloc, but my build includes that and I have failed to import nonetheless.

It is probably important to note that I am running in a Kubernetes cluster, monopolizing progressively larger nodes, to the point where my most recent tests have been running on VMs with 256GiB RAM; however, this doesn’t seem to matter, as the process dies with

runtime: out of memory: cannot allocate 4194304-byte block (83409764352 in use)

when usage is at around 100GiB.

I understand that this may be a systems-management issue, but while I pursue that possibility, I wanted to get some comments on the feasibility and resource requirements of importing a data set like mine.

I have not yet tried to increase the number of map or reduce shards as suggested in other discussions, since the help docs state that doing so can only increase memory usage, but I will try that while waiting on a response, for completeness.
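For reference, a sketch of the kind of invocation I would try next (the shard counts below are arbitrary, chosen only for illustration):

dgraph bulk --schema bulk-atl/out.schema --files bulk-atl --map_shards 4 --reduce_shards 2 --zero dgraph-zero-2.dgraph-zero:5080 --tmp tmp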

Thanks for any help.

Dgraph metadata

dgraph bulk identifying logs:

Dgraph version   : v21.03.1
Dgraph codename  : rocket-1
Dgraph SHA-256   : a00b73d583a720aa787171e43b4cb4dbbf75b38e522f66c9943ab2f0263007fe
Commit SHA-1     : ea1cb5f35
Commit timestamp : 2021-06-17 20:38:11 +0530
Branch           : HEAD
Go version       : go1.16.2
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph, visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud, visit https://dgraph.io/cloud.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2021 Dgraph Labs, Inc.

___ Begin jemalloc statistics ___
Version: "5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756"
Build-time option settings
config.cache_oblivious: true
config.debug: false
config.fill: true
config.lazy_lock: false
config.malloc_conf: "background_thread:true,metadata_thp:auto"
config.opt_safety_checks: false
config.prof: true
config.prof_libgcc: true
config.prof_libunwind: false
config.stats: true
config.utrace: false
config.xmalloc: false
Run-time option settings
opt.abort: false
opt.abort_conf: false
opt.confirm_conf: false
opt.retain: true
opt.dss: "secondary"
opt.narenas: 128
opt.percpu_arena: "disabled"
opt.oversize_threshold: 8388608
opt.metadata_thp: "auto"
opt.background_thread: true (background_thread: true)
opt.dirty_decay_ms: 10000 (arenas.dirty_decay_ms: 10000)
opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
opt.lg_extent_max_active_fit: 6
opt.junk: "false"
opt.zero: false
opt.tcache: true
opt.lg_tcache_max: 15
opt.thp: "default"
opt.prof: false
opt.prof_prefix: "jeprof"
opt.prof_active: true (prof.active: false)
opt.prof_thread_active_init: true (prof.thread_active_init: false)
opt.lg_prof_sample: 19 (prof.lg_sample: 0)
opt.prof_accum: false
opt.lg_prof_interval: -1
opt.prof_gdump: false
opt.prof_final: false
opt.prof_leak: false
opt.stats_print: false
opt.stats_print_opts: ""
Profiling settings
prof.thread_active_init: false
prof.active: false
prof.gdump: false
prof.interval: 0
prof.lg_sample: 0
Arenas: 129
Quantum size: 16
Page size: 4096
Maximum thread-cached size class: 32768
Number of bin size classes: 36
Number of thread-cache bin size classes: 41
Number of large size classes: 196
Allocated: 4314440, active: 4362240, metadata: 11125960 (n_thp 0), resident: 15384576, mapped: 23236608, retained: 5599232
Background threads: 4, num_runs: 8, run_interval: 684055750 ns


Sorry to hear you are having this trouble - just wanted to give my bulkload experience here:

  • ~3 billion triples
  • 4 input rdf files
    • 5-10GiB gzip compressed each
    • pulled from minio
  • 4 output groups
  • 16-core, 32GiB machine that the bulk loader was run on; it never got above 25GiB of memory.
  • finishes in about 1h20m
  • v21.03.1
  • I think default flags other than the output groups

Do you really mean 30Gb just in the schema?? That doesn’t seem right. That would make it seem like every triple is a unique predicate, which just seems REALLY high for the number of predicates and for the size of the schema alone. We work with what I think is a really large schema, but it is only 145Kb at 4,298 lines in GraphQL format, including the auth rules. I can’t imagine a 30Gb schema file.

Yes… I mean 30Gb.

For background, I am exporting a dataset from mongo, and for simplicity, rather than chopping up all sub-documents into their own node/dgraph type, we have decided to perform our initial import and analysis by simply flattening the documents into nodes.

i.e. a source document like

{
   "this": {"prop": 1, "attribute": 2},
   "is":  {"prop": 1, "attribute": 2},
   "a":  {"prop": 1, "attribute": 2},
   "doc":  {"prop": 1, "attribute": 2}
}

Would become a node with the predicates:

  • this.prop
  • this.attribute
  • is.prop
  • is.attribute
  • a.prop
  • a.attribute
  • doc.prop
  • doc.attribute

If we combine this with the fact that arrays in documents are also flattened into the node, the number of predicates really explodes (each type has its own set of predicates, with only a handful being shared across types).
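For illustration, the flattened output for the example document above comes out as something like the following N-Quads (the blank-node label _:doc1 and the literal values are made up for this sketch):

_:doc1 <this.prop> "1" .
_:doc1 <this.attribute> "2" .
_:doc1 <is.prop> "1" .
_:doc1 <is.attribute> "2" .
_:doc1 <a.prop> "1" .
_:doc1 <a.attribute> "2" .
_:doc1 <doc.prop> "1" .
_:doc1 <doc.attribute> "2" .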

The reason my schema file doesn’t include all of these is that the data structures keeping track of them in my mongo export logic were causing my workstation to start swapping heavily and eventually run out of virtual memory at around 80GiB, so I decided to just punt and add to the schema manually as our evaluation dictates.

Anyway, I now realize that this approach was excessively naive and will definitely need to be reconsidered, but I was hoping the dataset would still be useful for analyzing the performance/functionality of graph lookups where they do happen (we do have foreign-key-style references in the mongo docs, and those are what will convert to node references in Dgraph).

If I get feedback that Dgraph will struggle with disproportionately large numbers of predicates… especially in the bulk-load situation, I can work on restructuring the input data to be more usable.

Impressive - I read that as 30GB of RDF, not schema. For comparison, on our production system (with the 3bn triples) we currently have 8,422 predicates (and growing), and see no issues in sight on that front.


Maybe @MichelDiz has experience with such a large schema? I don’t think there is a hard limit on predicate count or schema size, but just processing a schema that would be over 30Gb is :exploding_head:

I still can’t fathom how many predicates that is. Our whole dataset only uses 0.5Gb.


To be clear, and in case it matters, the schema definition I am actually using is not that large.
The one I am using omits most predicates and just includes a small set (10) assigned to about 57 different types and is only 6.9k (562 lines).
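To give a sense of the shape, a schema of that kind looks roughly like this (the type and predicate names below are invented for illustration; they are not my actual schema):

type Order {
  status
  created_at
}

status: string @index(exact) .
created_at: dateTime .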

However, those predicates omitted from the schema are still being used on nodes/rdf specs.

When I was experimenting manually with a small data set, I was able to populate nodes like this; the data just wasn’t indexed, and I had to explicitly indicate the predicates I wanted to see in queries (expand(_all_) doesn’t work… which is fine for me).
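In other words, queries end up listing predicates by hand, roughly like this hypothetical example (the root filter and predicate names are just placeholders):

{
  docs(func: has(this.prop)) {
    this.prop
    this.attribute
    is.prop
  }
}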

The reason I mentioned the size of the schema that tried to include everything was just to give a general idea of the number of different predicates.

Also I might mention that there may have been a defect in my export code that wrote predicates (and the types that they were referenced by) twice, but even so, that leaves me with a schema of about 15GB.


Right, understood that part. But a schema file is still built containing every predicate with either default or first found type. I assume this is part of the bulk loader process.

OK. Thanks.

Nope, I don’t, and I’m as impressed as you guys. I can’t imagine such a big schema lol.

So, you say here that you have 256GiB of RAM available for the instance that is running the bulk load? So it uses less than 50% of the available RAM and runs OOM?

Can you share via DM the data if reproducible? I may send this to an engineer.

Pretty much… the machine/instance/VM has 256GiB ram, though my k8s configs only grant 210GiB ram to the container.

Still, the process only gets up to around 100GiB of usage before it reports out of memory and dumps a go stack trace.

When k8s kills the process due to it stepping out of bounds, it more or less just goes away.

I provided sample data via DM… and I’ll mention that I am working on making the number of predicates more efficient by leveraging facets rather than mangling predicate names.

Thanks for sharing the dataset over DM. Your dataset has about 109 million unique predicate names in the schema. That’s way too much and explains the OOM that’s happening here.

If you can store the "unique" part of the edge as a value (e.g., <0x1> <id> "abc-123-567" .) instead of as part of the predicate name, then that’d greatly reduce the schema entries here.


Here’s a heap profile of the bulk loader with your data set during the map phase:

File: dgraph
Build ID: 6fee491f94a8999d0e85fe9ea8ccb3321f7ce0d0
Type: inuse_space
Time: Jul 23, 2021 at 2:07pm (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 44.24GB, 99.88% of 44.30GB total
Dropped 73 nodes (cum <= 0.22GB)
Showing top 10 nodes out of 12
      flat  flat%   sum%        cum   cum%
      19GB 42.88% 42.88%       19GB 42.88%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*schemaStore).validateType
   14.15GB 31.95% 74.83%    14.15GB 31.95%  github.com/dgraph-io/dgraph/x.NamespaceAttr
   10.83GB 24.45% 99.28%    10.83GB 24.45%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shardMap).shardFor
    0.27GB   0.6% 99.88%     0.27GB   0.6%  bytes.makeSlice
         0     0% 99.88%     0.27GB   0.6%  bytes.(*Buffer).Write
         0     0% 99.88%     0.27GB   0.6%  bytes.(*Buffer).grow
         0     0% 99.88%     0.27GB   0.6%  github.com/dgraph-io/dgraph/chunker.(*rdfChunker).Chunk
         0     0% 99.88%    43.99GB 99.31%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).mapStage.func1
         0     0% 99.88%     0.27GB   0.6%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).mapStage.func2
         0     0% 99.88%       19GB 42.89%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapper).createPostings

Most of the memory usage is due to the large number of predicates in the schema.

Thank you so much for the time you have taken to look in to this @dmai .

I was aware that the number of predicates used was really poorly planned, and I have been looking into addressing that; but aside from using way too much memory, I was wondering why the program was quitting for memory reasons when the system had plenty of memory available.

The heap profile shown looks to be sampled during the map phase, which I was able to get past just by increasing resources; but once the bulk loader progressed to the reduce phase, it would panic before available memory was actually consumed (at around 100GiB of usage when 210GiB were available).

My import data has been updated to track variable/high cardinality components as facets, and this was very effective in getting memory usage down during the map phase. Usage hovered at around 1GiB for the entirety of that process.
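To make that concrete, the change is roughly the following (the predicate and facet names here are made up; this is not necessarily the exact transformation I applied). Instead of baking the varying key into the predicate name,

_:doc1 <price.abc123> "10" .
_:doc2 <price.def456> "12" .

the key rides along as a facet on a single predicate:

_:doc1 <price> "10" (key="abc123") .
_:doc2 <price> "12" (key="def456") .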

However, when transitioning to the reduce phase, memory consumption increased fairly quickly and plateaued at around 35GiB… granted, it got much further than before (89% vs 45%), but it eventually also panicked with out-of-memory messages even though the process had 180GiB more RAM available.

Could this be due to using facets instead of restructuring the data more extensively with additional nodes and properties? I would like to be able to compare both approaches in other areas of usage.

Interestingly, some of the final bits of output read as follows:

Finishing stream id: 70876
Finishing stream id: 70877
Finishing stream id: 70878
[16:47:02Z] REDUCE 01h29m57s 88.42% edge_count:328.5M edge_speed:271.7k/sec plist_count:175.1M plist_speed:144.8k/sec. Num Encoding MBs: 0. jemalloc: 2.6 TiB
Finishing stream id: 70879

Which I find interesting, since I don’t know that the system could even provide virtual memory in that amount.

If this is interesting to you, I captured some heap profiles from pprof which I can pass along, and I can also provide updated data.

Thanks again,
Steven

@sbv1976 Can you share the updated dataset? It sounds like there are still a lot of predicates.

Or you can share the heap profile and jemalloc debug info:

  • heap profile from /debug/pprof/heap
  • jemalloc info from /debug/jemalloc

Thanks for the response…

I did a more careful analysis of the data set, and I am down to about 202k predicates right now. 134k of those are high-cardinality use cases with unique IDs in the predicate name.

Let me try to address that before I take any more of your time.

Steven


Yup, you’ll want to reduce the number of predicates in your Dgraph setup down from 202k.

What do you think is an acceptable soft limit on predicate unique count? 100K, 50K, 10K, 1K, 500??

Around 100K would be a lot but also OK.

If you have a lot of the same kind of relationship then to take advantage of Dgraph you don’t want to encode unique information in the predicate name itself. As an example, don’t do this:

<0x1> <To.ID.123> <0x2> .
<0x3> <To.ID.456> <0x4> .
<0x5> <To.ID.789> <0x6> .

It’s better to store this unique ID information (ID.123, ID.456, ID.789) as a value of the edge instead of in the edge name itself:

<0x1> <ID> "123" .
<0x1> <To> <0x2> .
<0x3> <ID> "456" .
<0x3> <To> <0x4> .
<0x5> <ID> "789" .
<0x5> <To> <0x6> .

Instead of thousands of <To.ID.XYZ> predicates there are two: <ID> and <To>.
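As a rough follow-up sketch (the index choice here is just an example, not part of the advice above): with ID: string @index(exact) . in the schema, a DQL query can then find the edge by value instead of by a per-ID predicate name:

{
  byId(func: eq(ID, "456")) {
    uid
    To {
      uid
    }
  }
}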


Yeah, that structure was a result of trying to import a document database (mongo) with one (flattened) document per node.

I managed to get the predicate count down to around 60k, but it seems there are quite a few places where we were using dictionaries, so some of this pattern is left… since it’s getting harder and harder to identify those cases reliably, and since this is still an evaluation, I left those as-is.

Unfortunately, I am now running in to new sorts of issues caused by the choice of data transformation.

I won’t detail them as I have pretty much decided to abandon the flattened-document approach and am going to start working on a more node-intensive solution.

However, I was wondering if there was any literature on how to best organize data to maximize the performance of queries that contain many conditions.