What I want to do
Import an existing data set that consists of around 350M RDF quads. There are a lot of uids, but I don’t have precise statistics on those.
There is also a large number of unique predicates. I don’t have an exact count, but when I tried to create a schema file that referenced them all (including types), it was about 30GiB uncompressed.
What I did
dgraph bulk --schema bulk-atl/out.schema --files bulk-atl --num_go_routines 1 --mapoutput_mb 1024 --zero dgraph-zero-2.dgraph-zero:5080 --tmp tmp --partition_mb 2
I also tried various alternatives of the above, with different sizes for the map files (down to 64MB) and with the default --partition_mb value.
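To make the variants easier to compare, here is a minimal sketch of how the runs could be generated. The flags and paths are the ones from the command above; the intermediate map sizes between 1024 and 64 are an assumption (simple halving), and the helper name `bulk_command` is purely illustrative.

```python
import shlex

# Flags and paths copied from the command above; only --mapoutput_mb
# and --partition_mb change between runs.
BASE_ARGS = [
    "dgraph", "bulk",
    "--schema", "bulk-atl/out.schema",
    "--files", "bulk-atl",
    "--num_go_routines", "1",
    "--zero", "dgraph-zero-2.dgraph-zero:5080",
    "--tmp", "tmp",
]


def bulk_command(mapoutput_mb, partition_mb=None):
    """Return the argv list for one run (hypothetical helper).

    partition_mb=None omits the flag so the tool's default applies.
    """
    args = BASE_ARGS + ["--mapoutput_mb", str(mapoutput_mb)]
    if partition_mb is not None:
        args += ["--partition_mb", str(partition_mb)]
    return args


# Assumed sweep: halving steps between the two sizes mentioned (1024 and 64).
MAP_SIZES_MB = [1024, 512, 256, 128, 64]

commands = [bulk_command(size) for size in MAP_SIZES_MB]

for argv in commands:
    # Print each command line instead of executing it.
    print(shlex.join(argv))
```

This only prints the command lines; running them (for example with `subprocess.run`) is left out on purpose.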
I have found earlier reports of this problem that were addressed in the latest builds by the addition of jemalloc, but my build has that and the import has failed nonetheless.
It is worth noting that I am running in a Kubernetes cluster, monopolizing progressively larger nodes; my most recent tests ran on VMs with 256GiB RAM. However, this doesn’t seem to matter, as the process dies with
runtime: out of memory: cannot allocate 4194304-byte block (83409764352 in use)
when usage is at around 100GiB.
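As a quick sanity check on the numbers in that error, the sketch below converts the two byte counts into binary units. The regular expression is my own assumption about the message layout, based only on the single line above.

```python
import re

OOM_LINE = (
    "runtime: out of memory: cannot allocate 4194304-byte block "
    "(83409764352 in use)"
)

MIB = 2**20
GIB = 2**30


def parse_oom(line):
    """Extract (requested block size, bytes in use) from the OOM line."""
    match = re.search(r"cannot allocate (\d+)-byte block \((\d+) in use\)", line)
    if match is None:
        raise ValueError("line does not look like the expected OOM message")
    return int(match.group(1)), int(match.group(2))


block_bytes, in_use_bytes = parse_oom(OOM_LINE)
block_mib = block_bytes / MIB
in_use_gib = in_use_bytes / GIB

print(f"requested block: {block_mib:.0f} MiB")
print(f"reported in use: {in_use_gib:.1f} GiB")
```

So the failed request is a 4 MiB block, and the figure reported as in use is about 77.7 GiB, which is below both the roughly 100GiB I observed and the 256GiB on the VM.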
I understand that this may be a systems management issue, but while I pursue that potential problem, I wanted to get some comments on the feasibility and resource requirements of importing a data set like mine.
I have not yet tried to increase the number of map or reduce shards as suggested in other discussions, since the help docs state that doing so can only increase memory usage, but I will try that for completeness while waiting on a response.
Thanks for any help.
Dgraph metadata
dgraph bulk
identifying logs
Dgraph version : v21.03.1
Dgraph codename : rocket-1
Dgraph SHA-256 : a00b73d583a720aa787171e43b4cb4dbbf75b38e522f66c9943ab2f0263007fe
Commit SHA-1 : ea1cb5f35
Commit timestamp : 2021-06-17 20:38:11 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true
For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph , visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud , visit https://dgraph.io/cloud.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2021 Dgraph Labs, Inc.
___ Begin jemalloc statistics ___
Version: "5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756"
Build-time option settings
config.cache_oblivious: true
config.debug: false
config.fill: true
config.lazy_lock: false
config.malloc_conf: "background_thread:true,metadata_thp:auto"
config.opt_safety_checks: false
config.prof: true
config.prof_libgcc: true
config.prof_libunwind: false
config.stats: true
config.utrace: false
config.xmalloc: false
Run-time option settings
opt.abort: false
opt.abort_conf: false
opt.confirm_conf: false
opt.retain: true
opt.dss: "secondary"
opt.narenas: 128
opt.percpu_arena: “disabled”
opt.oversize_threshold: 8388608
opt.metadata_thp: "auto"
opt.background_thread: true (background_thread: true)
opt.dirty_decay_ms: 10000 (arenas.dirty_decay_ms: 10000)
opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
opt.lg_extent_max_active_fit: 6
opt.junk: "false"
opt.zero: false
opt.tcache: true
opt.lg_tcache_max: 15
opt.thp: "default"
opt.prof: false
opt.prof_prefix: "jeprof"
opt.prof_active: true (prof.active: false)
opt.prof_thread_active_init: true (prof.thread_active_init: false)
opt.lg_prof_sample: 19 (prof.lg_sample: 0)
opt.prof_accum: false
opt.lg_prof_interval: -1
opt.prof_gdump: false
opt.prof_final: false
opt.prof_leak: false
opt.stats_print: false
opt.stats_print_opts: ""
Profiling settings
prof.thread_active_init: false
prof.active: false
prof.gdump: false
prof.interval: 0
prof.lg_sample: 0
Arenas: 129
Quantum size: 16
Page size: 4096
Maximum thread-cached size class: 32768
Number of bin size classes: 36
Number of thread-cache bin size classes: 41
Number of large size classes: 196
Allocated: 4314440, active: 4362240, metadata: 11125960 (n_thp 0), resident: 15384576, mapped: 23236608, retained: 5599232
Background threads: 4, num_runs: 8, run_interval: 684055750 ns
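For reference, the summary line of the jemalloc statistics can be converted to MiB with a small sketch like the one below. The parsing pattern is an assumption based on that one line. The values are tiny, so this appears to be a snapshot printed alongside the version output rather than a measurement from the failing bulk run.

```python
import re

STATS_LINE = (
    "Allocated: 4314440, active: 4362240, metadata: 11125960 (n_thp 0), "
    "resident: 15384576, mapped: 23236608, retained: 5599232"
)

MIB = 2**20


def parse_stats(line):
    """Return {counter name: bytes} for every 'name: number' pair in the line."""
    return {
        name.lower(): int(value)
        for name, value in re.findall(r"(\w+): (\d+)", line)
    }


stats = parse_stats(STATS_LINE)

for name, value in stats.items():
    print(f"{name:>9}: {value / MIB:6.2f} MiB")
```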