Import an existing data set that consists of around 350M RDF quads. There are a lot of
uids but I don’t have precise statistics on those.
There are also a large number of unique predicates… I don’t have an exact count, but when I tried to create a schema file that referenced them all (including types), it was about 30GiB uncompressed.
dgraph bulk --schema bulk-atl/out.schema --files bulk-atl --num_go_routines 1 --mapoutput_mb 1024 --zero dgraph-zero-2.dgraph-zero:5080 --tmp tmp --partition_mb 2
and various alternatives of the above with different sizes for the map files (down to 64MB) and the default
I have found earlier reports that have been addressed in the latest builds with the addition of jemalloc, but my build has that and I have failed to import none-the-less.
It would be important to note that I running in a kubernetes cluster monopolizing progressively larger nodes to where my most recent tests have been running on vms with 256GiB RAM; however, this doesn’t seem to matter as the process dies with
runtime: out of memory: cannot allocate 4194304-byte block (83409764352 in use)
When usage is at around 100GiB
I understand that this may be systems management issue, but while I pursue that potential problem, I did want to get some comments on the feasibility and resource requirements of importing a data set like mine.
I have not yet tried to increase the number of map or reduce shards as suggested in other discussions since the help docs state that can only increase memory, but I will try that while waiting on a response for completeness.
Thanks for any help.
dgraph bulk identifying logs
graph version : v21.03.1 Dgraph codename : rocket-1 Dgraph SHA-256 : a00b73d583a720aa787171e43b4cb4dbbf75b38e522f66c9943ab2f0263007fe Commit SHA-1 : ea1cb5f35 Commit timestamp : 2021-06-17 20:38:11 +0530 Branch : HEAD Go version : go1.16.2 jemalloc enabled : true
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2021 Dgraph Labs, Inc.
___ Begin jemalloc statistics ___
Build-time option settings
Run-time option settings
opt.background_thread: true (background_thread: true)
opt.dirty_decay_ms: 10000 (arenas.dirty_decay_ms: 10000)
opt.muzzy_decay_ms: 0 (arenas.muzzy_decay_ms: 0)
opt.prof_active: true (prof.active: false)
opt.prof_thread_active_init: true (prof.thread_active_init: false)
opt.lg_prof_sample: 19 (prof.lg_sample: 0)
Quantum size: 16
Page size: 4096
Maximum thread-cached size class: 32768
Number of bin size classes: 36
Number of thread-cache bin size classes: 41
Number of large size classes: 196
Allocated: 4314440, active: 4362240, metadata: 11125960 (n_thp 0), resident: 15384576, mapped: 23236608, retained: 5599232
Background threads: 4, num_runs: 8, run_interval: 684055750 ns