Dgraph bulk load with much data

dgrdev · April 26, 2019, 1:19am

Hello everyone,

Yesterday I tried to import over 900 .rdf.gz files into Dgraph on a single Ubuntu node with 4 cores and 32 GB of memory.
The MAP phase completed, but the REDUCE phase failed with the error “too many open files”.
I checked the tmp directory: it is 280 GB in size and contains over 4000 .map files.

Could someone help me with this? How can I import these .rdf.gz files into Dgraph?

The detailed output is as follows:

REDUCE 04h15m55s [0.00%] edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec
2019/04/25 14:14:22 open tmp/shards/shard_0/000/001308.map: too many open files

github.com/dgraph-io/dgraph/x.Wrap
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:91
github.com/dgraph-io/dgraph/x.Check
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:41
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:80
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1333

MichelDiz · April 26, 2019, 2:12am

Over 900 RDF files?
Can you share the steps you followed?

dgrdev · April 26, 2019, 2:31am

Yes, there are over 900 files.
Steps:
dgraph zero
dgraph bulk -s test.schema -r ~/rdf_dir/

amanmangal · April 26, 2019, 2:43am

Have you tried raising the limit on the number of open files? On Linux the default per-process limit is typically 1024, which is well below the 4000+ .map files in your tmp directory. You can raise it to a much higher number, such as 100,000, with the ulimit command in the shell that launches the bulk loader.
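
A minimal sketch of that check, assuming a bash-like shell and the tmp/ layout shown in the error message above. The helper names `count_map_files` and `limit_too_low` are hypothetical, not part of Dgraph. The idea is to compare the number of .map files with the shell's current open-file limit before starting the loader.

```shell
# Hypothetical helper: count the .map files under a bulk loader tmp
# directory (defaults to ./tmp). Prints 0 if the directory does not exist.
count_map_files() {
  find "${1:-tmp}" -type f -name '*.map' 2>/dev/null | wc -l | tr -d '[:space:]'
}

# Hypothetical helper: succeed (exit 0) when the given file count exceeds
# the current soft limit on open file descriptors.
limit_too_low() {
  limit="$(ulimit -n)"
  [ "$limit" != "unlimited" ] && [ "$1" -gt "$limit" ]
}

# Typical use, in the same shell session that will launch the loader:
#   ulimit -n                       # show the current soft limit
#   limit_too_low "$(count_map_files tmp)" && ulimit -n 100000
#   dgraph bulk -s test.schema -r ~/rdf_dir/
```

`ulimit -n` only affects the current shell and the processes started from it, and a non-root user cannot raise the soft limit above the hard limit (`ulimit -Hn`). Raising the hard limit is system-specific and not covered here.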

dgrdev · April 26, 2019, 3:06am

Thank you.
I am trying it now.

dgrdev · April 26, 2019, 7:26am

Hi, the “too many open files” error is gone, but a new error appeared.
The tmp/ directory is 283 GB. How much memory is needed to load this much data?

REDUCE 05h11m45s [24.67%] edge_count:1.714G edge_speed:528.5k/sec plist_count:180.0M plist_speed:55.49k/sec
REDUCE 05h11m46s [24.67%] edge_count:1.714G edge_speed:528.4k/sec plist_count:180.0M plist_speed:55.47k/sec
REDUCE 05h11m47s [24.67%] edge_count:1.714G edge_speed:528.1k/sec plist_count:180.0M plist_speed:55.45k/sec
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x147d882, 0x16)
/usr/local/go/src/runtime/panic.go:608 +0x72
runtime.sysMap(0xc6fc000000, 0x4000000, 0x1fbe3b8)

amanmangal · April 26, 2019, 2:36pm

Generally, memory usage can be reduced by running fewer goroutines. By default, the number of goroutines equals the number of CPUs on the machine. You can lower it with the -j flag, for example “-j 2” or even “-j 1”. The load will take longer to complete, of course.
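
As a sketch of that suggestion, the command from earlier in the thread with an explicit goroutine count could look like the following. The helper `bulk_cmd` is hypothetical and only prints the command line. The schema file and RDF directory are the ones the original poster used.

```shell
# Hypothetical helper: print (not run) a bulk loader command line with an
# explicit goroutine count. Defaults to 1, the lowest setting.
bulk_cmd() {
  printf 'dgraph bulk -s %s -r %s -j %s\n' "$1" "$2" "${3:-1}"
}

# Number of CPUs, which the post above gives as the default for -j.
# nproc comes from GNU coreutils; getconf is a fallback for other systems.
cpus="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
echo "CPUs available: $cpus"

# On the 4-core machine from this thread, -j 2 halves the goroutine count.
bulk_cmd test.schema "$HOME/rdf_dir/" 2
```

Whether -j 2 or -j 1 is enough for 32 GB of memory is not something the thread settles, so treat the value as a starting point to experiment with.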

MichelDiz · April 26, 2019, 5:13pm

Hey dgrdev, could you provide your heap profile? https://docs.dgraph.io/howto/#profiling-information
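
A rough sketch of how a heap profile could be captured, under two assumptions that should be verified first: that the bulk loader serves Go's standard pprof endpoints over HTTP while it runs, and that it listens on localhost:8080 (check `dgraph bulk --help` for the actual address). The `/debug/pprof/heap` path is the standard one from Go's net/http/pprof package. The helper `heap_url` is hypothetical.

```shell
# Hypothetical helper: build the heap profile URL for a given host:port.
# The localhost:8080 default is an assumption; confirm it for your version.
heap_url() {
  printf 'http://%s/debug/pprof/heap\n' "${1:-localhost:8080}"
}

# While the REDUCE phase is running, from another terminal:
#   go tool pprof "$(heap_url)"            # interactive; needs a Go toolchain
#   curl -s -o heap.pb.gz "$(heap_url)"    # or save the profile to attach to a post
```

Capturing the profile shortly before the point where the process usually runs out of memory tends to be the most informative.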
