Dgraph bulk load with a large amount of data


#1

Hello everyone,

Yesterday I tried to import over 900 rdf.gz files into Dgraph on a single Ubuntu node with 4 cores and 32 GB of memory.
The MAP phase completed fine, but the REDUCE phase failed with the error “too many open files”.
I checked the tmp directory: it is 280 GB in size and contains over 4000 .map files.

Could someone help me with this? How can I import these .rdf.gz files into Dgraph?

Detailed info is as follows:

REDUCE 04h15m55s [0.00%] edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec
2019/04/25 14:14:22 open tmp/shards/shard_0/000/001308.map: too many open files

github.com/dgraph-io/dgraph/x.Wrap
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:91
github.com/dgraph-io/dgraph/x.Check
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:41
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:80
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1333


(Michel Conrado) #2

Over 900 RDF files?
Can you share the steps you did?


#3

Yes, there are over 900 files.
Steps:
dgraph zero
dgraph bulk -s test.schema -r ~/rdf_dir/


(Aman Mangal) #4

Have you tried increasing the limit on the number of open files? By default, Linux limits the number of open files per process to 1024. You can raise that to a higher number, such as 100,000, using the ulimit command.
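
As a rough sketch of that suggestion, the snippet below checks the current limits and sets the soft limit for the current shell session only. The variable names (soft, hard, target) are illustrative, and the target of 100000 is just the example figure mentioned above. The soft limit cannot exceed the hard limit, so the target is capped at it. Raising the hard limit itself usually needs root or a system configuration change, which is not covered here.

```shell
# Inspect the current soft and hard limits on open file descriptors.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft=$soft hard=$hard"

# Illustrative target; cap it at the hard limit if the hard limit is lower.
target=100000
if [ "$hard" != "unlimited" ] && [ "$hard" -lt "$target" ]; then
  target=$hard
fi

# Set the soft limit for this shell and any processes started from it.
ulimit -Sn "$target"
echo "new soft limit: $(ulimit -Sn)"
```

The change only applies to the shell where it is run, so dgraph bulk would have to be started from that same shell afterwards.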


#5

Thank you.
I am trying it now.


#6

Hi, the “too many open files” error disappeared, but a new error appeared.
The tmp/ directory is 283 GB. How much memory is enough for this amount of data?

REDUCE 05h11m45s [24.67%] edge_count:1.714G edge_speed:528.5k/sec plist_count:180.0M plist_speed:55.49k/sec
REDUCE 05h11m46s [24.67%] edge_count:1.714G edge_speed:528.4k/sec plist_count:180.0M plist_speed:55.47k/sec
REDUCE 05h11m47s [24.67%] edge_count:1.714G edge_speed:528.1k/sec plist_count:180.0M plist_speed:55.45k/sec
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x147d882, 0x16)
/usr/local/go/src/runtime/panic.go:608 +0x72
runtime.sysMap(0xc6fc000000, 0x4000000, 0x1fbe3b8)


(Aman Mangal) #7

Generally, memory usage can be reduced by using fewer goroutines. By default, the number of goroutines is set equal to the number of CPUs on the machine. You could reduce it with the -j flag, e.g. “-j 2” or even “-j 1”. This will take more time to complete, of course.
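
For illustration, here is the bulk command from post #3 with -j appended. This is only a sketch: JOBS and BULK_CMD are hypothetical variable names, the schema file and RDF directory are the ones quoted earlier in the thread, and the command is executed only if a dgraph binary is found on the PATH.

```shell
# Hypothetical variable for the number of goroutines; 2 (or 1) as suggested above.
JOBS=2

# Same invocation as in post #3, with -j added.
BULK_CMD="dgraph bulk -s test.schema -r $HOME/rdf_dir/ -j $JOBS"
echo "$BULK_CMD"

# Run the bulk loader only when dgraph is installed on this machine.
if command -v dgraph >/dev/null 2>&1; then
  $BULK_CMD
fi
```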


(Michel Conrado) #8

Hey Dgrdev, could you provide your heap profile? https://docs.dgraph.io/howto/#profiling-information