Errors when running dgraph bulk

When I import data using dgraph bulk, the following errors occur once the data in the TMP directory reaches about 2 TB. What is the problem?

MAP 04m04s rdf_count:35.61M rdf_speed:145.9k/sec edge_count:618.2M edge_speed:2.532M/sec
MAP 04m05s rdf_count:35.68M rdf_speed:145.5k/sec edge_count:619.6M edge_speed:2.527M/sec
MAP 04m06s rdf_count:35.86M rdf_speed:145.7k/sec edge_count:622.9M edge_speed:2.530M/sec
MAP 04m07s rdf_count:36.03M rdf_speed:145.8k/sec edge_count:626.2M edge_speed:2.533M/sec
...skipping...
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:44 +0xf1

goroutine 5275814 [chan send]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xca713ab980, 0x21, 0xc8962751a0)
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:95 +0x359
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:44 +0xf1

goroutine 5275815 [chan send]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xca713ab9e0, 0x21, 0xc896275200)
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:95 +0x359
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:44 +0xf1

rax    0x0
rbx    0x7f84a2b2b868
rcx    0xffffffffffffffff
rdx    0x6
rdi    0x44d5
rsi    0x44d6
rbp    0x145a8de
rsp    0x7f84a2762928
r8     0xa
r9     0x7f84a2763700
r10    0x8
r11    0x202
r12    0x7f82c80008c0
r13    0xf1
r14    0x11
r15    0x0
rip    0x7f84a279a277
rflags 0x202
cs     0x33
fs     0x0
gs     0x0

@mrjn @MichelDiz

Does anyone know what's going on?

Sorry for the delay. Please share more information: your specs, the settings used, the commands, etc.

How many shufflers are you using in the bulk load? Do you have enough memory?

Hard disk capacity: 12 TB
Memory: 64 GB
Data: 171 GB
Data example:

<EB855DE> <Name> "0991j@**.com" .
<EB855DE> <Email> "0991j@**.com" .
<EB855DE> <EmailPassword> "******" .
<EB85BEF> <Name> "0992j@**.com" .
<EB85BEF> <Email> "0992j@**.com" .
<EB85BEF> <EmailPassword> "******"  .
<EB85C07> <Name> "0993jb@**.com" .
<EB85C07> <Email> "0993jb@**.com" .
<EB85C07> <EmailPassword> "******"  .
<EB85C0E> <Name> "0994jb@**.com" .
<EB85C0E> <Email> "0994jb@**.com" .
<EB85C0E> <EmailPassword> "******"  .

schema

Name: string @index(term,fulltext,trigram) .
Email: string @index(term,fulltext,trigram) .
EmailPassword: string .

command

dgraph bulk -r all.rdf -s goldendata.schema  --http localhost:8000 --zero=localhost:5080

Thanks for the information. I'll try to investigate and notify an engineer about it. I can't promise a specific result on my part because I'm not familiar with the Dgraph core code, and I don't have access to your data to dig into (and it's pretty big).

PS: Also, please let me know how you are running Dgraph Zero.

I can see where the problem happens, but I can't tell whether it comes from Dgraph (which seems unlikely) or from a corrupted file (your RDF).

At first I thought it might be a capacity/specs problem. Dgraph needs free disk space of at least two and a half times the data size for temporary files (for your 171 GB of data, roughly 430 GB). Also, increasing the number of shufflers can lead to high memory usage and processing load. A configuration error there could corrupt the load (in theory).

Could you write a small Python script to split your RDF into smaller pieces? (I don't know if a tool for this is already available, but it's a tip.) That would make it easier to find out where the problem comes from, in case of corruption.
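
For example, something like this minimal Python sketch could work. It assumes one triple per line, as in your sample; the chunk size, output file names, and the simple "ends with a dot" sanity check are just placeholder choices on my part, not anything Dgraph-specific:

import sys

LINES_PER_CHUNK = 10_000_000  # roughly 10M triples per output file; adjust to taste

def split_rdf(path, lines_per_chunk=LINES_PER_CHUNK):
    written = 0      # triples written so far
    chunk_idx = 0    # current chunk file number
    out = None
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        for line_no, line in enumerate(src, start=1):
            stripped = line.strip()
            if not stripped:
                continue
            # Cheap sanity check: every N-Triples line should end with "."
            if not stripped.endswith("."):
                print("suspicious line %d: %r" % (line_no, stripped[:80]), file=sys.stderr)
            # Start a new chunk file every lines_per_chunk triples.
            if written % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk_idx += 1
                out = open("chunk_%04d.rdf" % chunk_idx, "w", encoding="utf-8")
            out.write(line)
            written += 1
    if out is not None:
        out.close()
    print("wrote %d triples into %d chunk file(s)" % (written, chunk_idx))

if __name__ == "__main__":
    split_rdf(sys.argv[1])

Saved as, say, split_rdf.py, you would run it as python3 split_rdf.py all.rdf and then load the resulting chunk files one at a time; any lines it flags on stderr are worth inspecting for corruption.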

And then you could do the bulk load in chunks as normal.

But I'll see who can shed some light on this.

Cheers.

I run dgraph zero just as the docs describe:

nohup dgraph zero > zero.log 2>&1 &

I’ve split the files into small pieces, and then imported them using dgraph live.

But it is too slow: importing 2 GB of data has taken 10 hours.

You can import small parts (file by file) via bulk load, as long as you never run the servers before completing all loads.
