Errors when running dgraph bulk

When I import data using dgraph bulk, the following errors occur once the data in the TMP directory reaches about 2 TB. What is the problem?

MAP 04m04s rdf_count:35.61M rdf_speed:145.9k/sec edge_count:618.2M edge_speed:2.532M/sec
MAP 04m05s rdf_count:35.68M rdf_speed:145.5k/sec edge_count:619.6M edge_speed:2.527M/sec
MAP 04m06s rdf_count:35.86M rdf_speed:145.7k/sec edge_count:622.9M edge_speed:2.530M/sec
MAP 04m07s rdf_count:36.03M rdf_speed:145.8k/sec edge_count:626.2M edge_speed:2.533M/sec
...skipping...
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:44 +0xf1

goroutine 5275814 [chan send]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xca713ab980, 0x21, 0xc8962751a0)
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:95 +0x359
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:44 +0xf1

goroutine 5275815 [chan send]:
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.readMapOutput(0xca713ab9e0, 0x21, 0xc896275200)
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:95 +0x359
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shuffler).run.func1
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/shuffle.go:44 +0xf1

rax    0x0
rbx    0x7f84a2b2b868
rcx    0xffffffffffffffff
rdx    0x6
rdi    0x44d5
rsi    0x44d6
rbp    0x145a8de
rsp    0x7f84a2762928
r8     0xa
r9     0x7f84a2763700
r10    0x8
r11    0x202
r12    0x7f82c80008c0
r13    0xf1
r14    0x11
r15    0x0
rip    0x7f84a279a277
rflags 0x202
cs     0x33
fs     0x0
gs     0x0

@mrjn @MichelDiz

Does anyone know what's going on?

Sorry for the delay. Please share more information: your specs, the settings used, the commands, etc.

How many shufflers are you using in the bulk load? Do you have enough memory?

Hard disk capacity: 12 TB
Memory: 64 GB
Data: 171 GB
Data example:

<EB855DE> <Name> "0991j@**.com" .
<EB855DE> <Email> "0991j@**.com" .
<EB855DE> <EmailPassword> "******" .
<EB85BEF> <Name> "0992j@**.com" .
<EB85BEF> <Email> "0992j@**.com" .
<EB85BEF> <EmailPassword> "******"  .
<EB85C07> <Name> "0993jb@**.com" .
<EB85C07> <Email> "0993jb@**.com" .
<EB85C07> <EmailPassword> "******"  .
<EB85C0E> <Name> "0994jb@**.com" .
<EB85C0E> <Email> "0994jb@**.com" .
<EB85C0E> <EmailPassword> "******"  .

schema

Name: string @index(term,fulltext,trigram) .
Email: string @index(term,fulltext,trigram) .
EmailPassword: string .

command

dgraph bulk -r all.rdf -s goldendata.schema  --http localhost:8000 --zero=localhost:5080

Thanks for the information. I'll try to investigate and notify an engineer about it. I can't promise a specific result on my part because I'm not familiar with the Dgraph core code, and I don't have access to your data to dig into (and it's pretty big).

PS: Also, please let me know how you are running Dgraph Zero.

I can see where the problem happens, but I can't tell whether it comes from Dgraph (which seems unlikely) or from a corrupted file (your RDF).

At first I thought it might be a capacity/specs problem. Dgraph needs free disk space of at least two and a half times the data size for temporary files (for your 171 GB of data, roughly 430 GB). Also, increasing the number of shufflers can lead to high memory usage and processing load. A configuration error there could corrupt the load (in theory).

Could you write a small Python script to split your RDF into smaller pieces? (I don't know if a tool for this is already available, but it's a tip.) That would make it easier to find out where the problem comes from, in case of corruption.
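
For example, something like this minimal Python sketch could work. It assumes one triple per line, as in your sample; the chunk size, output file names, and the simple "ends with a dot" sanity check are just placeholder choices on my part, not anything Dgraph-specific:

import sys

LINES_PER_CHUNK = 10_000_000  # roughly 10M triples per output file; adjust to taste

def split_rdf(path, lines_per_chunk=LINES_PER_CHUNK):
    written = 0      # triples written so far
    chunk_idx = 0    # current chunk file number
    out = None
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        for line_no, line in enumerate(src, start=1):
            stripped = line.strip()
            if not stripped:
                continue
            # Cheap sanity check: every N-Triples line should end with "."
            if not stripped.endswith("."):
                print("suspicious line %d: %r" % (line_no, stripped[:80]), file=sys.stderr)
            # Start a new chunk file every lines_per_chunk triples.
            if written % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk_idx += 1
                out = open("chunk_%04d.rdf" % chunk_idx, "w", encoding="utf-8")
            out.write(line)
            written += 1
    if out is not None:
        out.close()
    print("wrote %d triples into %d chunk file(s)" % (written, chunk_idx))

if __name__ == "__main__":
    split_rdf(sys.argv[1])

Saved as, say, split_rdf.py, you would run it as python3 split_rdf.py all.rdf and then load the resulting chunk files one at a time; any lines it flags on stderr are worth inspecting for corruption.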

And then you could do the bulk load in chunks as normal.

But I'll see who can shed some light on this.

Cheers.

I run dgraph zero just as the docs describe:

nohup dgraph zero > zero.log 2>&1 &

I’ve split the files into small pieces, and then imported them using dgraph live.

But it is too slow: importing 2 GB of data has taken 10 hours.

You can import small parts (file by file) via bulk load, as long as you never run the servers before completing all loads.
