Hi,
I'm trying to use the bulk loader to load data from rdf.gz files.
I'm still developing the code that generates these files (including the schema), and I feed its output directly to the bulk loader.
Often the bulk loader fails, simply stating something like this:
[1] 3743675 killed dgraph bulk -f -s /home/.../Dgraph.schema
dgraph bulk -f -s /home/.../Dgraph.schema 77475.47s user 20635.37s system 2839% cpu 57:35.81 total
I'm guessing that in many of these cases the problem was related to memory consumption, but I don't know that for sure. How can I find out why the process was killed?
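One thing I can check on my side is whether the Linux kernel's out-of-memory killer ended the process. The shell's bare "killed" line means the process received SIGKILL, which is the signal the OOM killer sends. Below is a minimal sketch, assuming a Linux host where `dmesg` is readable (it may need root) and that the kernel log uses the usual "Killed process <pid> (<name>)" wording. `find_oom_kills` and `kernel_log_lines` are hypothetical helper names of my own, not part of Dgraph.

```python
import re
import subprocess

# Assumption: the kernel reports an OOM kill with text like
# "Killed process <pid> (<name>)"; the exact wording can vary by kernel version.
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines, name=None):
    """Return (pid, process_name, line) for each kill record found in log_lines.

    If name is given, only records for that process name are returned.
    """
    hits = []
    for line in log_lines:
        match = KILLED.search(line)
        if match and (name is None or match.group(2) == name):
            hits.append((int(match.group(1)), match.group(2), line.rstrip("\n")))
    return hits


def kernel_log_lines():
    """Read the kernel ring buffer via dmesg (assumes dmesg is installed and permitted)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout.splitlines()
```

Calling `find_oom_kills(kernel_log_lines(), name="dgraph")` right after a failed run should list any matching kill records. An empty result does not rule out memory as the cause, since the ring buffer may already have rotated.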
On my latest attempt the process stopped at an early stage, and this time there was a clear message about what the problem was:
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1374
while parsing line "_:Ecun_DtLXNTa8gD6FxzLyRg <name> \"Ra, Phil - S?gaard, Birgit\" (confidence=NaN) .\n"
github.com/dgraph-io/dgraph/chunker.(*rdfChunker).Parse
/ext-go/1/src/github.com/dgraph-io/dgraph/chunker/chunk.go:156
One triple had an edge attribute that was NaN, which couldn’t be parsed, and that killed the entire bulk loading process. Why? Can’t you just catch that error, skip that line and move on?
Ideally, at the end, I would get a list of all failures, logged to a separate file.
I need the bulk loader to be more tolerant and informative when things go wrong. Is there anything I can do, now, to make it so?
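As a stopgap I could pre-filter the files myself before handing them to the bulk loader. This is a minimal sketch, assuming each triple sits on one line, that the attributes appear as a parenthesised `key=value` list just before the closing ` .`, and that NaN is the only value I need to catch (it is the only one I have seen fail). `has_nan_facet` and `split_rdf` are hypothetical names, and the regular expressions are a rough heuristic, not a real RDF parser.

```python
import gzip
import re

# Assumption: the attribute list is the last "(...)" group before the final " ."
FACET_BLOCK = re.compile(r"\(([^()]*)\)\s*\.\s*$")
# An unquoted NaN used as a value, e.g. "confidence=NaN"
NAN_VALUE = re.compile(r"=\s*[+-]?nan\s*(,|$)", re.IGNORECASE)


def has_nan_facet(line):
    """True if the trailing attribute list of the triple contains a NaN value."""
    block = FACET_BLOCK.search(line)
    if block is None:
        return False
    return NAN_VALUE.search(block.group(1)) is not None


def split_rdf(src_path, clean_path, rejects_path):
    """Copy good lines to clean_path (gzip); log bad ones with line numbers."""
    kept = rejected = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(clean_path, "wt", encoding="utf-8") as clean, \
         open(rejects_path, "w", encoding="utf-8") as rejects:
        for lineno, line in enumerate(src, start=1):
            if has_nan_facet(line):
                # Tab-separated: source line number, then the offending triple
                rejects.write(f"{lineno}\t{line.rstrip(chr(10))}\n")
                rejected += 1
            else:
                clean.write(line)
                kept += 1
    return kept, rejected
```

Running `split_rdf` on each input file gives a cleaned rdf.gz for the loader plus a plain-text file of rejected lines, which is roughly the behaviour I am asking for. It only catches the one failure I know about, so other parse errors would still stop the load.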
/Anders
Dgraph version : v20.11.0
Dgraph codename : tchalla
Dgraph SHA-256 : 8acb886b24556691d7d74929817a4ac7d9db76bb8b77de00f44650931a16b6ac
Commit SHA-1 : c4245ad55
Commit timestamp : 2020-12-16 15:55:40 +0530
Branch : HEAD
Go version : go1.15.5
jemalloc enabled : true