I’m trying to use the bulk loader to load data from rdf.gz files.
I’m still developing the code that generates these files (including the schema), and I feed its output directly to the bulk loader.
Often the bulk loader fails, simply stating something like this:
 3743675 killed dgraph bulk -f -s /home/.../Dgraph.schema
dgraph bulk -f -s /home/.../Dgraph.schema 77475.47s user 20635.37s system 2839% cpu 57:35.81 total
I’m guessing that in many of these cases the cause was memory consumption, but I don’t know that for sure. How can I get information about why the process was killed?
On my last attempt the process stopped at an early stage, and this time there was a clear message about what the problem was:
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1374
while parsing line "_:Ecun_DtLXNTa8gD6FxzLyRg <name> \"Ra, Phil - S?gaard, Birgit\" (confidence=NaN) .\n"
github.com/dgraph-io/dgraph/chunker.(*rdfChunker).Parse
	/ext-go/1/src/github.com/dgraph-io/dgraph/chunker/chunk.go:156
One triple had an edge attribute that was NaN, which couldn’t be parsed, and that killed the entire bulk loading process. Why? Can’t you just catch that error, skip that line and move on?
Ideally, at the end, I would get a list of all failures logged to a separate file.
I need the bulk loader to be more tolerant and informative when things go wrong. Is there anything I can do, now, to make it so?
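As a stopgap, the skip-and-log behaviour described above could be approximated outside the loader with a pre-filter over the rdf.gz file. The following is only a minimal sketch, assuming the only problem is a NaN facet value in the trailing parentheses of a line, as in the failing triple above. The function names and file paths are hypothetical, and the regular expressions are a heuristic, not a full RDF parser.

```python
import gzip
import re

# Heuristic (assumption): facets are the last "(...)" group before the final ".".
FACETS_AT_END = re.compile(r"\(([^()]*)\)\s*\.\s*$")
# A facet value of NaN in any letter case, followed by "," or the end of the group.
NAN_VALUE = re.compile(r"=\s*nan\s*(,|$)", re.IGNORECASE)


def has_nan_facet(line: str) -> bool:
    """Return True if the line ends with a facet group containing key=NaN."""
    match = FACETS_AT_END.search(line)
    return bool(match and NAN_VALUE.search(match.group(1)))


def filter_rdf(src_path: str, clean_path: str, rejects_path: str):
    """Copy good lines to clean_path and log rejected lines to rejects_path.

    Returns a (kept, rejected) tuple of line counts.
    """
    kept = rejected = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(clean_path, "wt", encoding="utf-8") as clean, \
         open(rejects_path, "w", encoding="utf-8") as rejects:
        for number, line in enumerate(src, start=1):
            if has_nan_facet(line):
                # Record the original line number so the generator can be fixed.
                rejects.write(f"{number}\t{line.rstrip()}\n")
                rejected += 1
            else:
                clean.write(line)
                kept += 1
    return kept, rejected


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(filter_rdf("data.rdf.gz", "data.clean.rdf.gz", "rejected_lines.tsv"))
```

The cleaned file would then be passed to the bulk loader in place of the original, and the rejects file gives the list of skipped lines. This only catches the one failure mode I have seen so far, so it is no substitute for the loader reporting bad lines itself.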
Dgraph version   : v20.11.0
Dgraph codename  : tchalla
Dgraph SHA-256   : 8acb886b24556691d7d74929817a4ac7d9db76bb8b77de00f44650931a16b6ac
Commit SHA-1     : c4245ad55
Commit timestamp : 2020-12-16 15:55:40 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true