Schema and mutation RDF triples / bulk loader error handling

Hi,

Trying to use the bulk loader to load data from rdf.gz files.

I’m developing the file generation (including the schema) and then trying to use the output directly with the bulk loader.

Often the bulk loader fails, simply stating something like this:

[1]    3743675 killed     dgraph bulk -f  -s /home/.../Dgraph.schema
dgraph bulk -f  -s /home/.../Dgraph.schema   77475.47s user 20635.37s system 2839% cpu 57:35.81 total

I’m guessing the problem in many of these cases has been related to memory consumption, but I don’t know that. How can I get info about why the process was killed?
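
So far the only idea I have is to look for OOM-killer messages in the kernel log (assuming a Linux host; the exact wording varies by distro), but I’d prefer a way to get this from Dgraph itself:

dmesg -T | grep -i -e "out of memory" -e "oom"
journalctl -k | grep -i "killed process"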

With my last attempt the process stopped at an early stage, and this time there was a clear message about what the problem was.

runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1374
while parsing line "_:Ecun_DtLXNTa8gD6FxzLyRg <name> \"Ra, Phil - S?gaard, Birgit\" (confidence=NaN) .\n"
github.com/dgraph-io/dgraph/chunker.(*rdfChunker).Parse
        /ext-go/1/src/github.com/dgraph-io/dgraph/chunker/chunk.go:156

One triple had an edge attribute that was NaN, which couldn’t be parsed, and that killed the entire bulk loading process. Why? Can’t you just catch that error, skip that line and move on?

Ideally, at the end, I’d like a list of all failures logged to a separate file.
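
As a stopgap I could probably pre-scan the generated files for the offending value before loading (assuming a bare NaN never appears legitimately in my data; the path is a placeholder):

zgrep -Hn "NaN" /path/to/generated/*.rdf.gz > nan_lines.txt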

I need the bulk loader to be more tolerant and informative when things go wrong. Is there anything I can do, now, to make it so?

/Anders

Dgraph version   : v20.11.0
Dgraph codename  : tchalla
Dgraph SHA-256   : 8acb886b24556691d7d74929817a4ac7d9db76bb8b77de00f44650931a16b6ac
Commit SHA-1     : c4245ad55
Commit timestamp : 2020-12-16 15:55:40 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

You can use

dgraph bulk -h | grep ignore
      --ignore_errors                    ignore line parsing errors in rdf files
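
For example, adding the flag to an invocation like yours (paths are placeholders) should make the loader skip lines it can’t parse instead of aborting:

dgraph bulk -f /path/to/rdf -s /path/to/Dgraph.schema --ignore_errors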

I’m not sure, but I believe there is a page that shows some stats during the bulk load. I need to check this.

Also, in the version you are using, Dgraph prints the amount of RAM used.

If NaN is part of your string, you should escape it (using JSON escaping).
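
For illustration only (I haven’t verified this exact line against the loader), that would mean writing the facet value as a quoted string instead of a bare NaN literal, something like:

_:Ecun_DtLXNTa8gD6FxzLyRg <name> "Ra, Phil - S?gaard, Birgit" (confidence="NaN") .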

I’m gonna investigate the potential NaN-parsing bug.

Thanks, --ignore_errors should be helpful. Not sure why I missed that option. I did look for something like that.

I noticed the logs are different since I upgraded to the latest version. It would be great if it were possible to have some stats dumped on a crash.

The Bulk Loader docs discuss some performance-tuning options. I’d appreciate a more detailed discussion of what they do and how they’re related, as well as some quantification/estimates of how much memory would be required in various cases.

If I plan to load something like 8192 rdf.gz files totalling close to 2TB, what kind of hardware would you say I need to run the bulk loader?

I continue to have problems with this. If/when I run the bulk loader on a very small subset of our data, with the default settings, it works. The only problem then is that, extrapolating the runtime to the full data set, we end up with something like 8 days. That’s way too long for us! (And that’s assuming we can keep the same processing rate on a much larger data set.)

When I increase the data size and/or start experimenting with the various performance-tuning flags, I either see no difference or the process crashes. Before this particular crash I could see that all of the system memory (394 GB) was consumed by the bulk loader.

I need to stop these crashes AND get some sort of performance increase, and would very much appreciate some guidance.

This is the configuration of that last execution:

--ignore_errors --num_go_routines=24 --map_shards=4 --reduce_shards=2 --reducers=2

The machine I’m currently testing on has 48 “processors” (more precisely, 1 CPU with 24 cores and 48 threads) and 394 GB of RAM.

badger 2021/01/12 23:42:22 INFO: Compaction backed off 23000 times
badger 2021/01/12 23:42:23 INFO: Compaction backed off 23000 times
badger 2021/01/12 23:42:23 INFO: Compaction backed off 23000 times
[23:42:23-0800] MAP 01h05m14s nquad_count:3.180G err_count:0.000 nquad_speed:812.6k/sec edge_count:3.913G edge_speed:999.8k/sec jemalloc: 52 GiB 
[23:42:24-0800] MAP 01h05m15s nquad_count:3.181G err_count:0.000 nquad_speed:812.4k/sec edge_count:3.913G edge_speed:999.5k/sec jemalloc: 52 GiB 
badger 2021/01/12 23:42:25 INFO: Compaction backed off 22000 times
[23:42:25-0800] MAP 01h05m16s nquad_count:3.181G err_count:0.000 nquad_speed:812.2k/sec edge_count:3.913G edge_speed:999.3k/sec jemalloc: 52 GiB 
[23:42:26-0800] MAP 01h05m17s nquad_count:3.181G err_count:0.000 nquad_speed:812.0k/sec edge_count:3.913G edge_speed:999.0k/sec jemalloc: 52 GiB 
[23:42:27-0800] MAP 01h05m18s nquad_count:3.181G err_count:0.000 nquad_speed:811.8k/sec edge_count:3.913G edge_speed:998.8k/sec jemalloc: 52 GiB 
[23:42:28-0800] MAP 01h05m19s nquad_count:3.181G err_count:0.000 nquad_speed:811.6k/sec edge_count:3.913G edge_speed:998.6k/sec jemalloc: 52 GiB 
[23:42:29-0800] MAP 01h05m20s nquad_count:3.181G err_count:0.000 nquad_speed:811.3k/sec edge_count:3.913G edge_speed:998.2k/sec jemalloc: 52 GiB 
[23:42:30-0800] MAP 01h05m21s nquad_count:3.181G err_count:0.000 nquad_speed:811.2k/sec edge_count:3.913G edge_speed:998.1k/sec jemalloc: 52 GiB 
[23:42:31-0800] MAP 01h05m22s nquad_count:3.181G err_count:0.000 nquad_speed:811.0k/sec edge_count:3.913G edge_speed:997.8k/sec jemalloc: 52 GiB 
[23:42:32-0800] MAP 01h05m23s nquad_count:3.181G err_count:0.000 nquad_speed:810.8k/sec edge_count:3.914G edge_speed:997.6k/sec jemalloc: 52 GiB 
[23:42:33-0800] MAP 01h05m24s nquad_count:3.181G err_count:0.000 nquad_speed:810.6k/sec edge_count:3.914G edge_speed:997.3k/sec jemalloc: 52 GiB 
[23:42:34-0800] MAP 01h05m25s nquad_count:3.181G err_count:0.000 nquad_speed:810.4k/sec edge_count:3.914G edge_speed:997.1k/sec jemalloc: 52 GiB 
[1]    1208162 killed     dgraph bulk -f  -s  --format=rdf --xidmap xidmap --http localhost:8000      
dgraph bulk -f  -s  --format=rdf --xidmap xidmap --http localhost:8000        95693.23s user 12025.46s system 2722% cpu 1:05:57.24 total
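
For the next run I’m thinking of simply dialing the parallelism back down and keeping everything else the same (paths are placeholders; this is only a guess that fewer map goroutines means less data held in memory at once):

dgraph bulk -f /path/to/rdf -s /path/to/Dgraph.schema --format=rdf --xidmap xidmap --http localhost:8000 --ignore_errors --num_go_routines=8 --map_shards=4 --reduce_shards=2 --reducers=2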