Bulk loader aborts on single corrupted input file instead of continuing with other valid inputs

Moved from GitHub dgraph/3375

Posted by vipulmathur:

Filing this using the bug template, but this can be considered an enhancement as well.

The bulk loader crashes on encountering an error (unexpected EOF) in an input file. Much better behavior would be to log this as an error and continue with the other input files (there were about 4000 input .rdf.gz files in this case); the files that hit errors could then be live-loaded later. Instead, the bulk load stops (in this case after running for 12+ hours) without any usable output.

Hoping the current behavior of bulk load on input file errors can be changed from ‘abort bulk load’ to ‘note the error, continue with the other inputs’. This behavior could even be made configurable via a flag to the bulk loader if needed.

Also, note that in the snippet below the name of the errant input file is not printed. It should definitely be printed along with the error, since that would save time in identifying which of the (4000+ in this case) input files is corrupted.

MAP 12h00m11s nquad_count:4.700G err_count:0.000 nquad_speed:108.8k/sec edge_count:38.13G edge_speed:882.4k/sec
2019/05/04 05:10:18 unexpected EOF

github.com/dgraph-io/dgraph/x.Wrap
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:91
github.com/dgraph-io/dgraph/x.Check
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:41
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).mapStage.func2
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/loader.go:242
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1333
  • What version of Dgraph are you using?
$ dgraph version

Dgraph version   : v1.0.14
Commit SHA-1     : 26cb2f94
Commit timestamp : 2019-04-12 13:21:56 -0700
Branch           : HEAD
Go version       : go1.11.5
  • Have you tried reproducing the issue with the latest release?

    • Yes, using the latest release, v1.0.14.
  • What is the hardware spec (RAM, OS)?

    • AWS EC2 m5.metal instance
    • CPU: Intel® Xeon® Platinum 8175M, 2 sockets, 24 cores per socket, 2 threads per core (48 cores / 96 threads in total)
    • RAM: 384 GB
    • OS: Ubuntu 18.04.2 LTS
  • Steps to reproduce the issue (command/config used to run Dgraph).

    • Start the bulk loader with multiple input .rdf.gz files in a directory, where one of the input files is a corrupted (truncated) gzip archive (a sketch of such a setup follows this list).
  • Expected behaviour and actual result.

    • Expected: the bulk loader completes the load with all the valid input files and logs an error message for the corrupted input file (identifying the specific file by name). At the very least, it should be possible to resume the bulk load (after the corrupted input file has been fixed) from the point where it aborted.
    • Actual: the bulk loader aborts the load in the map phase itself, wasting both the time spent and the output data created before the corrupted input file was encountered.
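
A minimal sketch of such a setup, assuming two valid .rdf files and a schema file are at hand (all file names here are hypothetical, and flag names vary between Dgraph versions, so check dgraph bulk --help):

# Hypothetical inputs: two valid archives plus one truncated mid-stream.
mkdir -p data
gzip -c good1.rdf > data/good1.rdf.gz
gzip -c good2.rdf > data/good2.rdf.gz
# Truncating an archive makes gunzip fail with an unexpected EOF.
head -c 100 data/good2.rdf.gz > data/bad.rdf.gz

# The map phase aborts once the loader reaches the truncated file.
dgraph bulk -f data -s schema.txt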

codexnull commented:

Thanks for submitting this issue. I will look into implementing this feature this week.

mangalaman93 commented:

Any chance the ignore_errors flag might help here?
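
For reference, that is just one more flag on the same invocation (a sketch; as I understand it, ignore_errors targets line-level parsing errors in the RDF, so whether it also covers a truncated gzip stream is exactly the question here):

dgraph bulk --ignore_errors -f data -s schema.txt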

manishrjain commented:

We had talked about adding an input file verifier tool in Dgraph, so a user can catch these kinds of issues upfront and fix their files. I think it is time to build that tool.

vipulmathur commented:

FYI, the way I identified the broken input file was by running gzip -t on each of the inputs.
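
That check is easy to run across the whole input directory, e.g.:

# Test every archive's integrity; print the names of any that fail.
for f in *.rdf.gz; do
  gzip -t "$f" 2>/dev/null || echo "corrupt: $f"
done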

codexnull commented:

> FYI, the way I identified the broken input file was by running gzip -t on each of the inputs.

That’s a good idea. Until dgraph includes a verifier, it may also be a good idea to pipe JSON input through jq or a similar tool to verify there are no parsing errors. You could do gzip -dc | jq . >/dev/null to catch both a corrupted archive and malformed JSON at the same time.
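
Per input file, that could look like the following (a sketch, assuming gzipped JSON inputs):

# With pipefail, the pipeline fails if either gzip (corrupted archive)
# or jq (malformed JSON) exits non-zero.
set -o pipefail
for f in *.json.gz; do
  gzip -dc "$f" | jq . >/dev/null || echo "failed: $f"
done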

> We had talked about adding an input file verifier tool in Dgraph

If you’re talking about adding a dgraph verify command, I think that’s the wrong way to go. The user is very likely not to know of it or to forget about it. A better approach would be to do the verification as part of the load process itself, with an explicit option to skip it (similar to --skip_map_phase) if it’s already been done once.

campoy commented:

This issue is related to #3984, although not the same.

While #3984 is about validating in advance, this one is about not failing on an error, just logging it.
It would make sense to have such a flag.

MichelDiz commented:

Hey @vipulmathur, sorry to ping you so late, but I need to know whether you used ignore_errors in your case, as Aman mentioned, so we can identify a possible issue with this flag.