Bulk loading and live loading can take quite a while for large datasets, and it's very frustrating when the process crashes after hours of running because of a single badly formatted entry.
What you actually did
I had to fix the issue and rerun the whole process from scratch.
Why that wasn’t great, with examples
It’s a waste of time and resources.
Instead, I would have expected the input to be validated before the process started so any errors would be detected early on.
I ran into similar issues when I first started using Dgraph, and built a basic validator leveraging the bulk loader codebase with support for:
RDF validation
Schema validation
Empty or corrupt gzip file validation
Any of these three conditions can terminate the bulk load process, and if you're unlucky, several hours into the map phase. Fortunately, a basic validator can be built on the existing codebase with little effort by adding a --dryrun (or similarly named) CLI flag to the bulk loader; a rough sketch of what such a pre-flight check could look like is below.
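To illustrate the idea (this is not Dgraph's bulk loader code or the validator mentioned above, just a standard-library sketch with hypothetical names like `validateRDFGzip`), a dry-run pass could cheaply catch empty or corrupt gzip files and obviously malformed N-Quad lines before the map phase starts:

```go
// validate.go — illustrative pre-flight check for a hypothetical --dryrun flag.
// Uses only the Go standard library; real RDF/schema validation would reuse
// the bulk loader's own chunker and parser.
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
	"strings"
)

// validateRDFGzip applies two cheap checks to a gzipped N-Quads file:
// the gzip stream must be readable, and every non-blank, non-comment line
// must end with the "." terminator that N-Quads requires. This is far from
// full RDF parsing, but it catches the kind of formatting errors that would
// otherwise crash a load hours in.
func validateRDFGzip(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("%s: %w", path, err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return fmt.Errorf("%s: not a valid gzip file: %w", path, err)
	}
	defer gz.Close()

	sc := bufio.NewScanner(gz)
	sc.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // allow long lines
	lineNo := 0
	for sc.Scan() {
		lineNo++
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank lines and comments are fine
		}
		if !strings.HasSuffix(line, ".") {
			return fmt.Errorf("%s:%d: line does not end with '.'", path, lineNo)
		}
	}
	if err := sc.Err(); err != nil {
		return fmt.Errorf("%s: read error (possibly truncated gzip): %w", path, err)
	}
	if lineNo == 0 {
		return fmt.Errorf("%s: file is empty", path)
	}
	return nil
}

func main() {
	failed := false
	for _, path := range os.Args[1:] {
		if err := validateRDFGzip(path); err != nil {
			fmt.Fprintln(os.Stderr, "invalid:", err)
			failed = true
			continue
		}
		fmt.Println("ok:", path)
	}
	if failed {
		os.Exit(1)
	}
}
```

Run against the same .rdf.gz files you would pass to the bulk loader (e.g. `go run validate.go data/*.rdf.gz`); it exits non-zero on the first problem in each file, so errors surface in seconds rather than hours.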