Add a validator for the bulk and live loaders

Moved from GitHub dgraph/3984

Posted by campoy:

Experience Report

What you wanted to do

I wanted to load a big dataset using the bulk loader.

Bulk loading and live loading can take quite a while for large datasets. It’s very annoying when a single badly formatted entry crashes the process after hours of running.

What you actually did

I had to fix the issue and rerun the whole process from scratch.

Why that wasn’t great, with examples

It’s a waste of time and resources.
Instead, I would have expected the input to be validated before the process started so any errors would be detected early on.

Any external references to support your case

n3integration commented:

I ran into similar issues when first starting out using Dgraph and built a basic validator leveraging the bulk loader codebase with support for:

  1. RDF validation
  2. Schema validation
  3. Empty or corrupt gzip file validation

Any of these three conditions can terminate the bulk load process and, if you’re unlucky, it happens several hours into the map phase. Fortunately, a basic validator can be built on the existing codebase with little effort and exposed through a --dryrun or similar CLI flag on the bulk loader.
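
To illustrate the gzip and RDF checks listed above, here is a minimal Go sketch of a dry-run pass. It uses only the standard library and does not call Dgraph's parser or the bulk loader code. The names dryRun, dryRunBytes and gzipOf are hypothetical, and the per-line test is a deliberately crude stand-in for real N-Quad parsing (at least three terms followed by a terminating "."). Schema validation is not covered.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// dryRun streams a gzip-compressed RDF input to the end and collects problems
// instead of stopping a long load. Hypothetical helper, not part of Dgraph.
func dryRun(r io.Reader) []string {
	// An empty or non-gzip input already fails while reading the header.
	zr, err := gzip.NewReader(r)
	if err != nil {
		return []string{fmt.Sprintf("gzip header: %v", err)}
	}
	defer zr.Close()

	var problems []string
	sc := bufio.NewScanner(zr)
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and comments
		}
		// Crude shape check (an assumption, not a real N-Quad parser):
		// the line must end with "." and have at least three terms before it.
		terms := strings.Fields(strings.TrimSuffix(line, "."))
		if !strings.HasSuffix(line, ".") || len(terms) < 3 {
			problems = append(problems, fmt.Sprintf("line %d: malformed entry %q", n, line))
		}
	}
	// A truncated or corrupt gzip body shows up here as a read error.
	// Lines longer than the scanner's default buffer are also reported here.
	if err := sc.Err(); err != nil {
		problems = append(problems, fmt.Sprintf("read error: %v", err))
	}
	return problems
}

// dryRunBytes is a convenience wrapper for in-memory data.
func dryRunBytes(data []byte) []string {
	return dryRun(bytes.NewReader(data))
}

// gzipOf compresses a string in memory so the demo needs no files on disk.
// Errors are ignored because writes to a bytes.Buffer do not fail.
func gzipOf(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

func main() {
	good := gzipOf("<0x1> <name> \"Alice\" .\n<0x1> <knows> <0x2> .\n")
	bad := gzipOf("<0x1> <name> \"Alice\" .\n<0x1> <knows>\n")

	fmt.Println("well-formed:", dryRunBytes(good))
	fmt.Println("bad entry:  ", dryRunBytes(bad))
	fmt.Println("empty file: ", dryRunBytes(nil))
	fmt.Println("truncated:  ", dryRunBytes(good[:len(good)/2]))
}
```

In a real tool the reader would come from os.Open on each input file, and the line check would be replaced by the same parser the loader uses, so that the dry run and the actual load agree on what counts as valid.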