Bulk loading and live loading can take quite a while for large datasets, and it's very frustrating when the process crashes after hours of running because of a single badly formatted entry.
What you actually did
I had to fix the issue and rerun the whole process from scratch.
Why that wasn’t great, with examples
It’s a waste of time and resources.
Instead, I would have expected the input to be validated before the process started so any errors would be detected early on.
I ran into similar issues when I first started using Dgraph, and built a basic validator leveraging the bulk loader codebase with support for:
RDF validation
Schema validation
Empty or corrupt gzip file validation
Any of these three conditions can terminate the bulk load process, and if you're unlucky, several hours into the map phase. Fortunately, a basic validator can be built on the existing codebase with little effort by adding a --dryrun (or similarly named) CLI flag to the bulk loader; a rough sketch of what such a pre-flight check could look like is below.
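To illustrate the idea (this is not Dgraph's bulk loader code or the validator mentioned above, just a standard-library sketch with hypothetical names like `validateRDFGzip`), a dry-run pass could cheaply catch empty or corrupt gzip files and obviously malformed N-Quad lines before the map phase starts:

```go
// validate.go — illustrative pre-flight check for a hypothetical --dryrun flag.
// Uses only the Go standard library; real RDF/schema validation would reuse
// the bulk loader's own chunker and parser.
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
	"strings"
)

// validateRDFGzip applies two cheap checks to a gzipped N-Quads file:
// the gzip stream must be readable, and every non-blank, non-comment line
// must end with the "." terminator that N-Quads requires. This is far from
// full RDF parsing, but it catches the kind of formatting errors that would
// otherwise crash a load hours in.
func validateRDFGzip(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("%s: %w", path, err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return fmt.Errorf("%s: not a valid gzip file: %w", path, err)
	}
	defer gz.Close()

	sc := bufio.NewScanner(gz)
	sc.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // allow long lines
	lineNo := 0
	for sc.Scan() {
		lineNo++
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank lines and comments are fine
		}
		if !strings.HasSuffix(line, ".") {
			return fmt.Errorf("%s:%d: line does not end with '.'", path, lineNo)
		}
	}
	if err := sc.Err(); err != nil {
		return fmt.Errorf("%s: read error (possibly truncated gzip): %w", path, err)
	}
	if lineNo == 0 {
		return fmt.Errorf("%s: file is empty", path)
	}
	return nil
}

func main() {
	failed := false
	for _, path := range os.Args[1:] {
		if err := validateRDFGzip(path); err != nil {
			fmt.Fprintln(os.Stderr, "invalid:", err)
			failed = true
			continue
		}
		fmt.Println("ok:", path)
	}
	if failed {
		os.Exit(1)
	}
}
```

Run against the same .rdf.gz files you would pass to the bulk loader (e.g. `go run validate.go data/*.rdf.gz`); it exits non-zero on the first problem in each file, so errors surface in seconds rather than hours.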