Observed:
When importing some RDF data sets given to us by a third party, we encountered a few schema issues: the bulk loader (which we run in a Docker container) reports the (legitimate) error and fails. It was not immediately obvious from the log whether the process stopped there or continued. We tend to let these imports run overnight, and viewing the log is the first thing we do in the morning.
Expected result:
The bulk loader should explicitly state in the log whether or not it completed processing the data set, and especially where in the data set it stopped. Even a bare “Exiting” message would be a bare-minimum solution, but still better than nothing.
Hey @mattysan, can you give us some instances where the bulk loader is unclear about completion? Every second it prints the progress of data insertion, including the stage (map or reduce) and how much progress it has made.
@ashish-goswami Before I do that, can you tell me what it prints upon completion of a successful import? And what does it print upon completion of an unsuccessful one? If you can’t answer those two questions, then you have a bit of work to do, rather than asking me for examples to delay fixing an obvious issue. The issue at hand is simple: if I want to monitor a log to tell when an import is done or when it has failed, what do I look for? Knowing that it failed is just as important as knowing that it succeeded, so that I can make deployment automation decisions (for example, rolling back on an unsuccessful import).
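To sketch the automation use case: a deployment script could gate on an explicit final status line, if the loader emitted one. The marker strings below (`BULK LOAD COMPLETE`, `BULK LOAD FAILED`) are hypothetical, not actual bulk loader output; an unambiguous final line like this is exactly what the issue is asking for.

```shell
# check_import: inspect a bulk-loader log and report the final status.
# NOTE: the marker strings below are hypothetical -- today's loader does
# not emit them; this sketch assumes the explicit "done/failed" line
# requested in this issue.
check_import() {
  log="$1"
  if grep -q "BULK LOAD COMPLETE" "$log"; then
    echo "import succeeded"
    return 0
  elif grep -q "BULK LOAD FAILED" "$log"; then
    # a rollback step could hook in here
    echo "import failed"
    return 1
  else
    # no final status line: still running, or killed without reporting
    echo "no final status line found"
    return 2
  fi
}
```

With that in place, a morning cron job (or the deploy pipeline itself) could call `check_import bulkload.log` and roll back on a non-zero exit code instead of a human eyeballing the log.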
Hey @mattysan, I get your use case now. The bulk loader prints its progress every second; for example, see its output on a 21M-triple dataset run. Each line shows the phase (map or reduce) and the percentage complete, with 100% on the last line.