Can we run the bulk loader multiple times?

Question:

  • Can we run the bulk loader multiple times? We get the error below when the bulk loader is launched more than once (the two workarounds we considered are sketched after this list):
    Output directory exists and is not empty. Use --replace_out to overwrite it.
    
  • We launch the bulk loader as soon as our data pipeline finishes writing an .rdf.gz data file, and this continues until there are no more files
  • Each file is about 250 MB in size
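
For reference, these are the two ways we considered for getting past that error; --replace_out comes straight from the error message, while the per-run output directory is only an illustration we have not validated (remaining flags as in our full command further down):

    # Option 1: overwrite the previous output, as the error message suggests
    dgraph bulk -f ${files_in_ready_state} -s ${schemaFile} --format=rdf \
      --out /coldstart/out --replace_out --zero=dgraph-dgraph-zero:5080

    # Option 2: give every run its own output directory (illustrative naming)
    dgraph bulk -f ${files_in_ready_state} -s ${schemaFile} --format=rdf \
      --out /coldstart/out-$(date +%Y%m%d%H%M%S) --zero=dgraph-dgraph-zero:5080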

Below are the steps we follow:

  • Made sure at least one Zero was running
  • Brought up one Alpha that was blocked by an init container (thanks to the Helm chart)
  • Executed the bulk loader command from the /dgraph folder on the Zero
  • We run a cronjob that wakes up every minute and launches the bulk loader command if there are any files in the ${files_in_ready_state} folder (a rough sketch of the wrapper script follows this list). Below is our command snippet:
    dgraph bulk -f ${files_in_ready_state} -s ${schemaFile} --format=rdf --xidmap xid --store_xids --out /coldstart/out --map_shards=3 --reduce_shards=3 --zero=dgraph-dgraph-zero:5080
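
A rough sketch of the wrapper script the cronjob calls is below; the lock file, the folder for processed files, and the per-run output directory are simplified, illustrative names rather than our exact setup:

    #!/usr/bin/env bash
    # Sketch of the wrapper the cronjob invokes every minute.
    # LOCK and DONE_DIR are illustrative names, not our actual paths.
    set -euo pipefail

    LOCK=/coldstart/bulk.lock
    DONE_DIR=/coldstart/processed

    # Skip this run if a previous bulk load is still in progress.
    exec 9>"$LOCK"
    flock -n 9 || exit 0

    # Nothing to do if the ready folder has no data files.
    ls "${files_in_ready_state}"/*.rdf.gz >/dev/null 2>&1 || exit 0

    # Same command as above, with a per-run output directory to avoid the
    # "Output directory exists" error.
    dgraph bulk -f "${files_in_ready_state}" -s "${schemaFile}" --format=rdf \
      --xidmap xid --store_xids --out "/coldstart/out-$(date +%Y%m%d%H%M%S)" \
      --map_shards=3 --reduce_shards=3 --zero=dgraph-dgraph-zero:5080

    # Move the ingested files aside so the next run does not pick them up again.
    mv "${files_in_ready_state}"/*.rdf.gz "$DONE_DIR"/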
    

Unless something has changed recently, I believe the answer is no. The bulk loader and backup/restore both need a live Zero server, and both output to a local p directory which you then have to copy to the Alphas.

So the Alphas' p directories have to be replaced. The live loader is more incremental.
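
As a rough illustration: with --reduce_shards=3 the bulk loader should leave one p directory per reduce shard (out/0/p, out/1/p, out/2/p), and each of those goes into the data directory of the Alpha(s) serving that group before those Alphas start. The host names and paths below are assumptions, and in a Kubernetes/Helm setup this would more likely be a shared volume or kubectl cp than scp:

    # Copy each reduce shard's p directory into the matching Alpha's data
    # directory, then start the Alphas; hosts and paths are illustrative.
    for shard in 0 1 2; do
      scp -r /coldstart/out/${shard}/p alpha-${shard}:/dgraph/
    done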

If this is true, isn’t it a huge limitation?

  • Does bulk loading scale out?
  • What should be the process for ingesting a few trillion predicates? (A rough live loader sketch follows this list.)
  • Submit all .rdf.gz files in one go? That seems impractical. This calls for a serious Spark connector!
  • I've read multiple posts from members of this forum saying that:
    – The bulk loader chokes on memory since it reads all the files at the same time
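
If the answer above is right and the bulk loader is only meant for the initial import, I assume the ongoing ingestion would have to go through the live loader, roughly as sketched below; the flags and endpoints are my best guess, not something we have tried:

    # Feed each new file to the running cluster with the live loader instead of
    # re-running the bulk loader; endpoints and paths are illustrative.
    for f in "${files_in_ready_state}"/*.rdf.gz; do
      [ -e "$f" ] || continue   # folder is empty, nothing to do
      dgraph live -f "$f" -s "${schemaFile}" \
        --alpha dgraph-dgraph-alpha:9080 --zero dgraph-dgraph-zero:5080 \
        && mv "$f" /coldstart/processed/   # illustrative "processed" folder
    done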