Below is my basic Live uploader K8s CronJob; does anyone have a better one?

Does anyone have a K8s CronJob that uploads data to Dgraph?
Below is my basic one.
Thanks

– Run only one instance of the job at a time, and never kill a long-running one
– Wake up every five minutes and check for data files to upload in /dgraph/upload/ready
– The schema is in the /dgraph/upload/schema directory
– Remove each data file after a successful upload

apiVersion: batch/v1beta1 # use batch/v1 on Kubernetes 1.21+ (v1beta1 is removed in 1.25)
kind: CronJob
metadata: 
  name: dgraph-live-uploader-job
  namespace: dgraph  
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: "Forbid"
  failedJobsHistoryLimit: 3
  successfulJobsHistoryLimit: 10
  startingDeadlineSeconds: 10 # skip a run that cannot be started within 10s
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          containers:
            - name: dgraph-live-uploader
              image: dgraph/dgraph:v21.03.1
              imagePullPolicy: IfNotPresent
              command:
              - sh
              - -c
              - |
                # Executed via `sh -c`, so keep this script POSIX-compatible.

                echo "Starting Live uploader CronJob"
                
                schemaFile="/dgraph/upload/schema/studient.rdf"
                echo "Schema file is ${schemaFile}"

                for file in /dgraph/upload/ready/*; do

                   # Skip anything that is not a regular file (this also catches the
                   # unexpanded glob when the directory is empty).
                   if [ ! -f "${file}" ]; then
                     echo "Not a file: ${file}"
                     continue
                   fi

                   dgraph live --files "${file}" --schema "${schemaFile}" --alpha dgraph-dgraph-alpha:9080 --zero dgraph-dgraph-zero:5080 --format=rdf --upsertPredicate "xid" -b 3000
                   returnCode=$?
                   echo "dgraph live return code = ${returnCode}"

                   # Only remove the data file if the upload succeeded.
                   if [ "${returnCode}" -eq 0 ]; then
                      echo "Removing successfully processed data file ${file}"
                      rm "${file}"
                   fi
                done
              volumeMounts:
                - name: datadir
                  mountPath: /dgraph
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          nodeSelector:
            agentpool: "ingest"          
          volumes:
            - name: datadir
              persistentVolumeClaim:
                claimName: dgraph-azurefile
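
If you want to test a change without waiting for the five-minute schedule, you can trigger a one-off run by hand. The CronJob and namespace names below match the manifest above; the file name dgraph-live-uploader-cronjob.yaml and the job name manual-run are just placeholders:

kubectl apply -f dgraph-live-uploader-cronjob.yaml
kubectl create job manual-run --from=cronjob/dgraph-live-uploader-job -n dgraph
kubectl logs -f job/manual-run -n dgraph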

Our approach was to use Benthos with a Dgraph output that used the gRPC bindings, the same as the live loader. Benthos reads from Google Pub/Sub, batches up the inserts and writes to Dgraph constantly. We are up to about 35 billion N-Quads in Dgraph now.

  • How is the performance? How many days does it take to load 35 billion predicates?
  • We wanted to use the Live uploader CronJob only for the cold start (initial import).
  • We will have to do what you are doing for the daily update feed.
  • We could have used the Bulk loader here, but the docs weren’t clear:
    – For example, the docs say that when using the Bulk loader only the Zeros should be running?
    – If only the Zeros are running, where does Dgraph store the data for billions of predicates?
    – Also, we are using cheap hardware for the Zeros.

It’s not like I had 35 billion N-Quads sitting there ready to insert; they have come in slowly over years as the app is used, so there’s no real way to say.

  • Yes, the Zeros are running; the bulk loader formats the data for the Alphas to start up with.
  • The output goes to ./out/<groupnum>/p, and you move those directories to the right hosts.
  • Probably OK.

Take another read of the bulk loader docs.
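
For reference, a cold-start bulk load would look roughly like this; the host name, paths and shard counts here are illustrative, --files also accepts a directory, and --reduce_shards should match the number of Alpha groups. Run it with only the Zeros up, then copy each out/<groupnum>/p directory onto the matching Alpha host before starting the Alphas:

dgraph bulk --files /dgraph/upload/ready --schema /dgraph/upload/schema/studient.rdf \
  --zero dgraph-dgraph-zero:5080 \
  --map_shards 4 --reduce_shards 2 \
  --out ./out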