Below is my basic Live uploader K8s CronJob; does anyone have a better one?

Does anyone have a K8s CronJob that uploads data to Dgraph?
Below is my basic one.
Thanks

– Run only one instance of the job at a time, and never kill a long-running one
– Wake up every five minutes and check for data files to upload in /dgraph/upload/ready
– The schema is in the /dgraph/upload/schema directory
– Remove each data file after a successful upload

apiVersion: batch/v1beta1 # use batch/v1 on Kubernetes 1.21+ (v1beta1 is removed in 1.25)
kind: CronJob
metadata: 
  name: dgraph-live-uploader-job
  namespace: dgraph  
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: "Forbid"
  failedJobsHistoryLimit: 3
  successfulJobsHistoryLimit: 10
  startingDeadlineSeconds: 10 # skip a run that cannot be started within 10s
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          containers:
            - name: dgraph-live-uploader
              image: dgraph/dgraph:v21.03.1
              imagePullPolicy: IfNotPresent
              command:
              - sh
              - -c
              - |
                # Executed via `sh -c`, so keep this script POSIX-compatible.

                echo "Starting Live uploader CronJob"
                
                schemaFile="/dgraph/upload/schema/studient.rdf"
                echo "Schema file is ${schemaFile}"

                for file in /dgraph/upload/ready/*; do

                   # Skip anything that is not a regular file (this also catches the
                   # unexpanded glob when the directory is empty).
                   if [ ! -f "${file}" ]; then
                     echo "Not a file: ${file}"
                     continue
                   fi

                   dgraph live --files "${file}" --schema "${schemaFile}" --alpha dgraph-dgraph-alpha:9080 --zero dgraph-dgraph-zero:5080 --format=rdf --upsertPredicate "xid" -b 3000
                   returnCode=$?
                   echo "dgraph live return code = ${returnCode}"

                   # Only remove the data file if the upload succeeded.
                   if [ "${returnCode}" -eq 0 ]; then
                      echo "Removing successfully processed data file ${file}"
                      rm "${file}"
                   fi
                done
              volumeMounts:
                - name: datadir
                  mountPath: /dgraph
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          nodeSelector:
            agentpool: "ingest"          
          volumes:
            - name: datadir
              persistentVolumeClaim:
                claimName: dgraph-azurefile
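
If you want to test a change without waiting for the five-minute schedule, you can trigger a one-off run by hand. The CronJob and namespace names below match the manifest above; the file name dgraph-live-uploader-cronjob.yaml and the job name manual-run are just placeholders:

kubectl apply -f dgraph-live-uploader-cronjob.yaml
kubectl create job manual-run --from=cronjob/dgraph-live-uploader-job -n dgraph
kubectl logs -f job/manual-run -n dgraph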

Our approach was to use Benthos with a Dgraph output that used the gRPC bindings, the same as the live loader. Benthos reads from Google Pub/Sub, batches up the inserts and writes to Dgraph constantly. We are up to about 35 billion N-Quads in Dgraph now.

  • How is the performance? How many days does it take to load 35 billion predicates?
  • We wanted to use the Live uploader CronJob only for the cold start (initial import).
  • We will have to do what you are doing for the daily update feed.
  • We could have used the Bulk loader here, but the docs weren’t clear:
    – For example, the docs say that when using the Bulk loader only the Zeros should be running?
    – If only the Zeros are running, where does Dgraph store the data for billions of predicates?
    – Also, we are using cheap hardware for the Zeros.

It’s not like I had 35 billion N-Quads sitting there ready to insert; they have come in slowly over years as the app is used, so there’s no real way to say.

  • Yes, the Zeros are running; the bulk loader formats the data for the Alphas to start up with.
  • The output goes to ./out/<groupnum>/p, and you move those directories to the right hosts.
  • Probably OK.

Take another read of the bulk loader docs.
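
For reference, a cold-start bulk load would look roughly like this; the host name, paths and shard counts here are illustrative, --files also accepts a directory, and --reduce_shards should match the number of Alpha groups. Run it with only the Zeros up, then copy each out/<groupnum>/p directory onto the matching Alpha host before starting the Alphas:

dgraph bulk --files /dgraph/upload/ready --schema /dgraph/upload/schema/studient.rdf \
  --zero dgraph-dgraph-zero:5080 \
  --map_shards 4 --reduce_shards 2 \
  --out ./out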