Dgraph bulk load on version 20.07.2

Hi Everyone,
So we have a file of about 166 GiB that needs to be loaded into Dgraph. We had successfully figured out a way to load the data with a 6-node cluster. We have three deployment files for the zeros and three deployment files for the alphas. The sample files look as below:
Alpha (alpha-0) deployment file:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    io.kompose.service: alpha
    environment: production
  name: alpha
  namespace: ourspacename
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: alpha
  template:
    metadata:
      labels:
        io.kompose.service: alpha
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - thenodename
      initContainers:
      - name: init-alpha
        image: dgraph/dgraph:latest
        command:
          - bash
          - "-c"
          - |
            trap "exit" SIGINT SIGTERM
            echo "Write to /dgraph/doneinit when ready."
            until [ -f /dgraph/doneinit ]; do sleep 2; done
        volumeMounts:
          - name: alpha-claim0
            mountPath: /dgraph
      containers:
      - args:
        - dgraph
        - alpha
        - --my=alpha:7080
        - --zero=zero:5080,zero-1:5081,zero-2:5082
        image: dgraph/dgraph:latest
        name: alpha
        resources:
          limits:
            cpu: 2000m
            memory: "50Gi"
          requests:
            cpu: 1000m
            memory: "30Gi"
        ports:
        - containerPort: 8080
        - containerPort: 9080
        volumeMounts:
        - mountPath: /dgraph
          name: alpha-claim0
      volumes:
      - name: alpha-claim0
        persistentVolumeClaim:
          claimName: theclaimname
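
Because of that init container, each alpha pod stays in the Init state until a /dgraph/doneinit file is created inside its volume, so the alpha process never starts before the bulk-loaded data is in place. As a rough sketch (the namespace and label are taken from the manifest above; your actual pod names will differ), you can watch for that with:

# Alpha pods should show Init:0/1 until /dgraph/doneinit exists in their volume
kubectl -n ourspacename get pods -l io.kompose.service=alpha -w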

And the zero-0 deployment file:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    environment: production
    io.kompose.service: zero
  name: zero
  namespace: ournamespace
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: zero
  template:
    metadata:
      labels:
        io.kompose.service: zero
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - thenodename
      containers:
      - args:
        - dgraph
        - zero
        - --my=zero:5080
        - --replicas
        - "3"
        - --idx
        - "1"
        image: dgraph/dgraph:latest
        name: zero
        resources:
          limits:
            cpu: 20000m
            memory: "350Gi"
          requests:
            cpu: 16000m
            memory: "300Gi"
        ports:
        - containerPort: 5080
        - containerPort: 6080
        volumeMounts:
        - mountPath: /dgraph
          name: zero-claim0
      volumes:
      - name: zero-claim0
        persistentVolumeClaim:
          claimName: theclaimname
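
The zero-1 and zero-2 deployment files are not shown here, but going by the --zero list in the alpha args (zero:5080,zero-1:5081,zero-2:5082), their container args would presumably look something like the lines below. The --peer address and the --idx values are assumptions on my part, not taken from the actual files:

# Hypothetical args for the second and third zeros; --peer points them at the first zero
dgraph zero --my=zero-1:5081 --peer=zero:5080 --replicas 3 --idx 2
dgraph zero --my=zero-2:5082 --peer=zero:5080 --replicas 3 --idx 3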

So our process used to be as follows:

  1. First bring all 3 replicas of the zero pod up.
  2. Copy our file into the zero pod where we have allocated the most memory and CPU (see the copy sketch after the command below).
  3. After copying the data into the zero pod, run the following command to do the bulk load:
dgraph bulk -f dataconngraph.rdf -s finalschema.rdf --map_shards=1 --reduce_shards=1 --http localhost:8000 --zero=localhost:5080 > check.log &
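
For reference, step 2 above is usually just a kubectl cp into the zero pod that will run the bulk loader; the pod name below is a placeholder:

# Copy the data and schema files into the zero pod's /dgraph volume
kubectl cp dataconngraph.rdf ournamespace/zero-pod-name:/dgraph/
kubectl cp finalschema.rdf ournamespace/zero-pod-name:/dgraph/
# Then exec into the pod and run the bulk load command shown above
kubectl -n ournamespace exec -it zero-pod-name -- bash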

After that we used to wait about 13 hours for the process to complete. Once it was done we would check the logs with tail -f check.log, which looks as shown below:

After all of that is over there is an out folder with a size of 446G. Then we used to copy out/0/p/ into each of the alpha pods, and with that the data was populated.
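
One way that copy and hand-off could look for a single alpha (pod names are placeholders, and with data this large you may well use shared storage instead of kubectl cp):

# Pull the reduced p directory out of the zero pod, then push it into the alpha volume
kubectl cp ournamespace/zero-pod-name:/dgraph/out/0/p ./p
kubectl cp ./p ourspacename/alpha-pod-name:/dgraph/p -c init-alpha
# Create the marker file the init container waits for, so the alpha container can start
kubectl -n ourspacename exec alpha-pod-name -c init-alpha -- touch /dgraph/doneinit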

Yesterday we had a production deployment and we were using the image tag dgraph/dgraph:latest. It seems Dgraph also released a new version, 20.07.2, yesterday. In dev we had version 20.07.1.

We did all the steps as mentioned above. In production, though, when we do the bulk load the size of the out folder is just 81G, and there is no error in check.log either. I have attached my production bulk load log below:

Is this meant to happen? Was the output size supposed to decrease because of the new release? Hope someone can guide us with this.

Hi @saugat

Thanks for reporting this; I am checking with the developers and will keep you posted.

In the meantime, can you confirm that all your data is in place (for 20.07.2)? If you run some queries on 20.07.1 and 20.07.2, do you get the same result?
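
One quick way to spot-check is to run the same count query against both clusters over the alpha HTTP port (8080) and compare the numbers; the predicate name here is just a placeholder for one of your own:

# Count the nodes holding a given predicate; run this on 20.07.1 and 20.07.2 and compare
curl -s -H "Content-Type: application/graphql+-" localhost:8080/query -d '
{
  total(func: has(your.predicate)) {
    count(uid)
  }
}'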

Best,
Omar

Hi @saugat,

I was able to reproduce this difference while bulk loading the 21-million dataset, and I got the following:

  • with 20.07.1 the out dir size is 1.4 GB
  • with 20.07.2 the out dir size is 600 MB
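
For reference, the sizes above are simply the on-disk size of the bulk loader output, presumably measured with something like:

# Total size of the bulk loader output directory
du -sh out/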

This seems to be related to the fact that in 20.07.2 compression is enabled by default and the cache is also enabled, hence the lower disk usage.

Hope this clarifies this behavior.

Best,
Omar

Thanks @omar, and thank you for confirming. We also tried using the different versions, and the out dir with 20.07.2 is indeed relatively smaller. Sorry about the late reply. Once again, thank you so much.