Dgraph Data Load Runs Faster on Laptop than K8s

I am running Dgraph on 3 pods on a dedicated node in Kubernetes Azure Cluster and my developer is running it on his Windows Laptop. The Laptop is running Dgraph through Docker and his data load is running about 20 times faster loading from the same source. He’s located outside Azure and the data source is in the same cloud as my k8s cluster. I don’t understand why this is not the reverse case. My server and disk metrics show the resources are barely taxed.

My first bet is on the k8s resources. I think there might be some bottleneck in the way k8s handles resources. Have you tried at least giving special rights or binding the volumes?

What are the specs?
Are you using live loader?

Pinging @joaquin, do you have any idea what this could be? I think I have noticed something like this in the past, but I don’t use k8s all the time. My own tests are run on bare metal (my own computers), so Dgraph always has full access to the resources.

I am not sure, so I would not be brave enough to guess, but I have access to Win10 Home (Docker-Machine w/ VirtualBox), Win10 Pro (Docker Desktop), and Azure AKS, so if it is not too much trouble, could you share more about the setup? Some of this is just curiosity as well, so I hope that is alright.

Questions:

  • In Azure, assuming this is AKS, how many VMs? What location? What is the datasource? Azure Blob? NFS?
  • How is the data loaded, e.g. Liveloader? Bulkloader? Restore? How is it fetched, scp? (wondering if it is downloaded locally or fetched directly using azure blob w/ minio gateway for example)
  • How big is the dataset? 1 million predicates? 16 million predicates? etc.
  • For Windows, is this Windows 10 Pro w/ Docker Desktop?

My theory is that it may be the distributed nature of the cluster combined with the live loader, but I would like to get more info before testing it out. As an extra comparison point, I could also run Kubernetes locally (e.g. Minikube) as a three-VM cluster on Windows.

Thank you for your response. We’ve deployed Dgraph to a dedicated node in Azure, a 4 CPU, 32 GB RAM machine with premium-lrs disks; file descriptors are plentiful. Our Alphas are deployed in a StatefulSet with
- name: "compliance-dgraph-alpha"
  resources:
    requests:
      memory: "24Gi"
      cpu: "1"
    limits:
      memory: "24Gi"
      cpu: "3.5"

And the Zero is
- name: "compliance-dgraph-zero"
  resources:
    requests:
      memory: "128Mi"
      cpu: "250m"
    limits:
      memory: "512Mi"
      cpu: "500m"

It’s in AKS, in a 9-node cluster with one of those nodes dedicated to Dgraph. Ratel, Alpha, and Zero all run on it.
We run a Dgraph upsert process which loads the data as a live load.
The dataset is 8 million objects we call matches (our main predicate).
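
For anyone following along, here is a rough sketch that adds up the requests and limits quoted above and compares them with the stated node size (4 CPU, 32 GB). The helper names are made up for illustration, and it only handles the quantity suffixes that appear in this post. Note that the allocatable capacity of an AKS node is normally somewhat lower than its raw size, and Ratel's resources were not given, so treat the result as a sanity check only.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity to cores ('250m' -> 0.25, '3.5' -> 3.5)."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_mem_gib(quantity: str) -> float:
    """Convert a Kubernetes memory quantity with a Mi or Gi suffix to GiB."""
    factors = {"Mi": 1 / 1024, "Gi": 1.0}
    suffix = quantity[-2:]
    if suffix not in factors:
        raise ValueError(f"unsupported memory quantity: {quantity}")
    return float(quantity[:-2]) * factors[suffix]


# Values copied from the Alpha and Zero container specs in this post.
containers = {
    "alpha": {"requests": ("1", "24Gi"), "limits": ("3.5", "24Gi")},
    "zero": {"requests": ("250m", "128Mi"), "limits": ("500m", "512Mi")},
}

totals = {}
for kind in ("requests", "limits"):
    cpu = sum(parse_cpu(c[kind][0]) for c in containers.values())
    mem = sum(parse_mem_gib(c[kind][1]) for c in containers.values())
    totals[kind] = (cpu, mem)
    print(f"{kind}: {cpu} CPU, {mem} GiB")
```

With these numbers the CPU limits of Alpha and Zero together add up to the node's full 4 CPUs, while the combined CPU requests are only 1.25 cores.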

Correction: we also use another live loader that feeds in data from agents deployed in the field.
It is not the one we are using to do most of the initial data load.

Here is a typical message from our live loader
update-consumer INFO ≫ Update Consumer Record count 10000 took 515.6331701278687 secs
the times are around 510-540 seconds for 10000 records
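
To put that log line in perspective, here is a quick back-of-the-envelope calculation. It assumes one "record" in the log corresponds to one of the 8 million match objects, which may not be exactly true, and the function names are just for illustration.

```python
def load_rate(records: int, seconds: float) -> float:
    """Records written per second for one batch."""
    return records / seconds


def estimated_hours(total_records: int, rate: float) -> float:
    """Hours needed to load total_records at a constant rate."""
    return total_records / rate / 3600.0


# Batch size and duration taken from the log line above.
rate = load_rate(10_000, 515.6331701278687)
# Assumption: the full dataset is 8 million records of the same kind.
hours = estimated_hours(8_000_000, rate)
print(f"{rate:.1f} records/s, about {hours:.0f} hours for 8 million records")
```

That works out to roughly 19 records per second, or between four and five days for the full dataset at this pace.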

This is the Dgraph Node over the last 3-4 weeks

The zero keeps restarting

badger 2020/10/16 21:59:59 INFO: Replaying file id: 5 at offset: 0
badger 2020/10/16 21:59:59 INFO: Replay took: 11.898033ms
I1016 22:00:00.915513      17 node.go:148] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc0018367c0 Applied:1436271 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2bcf318 DisableProposalForwarding:false}
I1016 22:00:00.915835      17 node.go:306] Found Snapshot.Metadata: {ConfState:{Nodes:[1] Learners:[] XXX_unrecognized:[]} Index:1436271 Term:7 XXX_unrecognized:[]}
I1016 22:00:00.915865      17 node.go:317] Found hardstate: {Term:7 Vote:1 Commit:1436485 XXX_unrecognized:[]}
unexpected fault address 0x7fbe8b73609d
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7fbe8b73609d pc=0x11c338b]

goroutine 1 [running]:
runtime.throw(0x1b8ebef, 0x5)
	/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc00049a3b0 sp=0xc00049a380 pc=0xa1af82
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:692 +0x443 fp=0xc00049a3e0 sp=0xc00049a3b0 pc=0xa32983
encoding/binary.bigEndian.Uint32(...)
	/usr/local/go/src/encoding/binary/binary.go:113

Are there other Zeros? What was the environment for this (Windows? Ubuntu? docker-compose.yml? 6-node cluster?) Do you know how to reproduce it (for example, by running the live loader)?

It looks like the data directory from another cluster (say cluster A) was mounted into this cluster (cluster B). Did you have an earlier cluster with a mounted directory or volume where the Dgraph data was stored, and then start a new cluster with the same mounted directory or volume? The other possible scenario is that the underlying volume or data directory became corrupted.

To resolve this and get going again:

  • If you have other Zeros available, go through the removeNode process on the faulty Zero and add a new Zero member (safest way, for an HA scenario).
  • If there’s only one Zero, remove it and create a new Zero with a new data directory; don’t reuse the existing one (destructive way, for a dev environment).
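
For the first option, a minimal sketch of what the removeNode call could look like. This assumes a healthy Zero's HTTP endpoint is reachable at localhost:6080 (for example through a port-forward), that Zero members are addressed with group 0, and that the faulty Zero has Raft ID 3. The address and ID are placeholders, so check the output of /state for the real values before removing anything. The snippet only builds the URLs and does not send any request.

```python
from urllib.parse import urlencode


def state_url(zero_http: str) -> str:
    """URL that lists cluster membership, including Zero Raft IDs."""
    return f"{zero_http}/state"


def remove_node_url(zero_http: str, node_id: int, group: int = 0) -> str:
    """URL asking Zero to remove a member; group 0 is assumed to mean Zero nodes."""
    query = urlencode({"id": node_id, "group": group})
    return f"{zero_http}/removeNode?{query}"


# Placeholder address and Raft ID, for illustration only.
zero = "http://localhost:6080"
print(state_url(zero))
print(remove_node_url(zero, node_id=3))
```

The printed URLs can then be fetched with curl or a browser against a healthy Zero, once the ID has been confirmed.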

Is this the full stack trace? This looks truncated.