I am running Dgraph on 3 pods on a dedicated node in an Azure Kubernetes cluster, and my developer is running it on his Windows laptop through Docker. His data load runs about 20 times faster from the same source. He’s located outside Azure and the data source is in the same cloud as my k8s cluster, so I don’t understand why it isn’t the other way around. My server and disk metrics show the resources are barely taxed.
My first bet is the k8s resources. I think there might be some bottleneck in the way k8s handles resources. Have you tried at least giving special rights or binding the volumes?
What are the specs?
Are you using live loader?
Pinging @joaquin, do you have any idea what it could be? I think I have noticed something like this in the past, but I don’t use k8s all the time. My own tests are done on bare metal (my computers), so Dgraph always has total access to the resources.
I am not sure, so I am not brave enough to guess, but I have access to Win10 Home (Docker Machine w/ VirtualBox), Win10 Pro (Docker Desktop), and Azure AKS, so if it is not too much trouble, maybe share a bit more about the setup. Some of this is just curiosity as well, so I hope that is alright.
Questions:
- In Azure, assuming this is AKS, how many VMs? What location? What is the data source? Azure Blob? NFS?
- How is the data loaded, e.g. live loader? Bulk loader? Restore? How is it fetched, scp? (Wondering if it is downloaded locally or fetched directly using Azure Blob w/ MinIO gateway, for example.)
- How big is the dataset? 1 million predicates? 16 million predicates? etc.
- For Windows, is this Windows 10 Pro w/ Docker Desktop?
My theory is that it may be the distributed nature plus the live loader, but I would like to get more info before testing it out. I was also thinking that, as an extra comparison point, I could run Kubernetes locally (e.g. Minikube) as a three-VM cluster on Windows.
Thank you for your response. We’ve deployed Dgraph to a dedicated node in Azure, a 4-CPU, 32 GB RAM machine with Premium_LRS disks, and file descriptors are plentiful. Our Alphas are deployed in a StatefulSet with:
- name: "compliance-dgraph-alpha"
  resources:
    requests:
      memory: "24Gi"
      cpu: "1"
    limits:
      memory: "24Gi"
      cpu: "3.5"
And the Zero is
- name: "compliance-dgraph-zero"
  resources:
    requests:
      memory: "128Mi"
      cpu: "250m"
    limits:
      memory: "512Mi"
      cpu: "500m"
It’s in AKS, a 9-node cluster with one of those nodes dedicated to Dgraph. The Ratel, Alpha, and Zero all run on it.
We run an upsert Dgraph process which loads the data as a live load.
The dataset is 8 million objects we call matches (our main predicate).
Correction: we also use another live loader that feeds in data from agents deployed in the field.
It is not the one we are using to do most of the initial data load.
Here is a typical message from our live loader
update-consumer INFO ≫ Update Consumer Record count 10000 took 515.6331701278687 secs
The times are around 510-540 seconds for 10,000 records, i.e. roughly 19 records per second.
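For context on what that upsert path costs, here is a rough dgo sketch of a generic Dgraph upsert request. This is not our actual consumer code; the predicate names, values, and Alpha address are made up, and it assumes an index on the lookup predicate. Each batch carries a lookup query plus conditional mutations, so every request is a read plus a conditional write against the Alpha, which is more work per record than a plain set-only load:

```go
package main

// Rough, generic sketch of a Dgraph upsert via the dgo client.
// Hypothetical names throughout: the predicate "match_id", the values,
// and the Alpha address are illustrations, not the real consumer code.

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v200"
	"github.com/dgraph-io/dgo/v200/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("compliance-dgraph-alpha:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// One upsert: look up the match by its id (requires an index on
	// match_id), create it if missing, otherwise update it. The query
	// part runs on every single request.
	req := &api.Request{
		Query: `query { m as var(func: eq(match_id, "abc-123")) }`,
		Mutations: []*api.Mutation{
			{
				Cond:      `@if(eq(len(m), 0))`,
				SetNquads: []byte(`_:new <match_id> "abc-123" .`),
			},
			{
				Cond:      `@if(gt(len(m), 0))`,
				SetNquads: []byte(`uid(m) <updated_at> "2020-10-16T22:00:00Z" .`),
			},
		},
		CommitNow: true,
	}
	if _, err := dg.NewTxn().Do(context.Background(), req); err != nil {
		log.Fatal(err)
	}
}
```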
The Zero keeps restarting:
badger 2020/10/16 21:59:59 INFO: Replaying file id: 5 at offset: 0
badger 2020/10/16 21:59:59 INFO: Replay took: 11.898033ms
I1016 22:00:00.915513 17 node.go:148] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc0018367c0 Applied:1436271 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x2bcf318 DisableProposalForwarding:false}
I1016 22:00:00.915835 17 node.go:306] Found Snapshot.Metadata: {ConfState:{Nodes:[1] Learners:[] XXX_unrecognized:[]} Index:1436271 Term:7 XXX_unrecognized:[]}
I1016 22:00:00.915865 17 node.go:317] Found hardstate: {Term:7 Vote:1 Commit:1436485 XXX_unrecognized:[]}
unexpected fault address 0x7fbe8b73609d
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7fbe8b73609d pc=0x11c338b]
goroutine 1 [running]:
runtime.throw(0x1b8ebef, 0x5)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc00049a3b0 sp=0xc00049a380 pc=0xa1af82
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:692 +0x443 fp=0xc00049a3e0 sp=0xc00049a3b0 pc=0xa32983
encoding/binary.bigEndian.Uint32(...)
/usr/local/go/src/encoding/binary/binary.go:113
Are there other Zeros? What was the environment for this (Windows? Ubuntu? docker-compose.yml? 6-node cluster?)? Do you know how to reproduce it, such as by running the live loader, etc.?
It looks like the data directory belonging to another cluster (say cluster A) was mounted to this cluster (cluster B). Did you have an earlier cluster with a mounted directory or volume where the Dgraph data was installed, and then start a new cluster with the same mounted directory or volume? The other possibility is that the underlying volume or data directory got corrupted.
The way to resolve this and get going again:
- If you have other Zeros available, go through a removeNode process on the faulty Zero and add a new Zero member (see the sketch after this list). (Safest way for an HA scenario.)
- If there’s only one Zero, just remove it and create a new Zero with a new data directory (don’t reuse the existing one). (Destructive way, for a dev environment.)
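For the removeNode route, a minimal sketch of the call against a healthy Zero's admin endpoint follows, assuming the default Zero HTTP port 6080; the service name and Raft ID are placeholders, and group=0 indicates the member being removed is a Zero. Most people just run the equivalent curl from inside the cluster; the Go form only spells out the endpoint and parameters:

```go
package main

// Sketch: remove a faulty Zero from the cluster via the Zero admin API.
// Placeholders: the healthy Zero's service address and the Raft ID of the
// faulty Zero (visible in a healthy Zero's /state output).

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	const healthyZero = "http://compliance-dgraph-zero-0.compliance-dgraph-zero:6080"
	const faultyRaftID = 2 // Raft ID of the Zero to remove

	// group=0 tells Zero that the node being removed is a Zero member.
	url := fmt.Sprintf("%s/removeNode?id=%d&group=0", healthyZero, faultyRaftID)
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```

After the faulty member is removed, a replacement Zero can join with a fresh data directory and a new --idx.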
Is this the full stack trace? This looks truncated.
This is an Ubuntu node in a k8s cluster. We are running 1 Alpha, 1 Ratel, and 1 Zero until we get a handle on how to manage Dgraph. This is the only cluster, but as I cycle out bad pods I have been keeping the PVCs and trying to preserve data, partly to understand this better, since a production environment would need to be able to preserve data.
Thanks for the tip, as I keep burning Alphas to get the cluster back up. It had not occurred to me to burn the Zero and the Zero’s disk, since it is not the persistent store.
It’s truncated
We would need a complete stack trace to figure out the cause of the crash.
The SIGBUS error happens when we have mapped a file to memory and the actual file on disk is deleted or truncated. For example, say you have a 1 GB file on disk mapped to memory. If you move this file while the program is running, you’ll get a SIGBUS error when you try to access the file.
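Here is a minimal Linux-only sketch of that failure mode, assuming nothing about Badger's internals beyond the fact that it memory-maps files: map a file, shrink it on disk while the mapping is live, then touch the mapping. The file path is arbitrary. The program dies with the same kind of "unexpected fault address ... signal SIGBUS" crash shown above:

```go
package main

// Linux-only demonstration of a SIGBUS from a memory-mapped file whose
// backing file shrinks on disk. The file path is arbitrary.

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.Create("/tmp/mmap-sigbus-demo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Grow the file to one page and map it into memory.
	if err := f.Truncate(4096); err != nil {
		panic(err)
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, 4096,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}

	// Shrink the file on disk while the mapping is still live.
	if err := f.Truncate(0); err != nil {
		panic(err)
	}

	// Touching the now-unbacked page raises SIGBUS, which the Go runtime
	// reports as "unexpected fault address ... fatal error: fault".
	fmt.Println(data[100])
}
```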