[20.07] Guidance on major performance bottlenecks

I’m not sure where to even start. I recently came into a job where Dgraph (20.07) has been running in production in their authentication service for some time now.

As the data has grown (~1TB now), the service has been impacted pretty significantly.

Those impacts include:

  1. K8s alpha/zero pods restarting every other day
  2. Major CPU Spikes causing complete downtime
  3. Write performance degradation

It seems that when writes are happening on the DB, the entire DB slows down, making reads very slow. When there are brief periods with no writes, reads speed back up again and are relatively fast.

My question for the community would be: do you have any guidance on how I could go about optimizing this thing?

Solutions we’ve attempted so far:

  1. Increasing RAM from 32GB to 64GB (did not seem to help)

Some of the issues we’ve been experiencing have been noted by others in the community in the past:

  1. Raft.Ready took too long to process:
  2. Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
  3. High memory utilization on alpha node (use of memory cache)
  4. Context Exceeded recurring again

Some things I have noticed:

  1. When CPU spikes to 100% utilization, I see logs like these:
I1027 08:30:30.709017      17 groups.go:900] Got Zero leader: dgraph-zero-2.dgraph-zero.dgraph-prod.svc.cluster.local:5080
W1027 08:30:30.749268      17 draft.go:1245] Raft.Ready took too long to process: Timer Total: 703ms. Breakdown: [{disk 703ms} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
I1027 08:30:33.621398      19 groups.go:956] Zero leadership changed. Renewing oracle delta stream.
E1027 08:30:33.621520      19 groups.go:932] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
W1027 08:30:36.046003      19 draft.go:1245] Raft.Ready took too long to process: Timer Total: 11.395s. Breakdown: [{disk 11.395s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
I1027 08:30:38.626573      19 log.go:34] 2 is starting a new election at term 152
  2. The "Raft.Ready took too long to process" log occurs very frequently.

A solution we are considering is obtaining an enterprise license and using Learner Nodes as read-only nodes, so users don’t feel the write load and performance degradation. Does anyone have input on this?

Current Deployment stack is the following:

  • Amazon EKS Kubernetes cluster using the HA Cluster setup.
  • 3 Dgraph alpha pods in 1 Alpha Group
  • 3 Dgraph zero pods in 1 Zero Group
  • 970Gi Storage using AWS EBS Volume GP2 for alpha pods
  • 170Gi Storage using AWS EBS Volume GP2 for zero pods

We’ve explored switching to Provisioned IOPS SSD EBS volumes; however, we’ve noticed the GP2 volumes we are using have almost 100% burst balance, so our assumption is that this is not where the bottleneck lies.
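For anyone wanting to double-check the same thing, here is a rough sketch of pulling BurstBalance from CloudWatch (assumes the AWS CLI is configured; the volume ID and time window are placeholders):

# Minimum BurstBalance over a write-heavy day; near 100% means GP2 credits are not the bottleneck.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2022-10-27T00:00:00Z --end-time 2022-10-28T00:00:00Z \
  --period 300 --statistics Minimum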


Observation: this version has not been supported for a long time. Its end of life was July 2021.

Many of these things have been addressed in newer versions. If you encounter the same issues on a newer version, knowing that would help.

Latency can increase if there is a lot of writing; the way to mitigate this is to have multiple Alpha instances and load-balance across them, rather than focusing on just one instance.

Let me analyze what you’ve shared and try to help.

You have 1TB on disk. That requires a lot more RAM than you have, unless the 64GB is per instance, which I doubt since you said you’re using k8s.

This error happens when latency is high; it could be a disk or an overload problem. I remember a PR related to that log was merged a long time ago, though I’m not sure what it changed.

If your zero is having this problem, check if it needs more resources.

He’s using K8s, and is probably setting a very high cache size without taking into account the limits each Pod has. If you don’t configure it correctly, Dgraph will think that all of the host’s memory is available to it: it reads the system context, not the Pod’s limits. Ideally you should evaluate the Pod’s limits and then set the cache flag accordingly.

For example, if I have a k8s cluster with 256GB of RAM and 19 Alphas, I would configure each Alpha to use at most about 13.4GB (256/19), or 80% of that to leave headroom.
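As a rough sketch of what that looks like on k8s (the numbers, resource names, and the exact cache flag are illustrative; check dgraph alpha --help for your version):

# Give each Alpha pod an explicit memory limit, leaving headroom for the OS on a 64GB node.
kubectl -n dgraph-prod set resources statefulset/dgraph-alpha \
  --requests=memory=32Gi --limits=memory=56Gi

# Then, in the Alpha container args, size Dgraph's cache well below that limit.
# On v21.03+ this is done via the --cache superflag, e.g. --cache "size-mb=16384";
# older versions use a different flag name, so confirm with dgraph alpha --help.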

The overly high memory shown in Grafana could also be a bug in the way we capture metrics; that is something to evaluate.

Are you using our Cloud?

Update: "Amazon EKS Kubernetes cluster using the HA Cluster setup" - so you aren’t. That issue was about our Cloud, though.

That would be nice. If you create a write-only cluster and serve reads from read-only Learner Nodes, this will help a lot. You can also do this with regular Alpha instances; they don’t have to be Learners. But with Learners I think you get more advantages, since a Learner is not officially part of the cluster and its background activities: it only forwards writes and serves data for reads.
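A rough sketch of what starting a learner replica could look like (Learner Nodes are an enterprise feature; the hostnames are illustrative, and the --raft superflag syntax assumes v21.03+, so verify it against the docs):

# Extra Alpha that only receives replicated data and serves reads for group 1.
dgraph alpha \
  --my=dgraph-learner-0.dgraph-learner.dgraph-prod.svc.cluster.local:7080 \
  --zero=dgraph-zero-0.dgraph-zero.dgraph-prod.svc.cluster.local:5080 \
  --raft "learner=true; group=1" \
  --postings /dgraph/p --wal /dgraph/w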

So, are you using "replicas=3"?

What follows is entirely my own opinion, based on my experience.

I would recommend adding more Alphas. Replicas in the same group work intensively, since they need to replicate every single write. You need more groups (shards). I would recommend 9 Alphas, or you can keep it as is (3 Alphas) and add more Learner Nodes.

I know our documentation and examples show only 3 Alphas, but I don’t particularly like the idea of having 3 replicas and only 3 Alphas. Shards make everything more performant; pure replicas only increase the intensive use of the cluster. That work never disappears entirely, since every replica needs to keep replicating, but latency will drop if you have a more balanced cluster with shards.
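A sketch of how the Zero side drives that (Zero forms groups automatically once enough Alphas join; names and namespace follow your logs but are still illustrative):

# With a replication factor of 3, Zero puts every 3 Alphas into one group,
# so 9 Alphas end up as 3 groups (shards), each predicate still replicated 3x.
dgraph zero --my=dgraph-zero-0.dgraph-zero.dgraph-prod.svc.cluster.local:5080 --replicas 3

# Scaling the Alpha StatefulSet from 3 to 9 lets Zero form groups 1, 2 and 3.
kubectl -n dgraph-prod scale statefulset dgraph-alpha --replicas=9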

How much memory, and how much is available for each pod?

BTW, how big are your WAL files? The ones you find under the /w and /zw paths.
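For reference, a quick way to check those sizes (assumes the default /dgraph data directory of the official image; pod and namespace names follow your logs):

# Raft WAL of an Alpha (the w/ folder next to the postings directory):
kubectl -n dgraph-prod exec dgraph-alpha-0 -- du -sh /dgraph/w
# Raft WAL of a Zero (the zw/ folder):
kubectl -n dgraph-prod exec dgraph-zero-0 -- du -sh /dgraph/zw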


I keep saying that this was not only on Cloud, but no one wants to listen to the truth.


We are working on creating a separate cluster that is an exact copy of our production cluster to test upgrades and different resource allocations. I will be sure to post updates as I go.

We currently have three nodes with 64GB each, so each Alpha instance is on its own node and can use up to 64GB.

As mentioned above, currently 64GB for each pod.

It looks like the .sst files in /w on the Alpha pods average roughly 65MB per file, and the same goes for the /zw path on the Zero pods.

I can do some testing with this in our test cluster.

Something interesting I’ve noticed while creating a copy of our production cluster and importing the production data:

Storage usage on Production:

Storage usage on our copy of Production:

What would cause this to happen, where one Alpha pod takes up far more storage than the others? I assumed all Alpha pods would distribute the data equally, like you see in the second image.

I meant the entire folder, sorry. By default the WAL is kept in RAM, so the bigger this folder gets, the more RAM it consumes. It shrinks over time, but you can change this behavior entirely; although that reduces RAM usage, it can lead to other problems if the disk is too slow.
You can use --badger.vlog=disk and it will avoid excessive RAM usage.

It can be an accumulation of WAL, logs, and so on. The instances will never be 100% identical, as each one generates its own series of logs and deltas according to its use. Instances are working all the time and they are independent; each decides what to do with its data. But in theory they hold the same data, because they are replicas.

If you direct all writes at just one Alpha, that Alpha’s disk usage will grow considerably while the others stay relatively passive.

How did you copy it? Export and re-import, right? How big is the exported data? If you untar it you will get an idea of the size of your data, since that is raw RDF. Because Dgraph does not store data as RDF on disk, you can assume roughly a 10% reduction once loaded: 200GB of exported RDF would actually be about 180GB in Dgraph.
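If the export was tarred for the copy, a rough way to measure the raw RDF size (file names are illustrative; an export is a folder of gzipped .rdf.gz and .schema.gz files):

mkdir -p /tmp/export-check
tar -xzf dgraph-export.tar.gz -C /tmp/export-check   # only needed if you tarred it for kubectl cp
gunzip -r /tmp/export-check                          # decompress the .rdf.gz / .schema.gz files in place
du -sh /tmp/export-check                             # this is the raw RDF size referred to above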

Based on this graph it seems to me that you don’t actually have 1TB of data.

My apologies, that graph seems like it might not be a good representation of disk usage, since those graphs don’t show PVC disk usage.

In our production environment I can see the following usage for each alpha pod is:

dgraph-alpha-0 - 104GB
dgraph-alpha-1 - 677GB
dgraph-alpha-2 - 152GB

This is interesting, I will test it out.

I’ve got the same problem :) No community response, so I decided to rewrite the application to use MongoDB :)

Did you upgrade to 21.03?

You never asked for any help after the upgrade? If you’re a paying user, there will certainly be help available there too, via a support ticket.

FYI - MongoDB would be a terrible choice if you truly have relational data.


Is an export-import needed to upgrade from 20.07 to 21.03?

Yes, and there is a tool that you can use to upgrade: dgraph/upgrade at main · dgraph-io/dgraph · GitHub

You can see there are change_v20.07.0.go and change_v21.03.0.go files, which means a lot changed between those versions, so we had to hardcode the migration steps.

Yes
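For anyone following along, a rough sketch of the export-and-reload path (the admin endpoint and bulk-loader flags are written from memory, so verify them against the docs for your versions; paths are illustrative):

# 1. Export the old cluster as RDF via the /admin GraphQL endpoint of any Alpha:
curl -s localhost:8080/admin -H 'Content-Type: application/json' \
  -d '{"query":"mutation { export(input: {format: \"rdf\"}) { response { code message } } }"}'

# 2. Bulk load the exported files into a brand-new 21.03 cluster
#    (run against a fresh Zero, before any Alphas are started):
dgraph bulk -f export/g01.rdf.gz -s export/g01.schema.gz \
  --map_shards=3 --reduce_shards=1 --zero=localhost:5080

# 3. If you use ACLs, also run the dgraph upgrade tool linked above.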

@MichelDiz

Any idea why this would happen with the bulk loader?

Update: it was just an export file corrupted during kubectl cp; disregard.

2022/11/09 14:37:36 unexpected EOF

github.com/dgraph-io/dgraph/x.Check
	/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).mapStage.func2
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/loader.go:290
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1371
command terminated with exit code 1

This is a file system error. See if the file ends with an extra empty line.

If that doesn’t work for you, try dos2unix to convert it to a Unix-style file:

dos2unix myRDF.rdf

This kind of problem usually happens to people who use Windows and merge files before sending them to Linux; Windows line endings are not fully compatible with Unix-like systems. Another cause is a missing newline at the end of the file, or some file encoding issue.

BTW, EOF means End-of-file.
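Since the root cause here turned out to be a file corrupted during kubectl cp, a couple of quick sanity checks (pod, namespace, and file names are illustrative):

# Compare checksums inside the pod and locally after copying:
kubectl -n dgraph-prod exec dgraph-alpha-0 -- sha256sum /dgraph/export/g01.rdf.gz
sha256sum ./g01.rdf.gz

# Check that an uncompressed RDF file ends with a newline (last byte should be \n):
tail -c 1 myRDF.rdf | od -An -c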

Any idea what could cause this error?

W1116 14:09:14.505741      22 mvcc.go:148] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 000000000000000000000a7375626368616e6e656c020be6a2a7281707b12d5c44315845a63651602dc8e387693161103326263ef64cb2: could not read list part with key 040000000000000000000a7375626368616e6e656c020be6a2a7281707b12d5c44315845a63651602dc8e387693161103326263ef64cb20000000000000001: Key not found rolling up key [0 0 0 0 0 0 0 0 0 0 10 115 117 98 99 104 97 110 110 101 108 2 11 230 162 167 40 23 7 177 45 92 68 49 88 69 166 54 81 96 45 200 227 135 105 49 97 16 51 38 38 62 246 76 178]

We ended up upgrading to 21.03.2 and rebuilding our production DB, and saw significant improvements. But today we are seeing major CPU throttling again, with a lot of the error above being thrown.

Any insights would be appreciated.

Update: it seems this error is thrown during mutations.

To provide more context, I don’t think the previous error I posted has much to do with the major CPU/MEM spike we saw yesterday.

From what I can see, the logs below happened when we saw the spike (during this spike we did notice a brief Dgraph downtime of about 5-7 minutes).

Please correct me if I’m wrong, but from what I can see in the logs, this is the process of electing a new leader (I would be curious to know what triggers Dgraph to change leadership).

 
November 16, 2022 14:21:54.306
I1116 13:21:54.306007      22 groups.go:906] Leader idx=0x3 of group=1 is connecting to Zero for txn updates
 
November 16, 2022 14:21:54.455
I1116 13:21:54.455539      22 groups.go:918] Got Zero leader: dgraph-zero-0.dgraph-zero.dgraph-production.svc.cluster.local:5080
 
November 16, 2022 14:21:55.319
I1116 13:21:55.319250      22 groups.go:974] Zero leadership changed. Renewing oracle delta stream.
 
November 16, 2022 14:21:55.320
E1116 13:21:55.319960      22 groups.go:950] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
 
November 16, 2022 14:21:56.295
I1116 13:21:56.294888      22 groups.go:906] Leader idx=0x3 of group=1 is connecting to Zero for txn updates
 
November 16, 2022 14:21:56.450
I1116 13:21:56.450030      22 groups.go:918] Got Zero leader: dgraph-zero-0.dgraph-zero.dgraph-production.svc.cluster.local:5080
 
November 16, 2022 14:21:57.317
I1116 13:21:57.317130      22 groups.go:974] Zero leadership changed. Renewing oracle delta stream.
 
November 16, 2022 14:21:57.324
E1116 13:21:57.324184      22 groups.go:950] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
 
November 16, 2022 14:21:58.534
I1116 13:21:58.534606      22 groups.go:906] Leader idx=0x3 of group=1 is connecting to Zero for txn updates
 
November 16, 2022 14:21:58.708
I1116 13:21:58.707997      22 groups.go:918] Got Zero leader: dgraph-zero-0.dgraph-zero.dgraph-production.svc.cluster.local:5080
 
November 16, 2022 14:21:59.309
I1116 13:21:59.309002      22 groups.go:974] Zero leadership changed. Renewing oracle delta stream.
 
November 16, 2022 14:21:59.309
E1116 13:21:59.309543      22 groups.go:950] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
 
November 16, 2022 14:22:00.295
I1116 13:22:00.295250      22 groups.go:906] Leader idx=0x3 of group=1 is connecting to Zero for txn updates
 
November 16, 2022 14:22:00.415
I1116 13:22:00.415809      22 groups.go:918] Got Zero leader: dgraph-zero-0.dgraph-zero.dgraph-production.svc.cluster.local:5080
 
November 16, 2022 14:22:01.291
I1116 13:22:01.291660      22 groups.go:974] Zero leadership changed. Renewing oracle delta stream.
 
November 16, 2022 14:22:01.296
E1116 13:22:01.296650      22 groups.go:950] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
 
November 16, 2022 14:22:04.180
I1116 13:22:04.180143      22 groups.go:906] Leader idx=0x3 of group=1 is connecting to Zero for txn updates
 
November 16, 2022 14:22:04.180
I1116 13:22:04.180376      22 groups.go:918] Got Zero leader: dgraph-zero-0.dgraph-zero.dgraph-production.svc.cluster.local:5080 

This happened to us again today; it caused roughly 10 minutes of downtime for our users before Dgraph stabilized again.

Could really use some insights into this issue.

@rarvikar

Any ideas?

Some possible reasons one would see RAFT leader changes:

  • Network segmentation (any chance that your Zeros are getting isolated for more than a few seconds?)
  • Workload: if the Zeros are hammered with more work than their CPUs permit, they could fail to take part in the election cycle, which runs surprisingly often (see the quick check sketched below). Have you read through https://dgraph.io/docs/design-concepts/raft/?
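A quick way to check the second point (the label selector, namespace, and cgroup v1 path are assumptions; adjust to your setup):

# Live CPU/memory per Zero pod (needs metrics-server):
kubectl -n dgraph-production top pods -l app=dgraph-zero

# If nr_throttled / throttled_time keep climbing, the pod is hitting its CPU limit:
kubectl -n dgraph-production exec dgraph-zero-0 -- cat /sys/fs/cgroup/cpu/cpu.stat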

Hi Brandon - it’s very hard to say without looking at the system. I’m pretty sure "rpc error: code = Canceled desc = context canceled" means that your network requests (over gRPC) are timing out or being closed for some reason. And the election of a new leader is triggered by nodes not hearing heartbeat messages from one another. So all of this suggests the computers and network are unable to keep up.

You could have a bad network, or you could be overloading your machines so they can’t serve network requests, or you could be swapping, or IO could be maxed out, etc. etc.

I would start looking for bottlenecks at the system level to understand what resource you are running out of. kswapd activity in top output suggests memory pressure and swapping. sar will tell you if IO is maxed out (especially await times) and will also indicate CPU limitations (and sar reports on swap too). If you see that a particular resource is overwhelmed, that can suggest how to either get more capacity or change your system/code/queries to use less of it.
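Concretely, those system-level checks might look like this (sar and iostat come from the sysstat package):

iostat -x 5 3              # high await / %util on the EBS device points at IO saturation
sar -u 5 3                 # CPU utilization; watch %iowait as well as %user/%system
sar -r 5 3                 # memory usage
sar -W 5 3                 # pages swapped in/out per second; sustained non-zero values mean swapping
top -b -n 1 | head -n 20   # quick look at load average and whether kswapd is busy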

If you are working for a paying Dgraph customer you can submit a support ticket, and even if not you may want to get some professional services support to look into this if it is critical. Let me know if I can help with either of those, but of course the community input here may help also.