Schema Alter operation causing high query/mutation latencies

We have an application where we need to perform regular schema alter operations to isolate predicates for new datasets. Each alter operation creates a new set of predicates. We have observed the following behaviour whenever an alter operation runs:

  1. Increase in pending queries: rate(dgraph_pending_queries_total)
  2. Spike in goroutines
  3. Spike in server latencies: dgraph_latency_bucket, dgraph_grpc_io_server_server_latency_bucket

As per the documentation, schema alter ops, when run in the background, should not impact ongoing queries/mutations.
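For illustration, the alter is issued roughly like this (a minimal sketch using the dgo v2 Go client; the endpoint and predicate names here are placeholders, not our real schema):

```go
package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Placeholder address for an alpha's gRPC endpoint.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Add a new set of predicates for a new dataset.
	// RunInBackground asks Dgraph to build the indexes in the background,
	// which per the docs should not block ongoing queries/mutations.
	op := &api.Operation{
		Schema: `
			dataset_x.name:  string @index(exact) .
			dataset_x.value: float .
		`,
		RunInBackground: true,
	}
	if err := dg.Alter(context.Background(), op); err != nil {
		log.Fatalf("alter failed: %v", err)
	}
}
```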

Cluster details:

NOTE: Expecting pointers on how to use alter ops without compromising query/mutation performance.

Hey @killerknv

Welcome to the Dgraph community. We hope to see you around here more often.

Now, when you say spike in goroutines, where are you seeing that? Goroutines are not like regular threads: you could spawn a million of them and they should have minimal impact on the running of the program if the program is structured well (i.e., no communication deadlock bugs).

More concerning is the spike in latencies. @ibrahim - ideas?

@chewxy The number of goroutines (for Dgraph alpha pods) is not really an issue, just an observation. It's just that it corresponds to the latency increase in most cases.

Hey @killerknv, the spike in goroutines is expected because of the re-indexing.
Can you collect some traces from https://dgraph.io/docs/deploy/tracing/#examining-traces-with-jaeger while the alter operation is active?
I’m looking for spans that are the slowest.

@ibrahim While enabling the tracer on production might take some time, I wanted to get a better understanding of how schema updates work.
The same schema updates take ~2 s on a fresh Dgraph cluster, whereas they take around 60-100 s on a cluster that holds a considerable amount of data and is receiving query/mutation requests.
Considering that these schema updates only add new predicates, they should not impact operations on existing predicates, and the update time should not increase by this much.

@ibrahim I tried to reproduce the same behaviour on a test setup. The alter op response time increased from 600 ms to 5.24 s.
I also observed increased latencies for ongoing queries/mutations.

Jaeger trace for alter op

trace for Server.mutate

@ashishgoswami would have more information about this, but I think we put mutations on hold while we’re indexing data.

New predicates could be associated with old nodes. In that case your query response would change, and Dgraph needs to handle that too.

From the spans, it looks like a lot of time was spent waiting for the maxAssigned. @ashishgoswami is the delay expected?

@ashishgoswami @ibrahim any updates here?

We are also noticing very high dgraph grpc server latencies (dgraph_grpc_io_server_server_latency_bucket) after a peak in edge creation rate.

The sequence of events seems to be as follows:
Increase in the number of mutations (edge creation rate) → consistent increase in pending queries → high server latencies

We were able to recover completely from this only after a rolling restart of the alpha nodes.

Hi @killerknv,

Thanks for raising the issue.

Dgraph uses the Raft protocol internally and runs as a state machine. Hence, at any point in time only one mutation is being applied (though a single mutation is broken into multiple parts and applied concurrently).
This can explain the difference in timing between running alter on a fresh cluster and on a cluster with ongoing mutations/queries, although the difference you mention is very large (2 s vs 60 s). Even though the predicates are new, we do some checks for them, such as cancelling in-progress transactions on those predicates.
Can you share your benchmarks with us? We would like to reproduce this at our end. I am not sure the latency difference should be this big. Also, I would expect the increase in mutations/pending queries to be temporary; eventually everything should run normally.


@ashishgoswami Let me know what details you need regarding the benchmark. The metrics shared are from our production cluster.
Total number of predicates: 4882
Number of predicates per schema alter operation: 16
Most of our mutations span 3-4 predicates.
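For context, a typical mutation in our workload is shaped roughly like the hypothetical helper below (dgo v2 client assumed; the `workload` package and predicate names are invented for illustration):

```go
package workload

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
)

// addSample writes a single node touching three predicates, which mirrors the
// shape of our typical mutations. The predicate names are invented placeholders.
func addSample(ctx context.Context, dg *dgo.Dgraph) error {
	txn := dg.NewTxn()
	defer txn.Discard(ctx)

	mu := &api.Mutation{
		SetNquads: []byte(`
			_:n <dataset_x.name>  "sample"     .
			_:n <dataset_x.value> "42.0"       .
			_:n <dataset_x.tag>   "production" .
		`),
		CommitNow: true,
	}
	if _, err := txn.Mutate(ctx, mu); err != nil {
		log.Printf("mutation failed: %v", err)
		return err
	}
	return nil
}
```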

Pending queries and latencies recovered once we restarted the pods. This also looks like a problem with the default Helm charts: liveness and readiness probes are disabled. They could have been useful when an alpha process gets stuck.
Is there any recommended configuration for readiness and liveness probes that we can use? Not sure of the story behind disabling these.
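To frame what I am after: conceptually the probe would just hit the alpha's /health HTTP endpoint, along the lines of this rough Go sketch (the port 8080 and the 2 s timeout are assumptions on my side):

```go
// Rough sketch of the kind of check a readiness/liveness probe would perform:
// hit the alpha's /health HTTP endpoint and treat a non-200 response (or a
// timeout) as unhealthy. Port 8080 and the 2 s timeout are assumptions.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://localhost:8080/health")
	if err != nil {
		fmt.Fprintf(os.Stderr, "alpha unhealthy: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "alpha unhealthy: status %d\n", resp.StatusCode)
		os.Exit(1)
	}
	fmt.Println("alpha healthy")
}
```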

Hi @killerknv,

Thanks for providing the above information. Since you are able to reproduce this on a test setup, would it be possible to share your test scripts with us? You can share them by personal message.

We reproduced the issue on a test deployment of a microservice responsible for generating mutations and alter ops. I will need to reproduce it using a simpler script that can be easily shared; roughly, it would look like the sketch below.
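A sketch of what I have in mind (dgo v2 client; the endpoint, predicate names, worker count, and sleep interval are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Placeholder alpha gRPC endpoint.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	ctx := context.Background()

	// Background mutation load: a few workers committing small mutations in a loop.
	for w := 0; w < 4; w++ {
		go func(w int) {
			for i := 0; ; i++ {
				txn := dg.NewTxn()
				mu := &api.Mutation{
					SetNquads: []byte(fmt.Sprintf(`_:n <base.name> "w%d-%d" .`, w, i)),
					CommitNow: true,
				}
				if _, err := txn.Mutate(ctx, mu); err != nil {
					log.Printf("mutation error: %v", err)
				}
				txn.Discard(ctx)
			}
		}(w)
	}

	// Repeated alter ops, each adding a fresh batch of 16 predicates, timed.
	for i := 0; ; i++ {
		schema := ""
		for p := 0; p < 16; p++ {
			schema += fmt.Sprintf("batch%d.pred%d: string @index(exact) .\n", i, p)
		}
		start := time.Now()
		op := &api.Operation{Schema: schema, RunInBackground: true}
		if err := dg.Alter(ctx, op); err != nil {
			log.Printf("alter error: %v", err)
		}
		log.Printf("alter #%d took %s", i, time.Since(start))
		time.Sleep(30 * time.Second)
	}
}
```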

@ashishgoswami Meanwhile, is there a recommended configuration for readiness and liveness probes that we can use? I could use the default values after enabling the probes.

Tagging @dmai for the 2nd part.

@ashishgoswami I have shared the script to reproduce the schema alter operations via message.
Running the same set of schema alter ops even on a fresh cluster causes very high memory usage on the alpha pods, and the cluster eventually gets stuck in a state where no more alter ops are possible.

Memory profile for the same:

@ashishgoswami I couldn’t reproduce this on v1.2.7. It looks like the issue is limited to v20.x.x.

Hi Kesharee, could you please also share the resource allocation (RAM, CPU) for each alpha and zero? There are a few switches that were added recently that might help, and I would like to compare.

@anand Following is the cluster configuration:

alpha pods: 12 × (14 GB RAM, 8 CPU)
zero pods: 3 × (6 GB RAM, 4 CPU)
shard replica count: 3

From the profile, it looks like there were too many memtables (the newArena allocations). This happens when a lot of Badger instances are opened at the same time. Each re-indexing call opens a new Badger DB, and that’s why we’re seeing high memory usage.

v1.2.7 should have the same issue, since we do re-indexing the same way in all versions.

We’re starting too many builders in the background, and that’s creating too many Badger DBs, which leads to heavy memory usage.

@killerknv I’ll accept this as a bug and we’ll try to resolve this soon.