CPU unutilised by dgraph alpha

I am trying to track down which transaction is the long-running one blocking the rest.

I can’t find processTask, StartTS, or MaxAssigned in the Prometheus logs. Are they in Jaeger but not Prometheus? I’m not familiar with Jaeger.

Yes, startTS and MaxAssigned are annotations on OpenTelemetry “spans” sampled during query processing. You need to add the tracing info (sampling ratio + Jaeger port to send it to) to the Dgraph startup line. Then you can easily grab a Jaeger stand-alone docker config and run that. (It holds data in memory rather than persisting it but is super easy to configure).

You can also turn on Request Monitoring (Logging - Deploy) to see what is happening.

I tried to curl an alpha’s web port for /z/tracez and I get a 404 page not found error. I didn’t realize the correct endoint is /debug/z/tracez.

> curl http://alpha1:8080/z/tracez
404 page not found

> curl http://alpha1:8080/debug/vars
{ "badger_v3_blocked_puts_total" : 0 ... }

> curl http://alpha1:8080/debug/prometheus_metrics
# HELP badger_blocked_puts_total Total number of blocked puts
# TYPE badger_blocked_puts_total untyped
badger_blocked_puts_total 0

> curl http://alpha1:8080/debug/z/tracez
<!DOCTYPE html>
<html lang="en"><head>
   <meta charset="utf-8">
   <title>Trace Spans</title>

Here are Spans with non-zero values when it appears to be running slow:

Span Name             Running
Badger.Compaction           1
Raftmessage1-to-2           1
Server.Mutate              10
api.Dgraph.Query           10
pb.Raft.Heartbeat          18
pb.Raft.RaftMessage        11
pb.Worker.Subscribe         6
pb.Zero.StreamMembership    1
worker.MutateOverNetwork   10

There’s Running, Latency Samples (>0s to 1m0s), and Error Samples. Do I need to increase the sampling to see what’s actively running if there’s a transaction causing lock contention?

Are there descriptions of what these different Span Names indicate/mean?

Because of the sampling (default is usually 0.01 or 1%) you won’t always see a particular slow update, but every query (Dgraph.Query) will have a bunch of ProcessTask spans, and those spans have drilldown info showing a log message of sorts for each stage of the ProcessTask.

A ProcessTask is the execution for one predicate (an edge or a scalar field) in the query.

If there is a large update blocking the Apply Channel, that will show up as all queries getting stuck, unless they are “best effort” queries that run on the current state and do not wait for the prior update to fully process.

By “stuck” I mean that the detailed annotations logged on the ProcessTask span will have the wating for StartTS and Done wating for MaxAssigned entries with a long delay between them, or even a timeout before MaxAssigned is available.

So you can detect a slow update by looking at the queries.

I’d really use Jaeger, since it is free, runs in a single, self-contained Docker container and is a much better GUI. I think other tools may also work to view OpenTelemetry data, such as Grafana, but I have not tried that.

I’ve got Jaeger installed and Dgraph publishing to it. Is there an API call to DGraph that can increase the sample rate instead of restarting the container all the time?

I haven’t dug into the ProcessTask’s yet. I do see a ProcessTag for each named predicated. Is that common?

I believe you need to restart the process, but that should be quick and rare - typically you set a level of sampling that won’t overwhelm the Jaeger agent, and leave it in place after that.

I’m not sure about the ProcessTag. In the Jaeger view, each query will have many spans named “ProcessTask.<namespace-id>.<predicateName>” which are key. If you are not using namespaces, the namespace-id will be 0.