Advice for alerts in Grafana via Prometheus data

emhagman · September 16, 2018, 2:46pm

Hey,

Was just wondering if you had any tips on what to look out for as a “healthy” dgraph cluster with the data that is provided via Prometheus. I have alerts for memory and the generic health variable that returns 1 or 0 but didn’t know if you had any other tips on what to look out for?

dmai · September 17, 2018, 6:35pm

Memory metrics and dgraph_server_health_status are the ones you want to check for healthiness of the cluster.
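For anyone scripting this check outside Grafana, here is a minimal sketch that reads metrics text in the Prometheus exposition format and tests the health gauge. The helper names (`parse_metrics`, `is_healthy`) and the sample text are made up for illustration; only the metric name comes from this thread. The parser keeps one value per metric name and ignores labels, which is a simplification.

```python
def parse_metrics(text):
    """Parse Prometheus text-exposition lines into {metric_name: value}.

    Simplification (assumption): labels are ignored, so if a metric appears
    with several label sets, the last one seen wins.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if not rest:
            continue
        try:
            values[name] = float(rest[0])  # rest[1], if present, is a timestamp
        except ValueError:
            continue
    return values


def is_healthy(metrics):
    """Treat the node as healthy only if the health gauge is present and equals 1."""
    return metrics.get("dgraph_server_health_status") == 1.0


# Hypothetical sample of scraped text, not real Dgraph output.
SAMPLE = """\
# TYPE dgraph_server_health_status gauge
dgraph_server_health_status 1
dgraph_pending_queries_total 3
"""

print(is_healthy(parse_metrics(SAMPLE)))
```

A missing metric counts as unhealthy here, so a failed scrape is not silently reported as fine.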

The metric dgraph_evicted_lists_total indicates the total number of posting lists evicted from the LRU cache. A high or fast-growing count can necessitate reconfiguring the LRU cache size and potentially increasing the resources for Dgraph.

Similarly, dgraph_cache_hits_total, dgraph_cache_miss_total, and dgraph_cache_race_total will give you an idea of LRU cache size tuning.
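One rough way to turn the hit and miss metrics into a single number for tuning is a hit ratio over a time window. The sketch below assumes both metrics are cumulative counters (as the `_total` suffix suggests), so it works on the difference between two scrapes. The function name and the snapshot dictionaries are hypothetical.

```python
def cache_hit_ratio(prev, curr):
    """Return hits / (hits + misses) for the window between two scrapes.

    prev and curr are dicts of metric name -> counter value.
    Returns None when there were no lookups in the window, or when a
    counter went backwards (for example after a restart).
    """
    hits = curr["dgraph_cache_hits_total"] - prev["dgraph_cache_hits_total"]
    misses = curr["dgraph_cache_miss_total"] - prev["dgraph_cache_miss_total"]
    if hits < 0 or misses < 0:
        return None  # counter reset; skip this window
    lookups = hits + misses
    if lookups == 0:
        return None
    return hits / lookups


# Hypothetical snapshots taken one scrape interval apart.
earlier = {"dgraph_cache_hits_total": 100, "dgraph_cache_miss_total": 50}
later = {"dgraph_cache_hits_total": 190, "dgraph_cache_miss_total": 60}
print(cache_hit_ratio(earlier, later))
```

A ratio that stays low while dgraph_evicted_lists_total keeps climbing would be one hint that the LRU cache is undersized. What counts as "low" depends on the workload.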

High numbers for dgraph_pending_proposals_total and dgraph_pending_queries_total indicate a busy Dgraph cluster where the load is high and the Dgraph Server needs time to catch up.
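Since a brief spike in pending work is normal, an alert on these two metrics usually behaves better when it fires only after the value has stayed high for a while. A small sketch of that idea follows. The threshold and the number of consecutive samples are placeholders to tune for your own cluster, not recommended values.

```python
def sustained_above(samples, threshold, min_consecutive):
    """Return True if `samples` contains at least `min_consecutive`
    back-to-back values strictly greater than `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical readings of dgraph_pending_queries_total, one per scrape.
spiky = [1, 50, 2, 60, 3]
backed_up = [1, 50, 60, 70, 3]

print(sustained_above(spiky, threshold=10, min_consecutive=3))
print(sustained_above(backed_up, threshold=10, min_consecutive=3))
```

This is the same idea as requiring an alert condition to hold for a duration before notifying, just written out by hand.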

emhagman · September 21, 2018, 1:40am

Thanks!!! Is there anything to be done about dgraph_pending_proposals_total and dgraph_pending_queries_total being high? Adding more replicas? Sharding?

Since I have your attention, what do you guys recommend running in production? Currently I have a 3-replica cluster set up with one Zero. I assume I should at least be using HA (how many Zeros?)

dmai · September 21, 2018, 6:47pm

It depends. High numbers could be due to any number of factors, like CPU/memory saturation, network delays, or disk IO throttling. It's best to measure and make adjustments accordingly. On top of CPU/memory/network/disk metrics, the /debug/requests and /debug/events pages can point you in the right direction on what’s taking time.

Three Dgraph Alphas (Server) and one Dgraph Zero with a replication setting of three is adequate. But if the one Zero becomes unavailable, then the cluster is effectively unavailable until the Zero comes back.

Truer high availability for any of the Dgraph instances would be to run three Zeros and three Alphas. That way one Zero or one Alpha can go down at any given time and the cluster will still be up.
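The numbers above follow from majority-quorum arithmetic: Dgraph groups use Raft, and a Raft group keeps working as long as a majority of its members are up. A short sketch of that arithmetic (the function names are just for illustration):

```python
def quorum(members):
    """Smallest number of members that forms a majority of the group."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be down while a majority is still available."""
    return members - quorum(members)


for n in (1, 3, 5):
    print(n, "members -> quorum", quorum(n), "-> tolerates", tolerated_failures(n), "down")
```

With one Zero the group tolerates no failures, and with three it tolerates one, which matches the setups described above. An even group size adds no extra tolerance over the odd size below it, which is why odd counts are the usual choice.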
