Was just wondering if you had any tips on what to look out for as a “healthy” dgraph cluster with the data that is provided via Prometheus. I have alerts for memory and the generic health variable that returns 1 or 0 but didn’t know if you had any other tips on what to look out for?
Memory metrics and dgraph_server_health_status are the ones you want to check for healthiness of the cluster.
dgraph_evicted_lists_total indicates the total number of posting lists evicted from the LRU cache. If it keeps climbing quickly, that can mean the cache is too small, which may call for reconfiguring the LRU cache size and potentially increasing the resources for Dgraph.
dgraph_cache_race_total will give you an idea of LRU cache size tuning.
High numbers for dgraph_pending_queries_total indicate a busy Dgraph cluster where the load is high and the Dgraph Server needs time to catch up.
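If you want to sanity-check these outside of your Prometheus alert rules, here is a rough Python sketch that reads metrics in the Prometheus text format and flags the two conditions above. The function names and the max_pending threshold are made up for illustration, so pick a threshold from your own baseline. It ignores labels and keeps only the last sample it sees for each metric name.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {metric_name: value}.

    Labels are ignored; if a metric name appears more than once,
    the last sample wins. Comment lines (# HELP / # TYPE) are skipped.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line and "}" in line:
            # name{labels} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :].split()
        else:
            # name value [timestamp]
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if not rest:
            continue  # malformed line, no value
        try:
            values[name] = float(rest[0])
        except ValueError:
            continue  # value was not a number
    return values


def check_health(values, max_pending=100.0):
    """Return a list of problems found in the parsed metrics.

    max_pending is a hypothetical threshold, not a Dgraph recommendation.
    A missing health metric is treated as unhealthy.
    """
    problems = []
    if values.get("dgraph_server_health_status") != 1.0:
        problems.append("dgraph_server_health_status is not 1")
    if values.get("dgraph_pending_queries_total", 0.0) > max_pending:
        problems.append("dgraph_pending_queries_total above threshold")
    return problems


if __name__ == "__main__":
    # Made-up sample values, only to show the call pattern.
    sample = "\n".join([
        "# TYPE dgraph_server_health_status gauge",
        "dgraph_server_health_status 1",
        "dgraph_pending_queries_total 250",
    ])
    print(check_health(parse_metrics(sample)))
```

In practice you would feed parse_metrics the body of whatever metrics endpoint your Prometheus scrapes, and watch how dgraph_evicted_lists_total changes between scrapes rather than its absolute value, since it is a running total.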
Thanks!!! Is there anything to be done about dgraph_pending_queries_total being high? Adding more duplicates? Sharding?
Since I have your attention, what do you guys recommend to run in production? Currently I have a 3 replica cluster setup with one zero. I assume I should at least be using HA (how many zeros?)
Depends. It could be due to any number of factors, like CPU/memory saturation, network delays, or disk IO throttling. Best to measure and make adjustments accordingly. On top of CPU/memory/network/disk metrics, the /debug/requests and /debug/events pages can point you in the right direction on what’s taking time.
Three Dgraph Alphas (Server) and one Dgraph Zero with a replication setting of three is adequate. But if the one Zero becomes unavailable, then the cluster is effectively unavailable until the Zero comes back.
Truer high availability for any of the Dgraph instances would be to run three Zeros and three Alphas. That way one Zero or one Alpha can go down at any given time and the cluster will still be up.
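The arithmetic behind that is majority quorum. As a small sketch, assuming each group (the Zeros, or a replicated Alpha group) needs a majority of its members up to keep working, as in Raft-style replication, the number of instances you can lose is the group size minus the majority. The function name here is just for illustration.

```python
def tolerated_failures(members: int) -> int:
    """How many members of a majority-quorum group can be down at once."""
    if members < 1:
        raise ValueError("a group needs at least one member")
    quorum = members // 2 + 1  # smallest majority of the group
    return members - quorum


if __name__ == "__main__":
    for size in (1, 3, 5):
        print(size, "members ->", tolerated_failures(size), "can fail")
```

So a single Zero tolerates no failures, while three Zeros or three Alphas tolerate one, which matches the setup described above.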