Hello everyone, I am opening this topic to suggest that the Dgraph team add progress notifications for very long operations, which can take hours (either because the machine is not very powerful or because the node is recovering from a bad crash).
I was bitten by this issue many times before we eventually set up monitoring for Dgraph, and even now I am sometimes unsure whether Alpha is running some operation or has gotten stuck somewhere for some reason.
As an example, during a recent very write-intensive workload a Dgraph node completely blew up because of an OOM kill issued by Kubernetes.
The Dgraph Alpha came back online and started printing “too many pending proposals. retry later” every 5 minutes while trying to abort old transactions.
The entire recovery took 1.5 hours, even though we run on SSDs with 16 CPUs and 128 GB of RAM, and our dataset is only 24-30 GB. This has happened to us a few times before, so we have become familiar with the solution (i.e. waiting patiently).
Looking at the logs, the Alpha node appears stuck in an unrecoverable state; in reality it is probably replaying all the logs in the background and also running snapshot routines.
This is exactly why I would like to suggest adding a log notification layer for potentially very long operations. The user experience is much better if, every 5-10 minutes, a log line is printed to reassure you that Dgraph is not stuck but is performing background work.
This would also prove very useful to your developers: it would be immediately clear what Dgraph is doing, why it is unresponsive, and whether it is actually stuck, just by looking at a customer's logs.
Let me bring up Terraform as a very good example of excellent user experience: it tells you what it is doing, which resources are involved, and how much time has elapsed since the beginning.
I would expect something like “dgraph is replaying logs… (5m elapsed, 12,345 entries replayed)” to drastically improve the UX during crash recovery, or during long operations in general.
Finally, I understand that monitoring is the way to go for a 360° overview of what is happening, but setting up monitoring is not trivial, and the current documentation is not very complete either.