Add progress notification for very long operations

Hello everyone, I am opening this topic to suggest that the Dgraph team add progress notifications for very long operations, which can take hours (either because the machine is not very powerful or because the node is recovering from a very bad crash).

This issue bit me many times before we eventually set up monitoring for Dgraph, and even now I am sometimes unsure whether alpha is running some operation or is stuck somewhere.

As an example, recently, during a very write-intensive workload, a Dgraph node went down because it was OOM-killed by Kubernetes.

The Dgraph alpha came back online and started logging “too many pending proposals. retry later” every 5 minutes while trying to abort old transactions.

Recovery took 1.5 hours in total, even though we are running on SSDs with 16 CPUs and 128 GB of RAM, and our dataset is just 24-30 GB. This has happened to us a few times before, so we know how to handle it (i.e. just wait patiently).

Looking at the logs, the alpha node appears to be stuck in an unrecoverable state; in truth, it is probably replaying all logs in the background and also dealing with snapshot routines.

This is exactly why I would like to suggest adding a log notification layer for potentially very long operations. The user experience is simply better if a log line is printed every 5-10 minutes to reassure you that Dgraph is not stuck but is running background processes.

This would also prove very useful to your developers, as it should be immediately clear what Dgraph is doing, why it is unresponsive, and whether it got stuck, just by looking at your customers' logs.

I'd point to Terraform as a very good example of excellent user experience here.

Terraform tells you what it is doing, which resources are involved, and how much time has elapsed since it started.

I would expect something like “dgraph is replaying logs… (5m elapsed, 12,345 logs replayed)” to drastically improve the UX during crash recoveries and long operations in general.

Finally, I understand that monitoring is the way to go in order to have a 360° overview of what is happening, but setting up monitoring is not trivial, and the current documentation is also not very complete.

Thank you.


I love this idea. I'm bringing it up with the core team to see if we can get this in.


Today Dgraph shares some info about background tasks at /state, if I’m not wrong, but I feel it is incomplete. We could improve it and dump every task there continuously. That way we could expose it in Ratel (giving nice visual feedback), or users could create their own dashboards (or bots, or any other automation) to follow it.

Also, I created a similar ticket last year: “Show query progress, specifically for long running queries”. My point there was about queries, which suffer from the same lack of “What is going on?”.


I think it should be two separate tickets.


I wasn’t aware that there was an endpoint I could hit to get information about background tasks being carried out by dgraph. This is actually really good news already.

I would still like to point out that logs are the first place sysadmins and developers look when something is not working properly, and I still believe there should be some kind of feedback there too.

Just a quick question for you guys. When running the live loader, is there a way to convert processed_percent into an actual 1-100 percentage? It seems to be using some other metric.