The health endpoint does not represent the alpha state correctly unless the all parameter is used

The health endpoint responds promptly, but when the all parameter is added, the request always times out.
The affected node cannot write data properly. Reads were not tested.
The problem was resolved after a reboot.

What I Did

I removed and rejoined a node (not the alpha with the current issue). The group leader is normal, but another follower has this problem.

Dgraph Metadata

dgraph version
Dgraph version   : v20.11.0-rc5
Dgraph codename  : tchalla
Dgraph SHA-256   : 95d845ecec057813d1a3fc94394ba1c18ada80f584120a024c19d0db668ca24e
Commit SHA-1     : b65a8b10c
Commit timestamp : 2020-12-14 19:09:28 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

If this is a bug report, please put it in /Issues/Dgraph instead of /Users/Dgraph, and please follow the template for bugs. That helps us reproduce the problem and assign an engineer quickly. The more information you provide, the better.


What do you mean by “all parameter is used”? I don’t understand this.

/health?all returns information about the health of all the servers in the cluster.

Ah, ok I see what you mean. Could you use the /graphql endpoint and do a health query?

So there seem to be two issues:

  1. The health endpoint does not represent alpha state correctly unless ?all is used
  2. The health endpoint times out if ?all is used.

I don’t know what “represent alpha state correctly” means - perhaps you want to see Zero information as well? If so, then you are correct, and this is not a bug. But if there are some facts about the Alphas in the cluster that are wrong unless ?all is used, then it’s a bug.

On #2, this seems to be a bug. Tagging @ibrahim

In fact, the node was down at that time.
A request to /health returns status code 200, while a request to /health?all keeps waiting until it times out.
I would say the request without the all parameter does not show the node status correctly.
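
For anyone who wants to check a node for the same symptom, here is a minimal Python sketch that requests both /health and /health?all with a client-side timeout and flags the case described above (200 on the plain endpoint, no answer on ?all). The function names, the 5-second timeout and the "suspect" flag are my own assumptions for illustration, not anything Dgraph defines.

```python
import socket
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code, or 'timeout' / 'unreachable' on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx status.
        return err.code
    except urllib.error.URLError as err:
        # Connection-phase failures are wrapped in URLError.
        if isinstance(err.reason, socket.timeout):
            return "timeout"
        return "unreachable"
    except socket.timeout:
        # A timeout while waiting for the response is raised unwrapped.
        return "timeout"
    except OSError:
        return "unreachable"


def compare_health(base_url, timeout=5.0):
    """Probe /health and /health?all and flag the mismatch from this thread."""
    plain = probe(base_url + "/health", timeout)
    full = probe(base_url + "/health?all", timeout)
    return {
        "health": plain,
        "health_all": full,
        # Hypothetical heuristic: plain endpoint says OK, ?all does not answer.
        "suspect": plain == 200 and full != 200,
    }
```

Calling `compare_health("http://localhost:8080")` (replace the host:port with your alpha's) returns a small dict. A result with "health" equal to 200 and "health_all" equal to "timeout" matches what I am seeing.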

Hey @zzl221000, this sounds like a bug. Could you please help me reproduce this?

Meanwhile, could you please help me with the following three items?

  1. Output of curl -v 'localhost:8080/health?all' . Replace localhost:8080 with your alpha instance’s host:port.
  2. Output of curl localhost:8080/debug/pprof/goroutine\?debug\=2 -o goroutine.txt . Please run this command after sending the curl request. The goroutine.txt file will show what’s running when you hit the /health?all endpoint.
  3. Output of curl localhost:8080/debug/pprof/profile -o cpu.pprof . This is a CPU profile, which will also help us figure out why it’s taking so long.

Please share the goroutine.txt and cpu.pprof files generated in steps 2 and 3.
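
If it is easier, the three requests can be scripted so that the goroutine dump is taken while the /health?all request is still hanging. The following is only a rough Python sketch of that ordering. The helper names, output file names (other than goroutine.txt and cpu.pprof), timeouts and the settle delay are assumptions chosen for illustration. The seconds query parameter on the profile endpoint is the standard Go pprof option for the sampling duration.

```python
import http.client
import pathlib
import threading
import time
import urllib.request


def fetch(url, dest, timeout):
    """Download url into dest. Return True on success, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            dest.write_bytes(resp.read())
        return True
    except (OSError, http.client.HTTPException):
        return False


def collect(base_url, out_dir=".", health_timeout=60.0,
            profile_seconds=30, settle=1.0):
    """Start /health?all in the background, then grab the pprof outputs."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}

    def hit_health():
        results["health_all.json"] = fetch(
            base_url + "/health?all", out / "health_all.json", health_timeout)

    worker = threading.Thread(target=hit_health)
    worker.start()
    time.sleep(settle)  # give the request time to reach the handler

    # Goroutine dump taken while the health request is (hopefully) in flight.
    results["goroutine.txt"] = fetch(
        base_url + "/debug/pprof/goroutine?debug=2",
        out / "goroutine.txt", 30.0)
    # CPU profile; the server samples for profile_seconds before replying.
    results["cpu.pprof"] = fetch(
        base_url + "/debug/pprof/profile?seconds=%d" % profile_seconds,
        out / "cpu.pprof", profile_seconds + 30.0)

    worker.join()
    return results
```

Running `collect("http://localhost:8080", "diag")` writes whatever could be downloaded into the diag directory and returns a dict showing which requests succeeded. A False for health_all.json simply means the request timed out, which is the behaviour being investigated.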