The health endpoint does not represent the alpha state correctly unless the all parameter is used

zzl221000 · February 4, 2021, 3:46pm

The health endpoint responds in a timely manner, but when the all parameter is added, it always times out.
The node showing this behavior cannot write data properly. Reading data was not tested.
The problem went away after a reboot.

What I Did

Removed and rejoined a node (not the Alpha that currently has the issue). The group leader is normal; another follower has this problem.

Dgraph Metadata

dgraph version

Dgraph version   : v20.11.0-rc5
Dgraph codename  : tchalla
Dgraph SHA-256   : 95d845ecec057813d1a3fc94394ba1c18ada80f584120a024c19d0db668ca24e
Commit SHA-1     : b65a8b10c
Commit timestamp : 2020-12-14 19:09:28 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

MichelDiz · February 4, 2021, 5:30pm

If this is a bug report, please put it in /Issues/Dgraph instead of /Users/Dgraph, and please follow the template for bugs. That helps us reproduce the problem and assign an engineer quickly. The more information you provide, the better.

Cheers.

chewxy · February 5, 2021, 2:16am

What do you mean by “all parameter is used”? I don’t understand this.

zzl221000 · February 5, 2021, 2:45am

@chewxy
/health?all returns information about the health of all the servers in the cluster.

chewxy · February 5, 2021, 2:52am

Ah, ok I see what you mean. Could you use the /graphql endpoint and do a health query?
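
For reference, here is a minimal sketch of sending such a health query from Python. It assumes the admin GraphQL API is served on the Alpha's /admin path and exposes a health query with the fields shown. The path, the field selection, and the helper names are assumptions for illustration, so adjust them to your deployment.

```python
import json
import urllib.request

# Assumed field selection; trim or extend it to match your Dgraph version.
HEALTH_QUERY = "query { health { instance address status group } }"


def build_health_request(base, path="/admin"):
    """Build a POST request carrying the GraphQL health query."""
    body = json.dumps({"query": HEALTH_QUERY}).encode("utf-8")
    return urllib.request.Request(
        base.rstrip("/") + path,  # "/admin" is an assumption, see above
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_health_query(base, timeout=10.0):
    """Send the query and return the decoded JSON response."""
    request = build_health_request(base)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling run_health_query("http://localhost:8080") against the affected Alpha would show whether the GraphQL route hangs in the same way as /health?all.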

chewxy · February 5, 2021, 3:02am

So there seem to be TWO issues:

1. The health endpoint does not represent alpha state correctly unless ?all is used.
2. The health endpoint times out if ?all is used.

I don’t know what “represent alpha state correctly” means - perhaps you want to see Zero information as well? If so, then you are correct, and this is not a bug. But if there are some facts about the Alphas in the cluster that are wrong unless ?all is used, then it’s a bug.

On #2, this seems to be a bug. Tagging @ibrahim

zzl221000 · February 5, 2021, 6:12am

In fact, the node was broken at that time.
A request to /health returns status code 200, while a request to /health?all keeps waiting until it times out.
I would say the request without the all parameter does not show the node status correctly.
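
The mismatch can be checked with a small probe like the one below. It is only a sketch: the 5-second timeout, the function names, and the wording of the verdicts are my own choices, not anything defined by Dgraph.

```python
import urllib.error
import urllib.request


def health_urls(base):
    """Return the plain and cluster-wide health URLs for one Alpha."""
    base = base.rstrip("/")
    return base + "/health", base + "/health?all"


def probe(url, timeout=5.0):
    """Return the HTTP status code, or None if there was no usable answer."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # timeout, refused connection, or other network failure


def diagnose(plain_status, all_status):
    """Classify the combination of results described in this thread."""
    if plain_status == 200 and all_status is None:
        return "suspect: /health is 200 but /health?all got no answer"
    if plain_status == 200 and all_status == 200:
        return "ok"
    return "unhealthy"
```

For example, with plain, cluster = health_urls("http://localhost:8080"), the call diagnose(probe(plain), probe(cluster)) reports the "suspect" case on a node in the state I described.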

ibrahim · February 5, 2021, 2:47pm

Hey @zzl221000, this sounds like a bug. Could you please help me reproduce this?

Meanwhile, could you please help me with the following three items?

1. Output of curl -v 'localhost:8080/health?all' . Replace localhost:8080 with your alpha instance’s host:port.
2. Output of curl localhost:8080/debug/pprof/goroutine\?debug\=2 -o goroutine.txt . Please run this command after sending the curl request from step 1. The goroutine.txt file will show what’s running when you hit the /health?all endpoint.
3. Output of curl localhost:8080/debug/pprof/profile -o cpu.pprof . This is a CPU profile, which will also help us figure out why it’s taking so long.

Please share the goroutine.txt and cpu.pprof files generated in steps 2 and 3.
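
If it is easier to script this, the collection could be sketched in Python roughly as follows. The helper names, the health_all.json file name, and the timeouts are assumptions made for illustration. The 90-second limit leaves room for the CPU profile, which by default samples for 30 seconds before it returns.

```python
import threading
import urllib.request


def diagnostic_targets(base):
    """Map each output file to the URL it should be fetched from."""
    base = base.rstrip("/")
    return {
        "goroutine.txt": base + "/debug/pprof/goroutine?debug=2",
        "cpu.pprof": base + "/debug/pprof/profile",
    }


def fetch(url, path, timeout):
    """Download url into path; return True on success, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
        with open(path, "wb") as out:
            out.write(data)
        return True
    except OSError:
        return False  # timeout, refused connection, or write failure


def collect(base, health_timeout=60.0):
    """Fire /health?all in the background, then capture both profiles."""
    hung = threading.Thread(
        target=fetch,
        # health_all.json is a hypothetical name for the step 1 output
        args=(base.rstrip("/") + "/health?all", "health_all.json", health_timeout),
        daemon=True,
    )
    hung.start()  # keep the slow request in flight while profiling
    return {
        path: fetch(url, path, timeout=90.0)
        for path, url in diagnostic_targets(base).items()
    }
```

Running collect("http://localhost:8080") returns a dict that shows which of the two files were written. The goroutine dump is taken first, so it captures the state while the /health?all request is still pending.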
