Bug: Mutex lock contention during health checks


Report a Dgraph Bug

What version of Dgraph are you using?

v21.07.0-gf8681a22

Have you tried reproducing the issue with the latest release?

The issue has been present since v21.03.0.

What is the hardware spec (RAM, OS)?

Steps to reproduce the issue (command/config used to run Dgraph).

An HTTP client in a Go application polls all three Alpha nodes on the /health?all path to check their status. After a while, the number of goroutines on one of the Alpha nodes starts increasing because they are blocked on a mutex lock. pprof shows:

goroutine profile: total 460
310 @ 0xb6db45 0xb7f825 0xb7f80e 0xba1b27 0x19563e5 0x1956294 0x1d49736 0x1e139e5 0xe7cea4 0xe7ed2d 0x1deafd5 0xe7cea4 0xe7ed2d 0xe80463 0xe7b98d 0xba5d81
#	0xba1b26	sync.runtime_SemacquireMutex+0x46					/usr/lib/go-1.16/src/runtime/sema.go:71
#	0x19563e4	sync.(*RWMutex).RLock+0x1a4						/usr/lib/go-1.16/src/sync/rwmutex.go:63
#	0x1956293	github.com/dgraph-io/dgraph/conn.(*Pool).HealthInfo+0x53		/home/ngorohov/projects/go/dgraph/conn/pool.go:347
#	0x1d49735	github.com/dgraph-io/dgraph/edgraph.(*Server).Health+0x675		/home/ngorohov/projects/go/dgraph/edgraph/server.go:1168
#	0x1e139e4	github.com/dgraph-io/dgraph/dgraph/cmd/alpha.healthCheck+0x1e4		/home/ngorohov/projects/go/dgraph/dgraph/cmd/alpha/run.go:366
#	0xe7cea3	net/http.HandlerFunc.ServeHTTP+0x43					/usr/lib/go-1.16/src/net/http/server.go:2069
#	0xe7ed2c	net/http.(*ServeMux).ServeHTTP+0x1ac					/usr/lib/go-1.16/src/net/http/server.go:2448
#	0x1deafd4	github.com/dgraph-io/dgraph/ee/audit.AuditRequestHttp.func1+0x74	/home/ngorohov/projects/go/dgraph/ee/audit/interceptor_ee.go:91
#	0xe7cea3	net/http.HandlerFunc.ServeHTTP+0x43					/usr/lib/go-1.16/src/net/http/server.go:2069
#	0xe7ed2c	net/http.(*ServeMux).ServeHTTP+0x1ac					/usr/lib/go-1.16/src/net/http/server.go:2448
#	0xe80462	net/http.serverHandler.ServeHTTP+0xa2					/usr/lib/go-1.16/src/net/http/server.go:2887
#	0xe7b98c	net/http.(*conn).serve+0x8cc						/usr/lib/go-1.16/src/net/http/server.go:1952

To reproduce the bug, I started Apache Benchmark and waited for some time.
When it happens, the affected Alpha node cannot serve requests.

Expected behaviour and actual result.


