Bug: Mutex lock while health checking


Report a Dgraph Bug

What version of Dgraph are you using?

v21.07.0-gf8681a22

Have you tried reproducing the issue with the latest release?

Since 21.03.0

What is the hardware spec (RAM, OS)?

Steps to reproduce the issue (command/config used to run Dgraph).

There is a http client from Go app that’s make requests to all three alpha nodes with a /health?all path to check status. After a while goroutines on one of alpha nodes starts increasing, because of mutex lock. Pprof says:

goroutine profile: total 460
310 @ 0xb6db45 0xb7f825 0xb7f80e 0xba1b27 0x19563e5 0x1956294 0x1d49736 0x1e139e5 0xe7cea4 0xe7ed2d 0x1deafd5 0xe7cea4 0xe7ed2d 0xe80463 0xe7b98d 0xba5d81
#	0xba1b26	sync.runtime_SemacquireMutex+0x46					/usr/lib/go-1.16/src/runtime/sema.go:71
#	0x19563e4	sync.(*RWMutex).RLock+0x1a4						/usr/lib/go-1.16/src/sync/rwmutex.go:63
#	0x1956293	github.com/dgraph-io/dgraph/conn.(*Pool).HealthInfo+0x53		/home/ngorohov/projects/go/dgraph/conn/pool.go:347
#	0x1d49735	github.com/dgraph-io/dgraph/edgraph.(*Server).Health+0x675		/home/ngorohov/projects/go/dgraph/edgraph/server.go:1168
#	0x1e139e4	github.com/dgraph-io/dgraph/dgraph/cmd/alpha.healthCheck+0x1e4		/home/ngorohov/projects/go/dgraph/dgraph/cmd/alpha/run.go:366
#	0xe7cea3	net/http.HandlerFunc.ServeHTTP+0x43					/usr/lib/go-1.16/src/net/http/server.go:2069
#	0xe7ed2c	net/http.(*ServeMux).ServeHTTP+0x1ac					/usr/lib/go-1.16/src/net/http/server.go:2448
#	0x1deafd4	github.com/dgraph-io/dgraph/ee/audit.AuditRequestHttp.func1+0x74	/home/ngorohov/projects/go/dgraph/ee/audit/interceptor_ee.go:91
#	0xe7cea3	net/http.HandlerFunc.ServeHTTP+0x43					/usr/lib/go-1.16/src/net/http/server.go:2069
#	0xe7ed2c	net/http.(*ServeMux).ServeHTTP+0x1ac					/usr/lib/go-1.16/src/net/http/server.go:2448
#	0xe80462	net/http.serverHandler.ServeHTTP+0xa2					/usr/lib/go-1.16/src/net/http/server.go:2887
#	0xe7b98c	net/http.(*conn).serve+0x8cc						/usr/lib/go-1.16/src/net/http/server.go:1952

To reproduce this bug I’ve started Apache Benchmark and wait for some time.
When it happens alpha node can not serve requests.

Expected behaviour and actual result.


Experience Report for Feature Request

Note: Feature requests are judged based on user experience and modeled on Go Experience Reports. These reports should focus on the problems: they should not focus on and need not propose solutions.

What you wanted to do

What you actually did

Why that wasn’t great, with examples

Any external references to support your case

I built Dgraph with “-race” flag and here is what I’ve got during making request at “/health?all” path:

WARNING: DATA RACE
alpha3_1  | Write at 0x00c000296e28 by goroutine 340:
alpha3_1  |   github.com/dgraph-io/dgraph/conn.(*Pool).HealthInfo()
alpha3_1  |       /home/ngorohov/projects/go/dgraph/conn/pool.go:358 +0x1a9
alpha3_1  |   github.com/dgraph-io/dgraph/edgraph.(*Server).Health()
alpha3_1  |       /home/ngorohov/projects/go/dgraph/edgraph/server.go:1168 +0x93d
alpha3_1  |   github.com/dgraph-io/dgraph/dgraph/cmd/alpha.healthCheck()
alpha3_1  |       /home/ngorohov/projects/go/dgraph/dgraph/cmd/alpha/run.go:366 +0x274
alpha3_1  |   net/http.HandlerFunc.ServeHTTP()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2069 +0x51
alpha2_1  |       /home/ngorohov/projects/go/dgraph/dgraph/cmd/alpha/run.go:366 +0x274
alpha3_1  |   net/http.(*ServeMux).ServeHTTP()
alpha2_1  |   net/http.HandlerFunc.ServeHTTP()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2448 +0xaf
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2069 +0x51
alpha1_1  |   net/http.(*conn).serve()
alpha1_1  |       /usr/lib/go-1.16/src/net/http/server.go:1952 +0x87d
alpha3_1  |   github.com/dgraph-io/dgraph/ee/audit.AuditRequestHttp.func1()
alpha1_1  | 
alpha2_1  |   net/http.(*ServeMux).ServeHTTP()
alpha3_1  |       /home/ngorohov/projects/go/dgraph/ee/audit/interceptor_ee.go:91 +0xb3
alpha3_1  |   net/http.HandlerFunc.ServeHTTP()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2448 +0xaf
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2069 +0x51
alpha1_1  | Goroutine 23 (running) created at:
alpha2_1  |   github.com/dgraph-io/dgraph/ee/audit.AuditRequestHttp.func1()
alpha2_1  |       /home/ngorohov/projects/go/dgraph/ee/audit/interceptor_ee.go:91 +0xb3
alpha1_1  |   net/http.(*Server).Serve()
alpha2_1  |   net/http.HandlerFunc.ServeHTTP()
alpha3_1  |   net/http.(*ServeMux).ServeHTTP()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2448 +0xaf
alpha3_1  |   net/http.serverHandler.ServeHTTP()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2887 +0xca
alpha3_1  |   net/http.(*conn).serve()
alpha1_1  |       /usr/lib/go-1.16/src/net/http/server.go:3013 +0x644
alpha1_1  |   github.com/dgraph-io/dgraph/x.startListen()
alpha1_1  |       /home/ngorohov/projects/go/dgraph/x/server.go:71 +0x144
alpha1_1  | 
alpha1_1  | Goroutine 357 (running) created at:
alpha1_1  |   net/http.(*Server).Serve()
alpha1_1  |       /usr/lib/go-1.16/src/net/http/server.go:3013 +0x644
alpha1_1  |   github.com/dgraph-io/dgraph/x.startListen()
alpha1_1  |       /home/ngorohov/projects/go/dgraph/x/server.go:71 +0x144
alpha1_1  | ==================
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:1952 +0x87d
alpha3_1  | 
alpha3_1  | Previous write at 0x00c000296e28 by goroutine 262:
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2069 +0x51
alpha2_1  |   net/http.(*ServeMux).ServeHTTP()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2448 +0xaf
alpha2_1  |   net/http.serverHandler.ServeHTTP()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2887 +0xca
alpha2_1  |   net/http.(*conn).serve()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:1952 +0x87d
alpha2_1  | 
alpha2_1  | Previous write at 0x00c0000e9068 by goroutine 338:
alpha2_1  |   github.com/dgraph-io/dgraph/conn.(*Pool).HealthInfo()
alpha2_1  |       /home/ngorohov/projects/go/dgraph/conn/pool.go:358 +0x1a9
alpha2_1  |   github.com/dgraph-io/dgraph/edgraph.(*Server).Health()
alpha2_1  |       /home/ngorohov/projects/go/dgraph/edgraph/server.go:1168 +0x93d
alpha2_1  |   github.com/dgraph-io/dgraph/dgraph/cmd/alpha.healthCheck()
alpha2_1  |       /home/ngorohov/projects/go/dgraph/dgraph/cmd/alpha/run.go:366 +0x274
alpha2_1  |   net/http.HandlerFunc.ServeHTTP()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2069 +0x51
alpha2_1  |   net/http.(*ServeMux).ServeHTTP()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2448 +0xaf
alpha2_1  |   github.com/dgraph-io/dgraph/ee/audit.AuditRequestHttp.func1()
alpha2_1  |       /home/ngorohov/projects/go/dgraph/ee/audit/interceptor_ee.go:91 +0xb3
alpha2_1  |   net/http.HandlerFunc.ServeHTTP()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2069 +0x51
alpha2_1  |   net/http.(*ServeMux).ServeHTTP()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2448 +0xaf
alpha2_1  |   net/http.serverHandler.ServeHTTP()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:2887 +0xca
alpha2_1  |   net/http.(*conn).serve()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:1952 +0x87d
alpha2_1  | 
alpha2_1  | Goroutine 340 (running) created at:
alpha3_1  |   github.com/dgraph-io/dgraph/conn.(*Pool).HealthInfo()
alpha2_1  |   net/http.(*Server).Serve()
alpha3_1  |       /home/ngorohov/projects/go/dgraph/conn/pool.go:358 +0x1a9
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:3013 +0x644
alpha2_1  |   github.com/dgraph-io/dgraph/x.startListen()
alpha2_1  |       /home/ngorohov/projects/go/dgraph/x/server.go:71 +0x144
alpha3_1  |   github.com/dgraph-io/dgraph/edgraph.(*Server).Health()
alpha2_1  | 
alpha3_1  |       /home/ngorohov/projects/go/dgraph/edgraph/server.go:1168 +0x93d
alpha2_1  | Goroutine 338 (running) created at:
alpha3_1  |   github.com/dgraph-io/dgraph/dgraph/cmd/alpha.healthCheck()
alpha3_1  |       /home/ngorohov/projects/go/dgraph/dgraph/cmd/alpha/run.go:366 +0x274
alpha2_1  |   net/http.(*Server).Serve()
alpha2_1  |       /usr/lib/go-1.16/src/net/http/server.go:3013 +0x644
alpha3_1  |   net/http.HandlerFunc.ServeHTTP()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2069 +0x51
alpha3_1  |   net/http.(*ServeMux).ServeHTTP()
alpha2_1  |   github.com/dgraph-io/dgraph/x.startListen()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2448 +0xaf
alpha2_1  |       /home/ngorohov/projects/go/dgraph/x/server.go:71 +0x144
alpha3_1  |   github.com/dgraph-io/dgraph/ee/audit.AuditRequestHttp.func1()
alpha2_1  | ==================
alpha3_1  |       /home/ngorohov/projects/go/dgraph/ee/audit/interceptor_ee.go:91 +0xb3
alpha3_1  |   net/http.HandlerFunc.ServeHTTP()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2069 +0x51
alpha3_1  |   net/http.(*ServeMux).ServeHTTP()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2448 +0xaf
alpha3_1  |   net/http.serverHandler.ServeHTTP()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:2887 +0xca
alpha3_1  |   net/http.(*conn).serve()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:1952 +0x87d
alpha3_1  | 
alpha3_1  | Goroutine 340 (running) created at:
alpha3_1  |   net/http.(*Server).Serve()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:3013 +0x644
alpha3_1  |   github.com/dgraph-io/dgraph/x.startListen()
alpha3_1  |       /home/ngorohov/projects/go/dgraph/x/server.go:71 +0x144
alpha3_1  | 
alpha3_1  | Goroutine 262 (running) created at:
alpha3_1  |   net/http.(*Server).Serve()
alpha3_1  |       /usr/lib/go-1.16/src/net/http/server.go:3013 +0x644
alpha3_1  |   github.com/dgraph-io/dgraph/x.startListen()
alpha3_1  |       /home/ngorohov/projects/go/dgraph/x/server.go:71 +0x144
alpha3_1  | ==================

Be aware, we have someone looking into your case.

Cheers.

Thanks @ngorohov for reporting this. This has been fixed by fix(pool): use write lock when getting health info by NamanJain8 · Pull Request #7963 · dgraph-io/dgraph · GitHub.

Cheers :tada:

2 Likes