Deadlock on Query during parallel tests

Hello,

Since Dgraph 1.1.1, and also with Dgraph 1.2.0-rc1, several goroutines in my parallel tests regularly get stuck at txn.Query().
At the bottom of the stack, they are stuck in grpc/internal/transport.(*Stream).waitOnHeader().

I have no clue why it happens, and it is hard to give you a simple example to reproduce the lock.
What could cause that? How can I get debug info about it?
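
For reference, the hanging call looks roughly like this; a minimal sketch assuming the dgo v2 client, with a deadline added so the call fails instead of blocking forever (the address, timeout and example query are placeholders):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to the single alpha over gRPC.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Bound the query with a deadline so a hung call returns
	// context.DeadlineExceeded instead of parking forever in
	// transport.(*Stream).waitOnHeader().
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	txn := dg.NewReadOnlyTxn()
	defer txn.Discard(ctx)

	resp, err := txn.Query(ctx, `{ q(func: type(Word)) { count(uid) } }`)
	if err != nil {
		log.Fatalf("query failed: %v", err) // a timeout surfaces here
	}
	fmt.Println(string(resp.Json))
}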

Thanks.

One more piece of information: some other goroutines are blocked on a short (4 s) time.Sleep().
In my GoLand IDE, if I pause the debugger, put a breakpoint just after one of these time.Sleep() calls, and then resume execution, the breakpoint is never reached!
How is that possible?
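
To see where all the goroutines are actually parked while this happens, a full stack dump of the client process is the most direct evidence; a minimal sketch using only the standard library (package name and port are arbitrary), dropped into a _test.go file:

package mytests

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"os"
	"testing"
)

// TestMain starts a pprof HTTP server so the test binary can be inspected
// from the outside while it is hung.
func TestMain(m *testing.M) {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	os.Exit(m.Run())
}

While the binary is hung, localhost:6060/debug/pprof/goroutine?debug=2 lists every goroutine with its full stack (including the ones sitting in waitOnHeader() and time.Sleep()), and localhost:6060/debug/pprof/profile captures a CPU profile. Sending SIGQUIT to the process also prints all goroutine stacks before it exits.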

Could you share the logs for the alpha(s)? That'd help.

Thank you for the logs; I am looking into this issue. It is possible that alpha is stuck and not responding to queries. I will keep you posted on the progress.


Are you using the GraphQL endpoint by any chance?
I do see an error log, though it seems unrelated; we are looking into that in any case.
The logs cover only about one minute. I can see that an alter request is issued, which would block normal reads. Are you still unable to query anything on the cluster?

I don't use the GraphQL endpoint.
The alter request is presumably because I set the schema at the beginning.

My program is stuck when it happens, but I could try querying from Ratel. I don't have a cluster, only a single instance.

I won't be able to run any tests for now, as I'm out and it's night here.

I confirm that I can run Ratel queries and get responses during the deadlock.

That means alpha is doing fine. Either something is off with the network, or there is an issue in your code. I'm happy to look at your code if you can share it.

When some txn.Query() calls are locked, running the same queries by hand in Ratel works.
But how could my code be responsible for txn.Query() being stuck?

When the deadlock appears, alpha logs the following

13436 admin.go:214] Error processing GraphQL schema: input:1: Unexpected Name "_STAR_ALL"

many times per second.
What does it mean?

That's not related, and I think this warning has been fixed on the latest master.

Could you try with the latest master and let us know whether you are still seeing this issue? Thanks.

I just tried with master, but it still gets stuck.

Maybe Dgraph is not involved, since when it happens EVERY goroutine is stuck.
If I set a debugger breakpoint at the next instruction of any goroutine, it is never reached.
One CPU core is at 100%.

I tried running the Dgraph alpha source code as a project in my IDE in debug mode; pausing the program when the hang happens stops the CPU activity. So it looks related to both my client AND Dgraph alpha.
Doing the same with Dgraph zero did not pause the CPU activity.

I narrowed down the problem to an upsert request that sometimes causes the lock:

Query:
{
	no_games as var(func:type(Word)) @filter(NOT has(~word))

	few_games as var(func:le(count(~word), 20)) {
		games1 as count(~word)
		~word {
			overlap as count(word_overlap)
			date as count(finished_at)
		}
		overlaps1 as sum(val(overlap))
		dates as sum(val(date))
		overlaps2 as math(cond(dates == 0, 10, cond(overlaps1 == 0, 0.5, overlaps1)))
		games2 as math(cond(games1 == 0, 0.1, 1.0*games1))
		few_games_score as math(overlaps2/(games2*games2))
	}

	old_scores as var(func:ge(recent_weight, 0)) @filter(NOT uid(no_games, few_games))
}
Mutations: [
	set_json:"{\"uid\":\"uid(no_games)\",\"recent_weight\":\"100\"}"
	set_json:"{\"uid\":\"uid(few_games)\",\"recent_weight\":\"val(few_games_score)\"}"
	delete_json:"{\"uid\":\"uid(old_scores)\",\"recent_weight\":null}"
]
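
The request is sent as a single upsert, roughly like this; a sketch assuming the dgo v2 client, where dg is the connected client, query holds the block above, and the wrapper function is just for illustration:

package mytests

import (
	"context"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
)

// runUpsert sends the query and the three mutations above as one upsert
// request in a single transaction.
func runUpsert(ctx context.Context, dg *dgo.Dgraph, query string) error {
	req := &api.Request{
		Query: query,
		Mutations: []*api.Mutation{
			{SetJson: []byte(`{"uid":"uid(no_games)","recent_weight":"100"}`)},
			{SetJson: []byte(`{"uid":"uid(few_games)","recent_weight":"val(few_games_score)"}`)},
			{DeleteJson: []byte(`{"uid":"uid(old_scores)","recent_weight":null}`)},
		},
		CommitNow: true, // commit as part of the same request
	}
	_, err := dg.NewTxn().Do(ctx, req)
	return err
}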

Removing the second set_json, {"uid":"uid(few_games)","recent_weight":"val(few_games_score)"}, prevents the lock.

Note that the lock doesn't happen on that request itself but a few seconds later, even though it is that request that seems to trigger it.

I also tried splitting this upsert into separate query, set-mutation and delete-mutation operations, both in a single transaction and in individual transactions, but without success.

New information: the lock is related to the recent_weight predicate, which is defined like this:

recent_weight: float @index(float) .

And I query it with orderdesc in a big query that includes:

word_0_0 as var(func: ge(recent_weight, 0), first: 8, orderdesc: recent_weight) @filter(((NOT uid("0x51a9","0x51ad","0x3580b8","0x358527")) AND NOT uid("0x51bc","0x51af")))

If I remove the orderdesc, the lock no longer happens.
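
For completeness, the predicate is created by the initial Alter call mentioned earlier, roughly like this (a sketch assuming the dgo v2 client; the wrapper function is just for illustration):

package mytests

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
)

// applySchema defines the recent_weight predicate with a float index,
// matching the schema line above.
func applySchema(ctx context.Context, dg *dgo.Dgraph) {
	op := &api.Operation{
		Schema: `recent_weight: float @index(float) .`,
	}
	if err := dg.Alter(ctx, op); err != nil {
		log.Fatalf("alter failed: %v", err)
	}
}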

Could you provide a sequence of steps to reproduce the issue, along with an example dataset if possible? I tried your query with a simple dataset and it worked fine for me here. I'd be happy to look into the issue further then. Thanks for trying to figure it out.