Deadlock on Query during parallel tests

Hello,

Since Dgraph 1.1.1, and also with Dgraph 1.2.0-rc1, several goroutines in my parallel tests regularly get stuck at txn.Query().
At the bottom of the stack, they are stuck in grpc/internal/transport.(*Stream).waitOnHeader().

I have no clue why it happens, and it is hard to give you a simple example to reproduce the lock.
What could cause that? How can I get debug info about it?
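
For reference, the hanging call looks roughly like this; a minimal sketch assuming the dgo v2 client, with a deadline added so the call fails instead of blocking forever (the address, timeout and example query are placeholders):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to the single alpha over gRPC.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Bound the query with a deadline so a hung call returns
	// context.DeadlineExceeded instead of parking forever in
	// transport.(*Stream).waitOnHeader().
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	txn := dg.NewReadOnlyTxn()
	defer txn.Discard(ctx)

	resp, err := txn.Query(ctx, `{ q(func: type(Word)) { count(uid) } }`)
	if err != nil {
		log.Fatalf("query failed: %v", err) // a timeout surfaces here
	}
	fmt.Println(string(resp.Json))
}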

Thanks.

One more piece of information: some other goroutines are blocked on a short (4 s) time.Sleep().
In my GoLand IDE, if I pause the debugger, put a breakpoint just after one of these time.Sleep() calls, and then resume execution, the breakpoint is never reached!
How is that possible?
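
To see where all the goroutines are actually parked while this happens, a full stack dump of the client process is the most direct evidence; a minimal sketch using only the standard library (package name and port are arbitrary), dropped into a _test.go file:

package mytests

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"os"
	"testing"
)

// TestMain starts a pprof HTTP server so the test binary can be inspected
// from the outside while it is hung.
func TestMain(m *testing.M) {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	os.Exit(m.Run())
}

While the binary is hung, localhost:6060/debug/pprof/goroutine?debug=2 lists every goroutine with its full stack (including the ones sitting in waitOnHeader() and time.Sleep()), and localhost:6060/debug/pprof/profile captures a CPU profile. Sending SIGQUIT to the process also prints all goroutine stacks before it exits.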

Could you share the logs for the alpha(s)? That'd help.

Thank you for the logs; I am looking into this issue. It is possible that alpha is stuck and not responding to queries. I will keep you posted on the progress.


Are you using the GraphQL endpoint by any chance?
I do see an error log, though it seems unrelated; we are looking into that in any case.
The logs cover only about one minute. I can see that an alter request is issued, which would block normal reads. Are you still unable to query anything on the cluster?

I don't use the GraphQL endpoint.
The alter request is presumably because I set the schema at the beginning.

My program is stuck when it happens, but I could try querying from Ratel. I don't have a cluster, only a single instance.

I won't be able to run any tests for now, as I'm out and it's night here.

I confirm that I can run Ratel queries and get responses during the deadlock.

That means alpha is doing fine. Either something is off with the network, or there is an issue in your code. I'm happy to look at your code if you can share it.

When some txn.Query() calls are locked, running the same queries by hand in Ratel works.
But how could my code be responsible for txn.Query() being stuck?

When the deadlock appears, alpha logs the following

13436 admin.go:214] Error processing GraphQL schema: input:1: Unexpected Name "_STAR_ALL"

many times per second.
What does it mean?

That's not related, and I think this warning has been fixed on the latest master.

Could you try with the latest master and let us know whether you are still seeing this issue? Thanks.

I just tried with master, but it still gets stuck.

Maybe Dgraph is not involved, since when it happens EVERY goroutine is stuck.
If I set a debugger breakpoint at the next instruction of any goroutine, it is never reached.
One CPU core is at 100%.

I tried running the Dgraph alpha source code as a project in my IDE in debug mode; pausing the program when the hang happens stops the CPU activity. So it looks related to both my client AND Dgraph alpha.
Doing the same with Dgraph zero did not pause the CPU activity.

I narrowed down the problem to an upsert request that sometimes causes the lock:

Query:
{
	no_games as var(func:type(Word)) @filter(NOT has(~word))

	few_games as var(func:le(count(~word), 20)) {
		games1 as count(~word)
		~word {
			overlap as count(word_overlap)
			date as count(finished_at)
		}
		overlaps1 as sum(val(overlap))
		dates as sum(val(date))
		overlaps2 as math(cond(dates == 0, 10, cond(overlaps1 == 0, 0.5, overlaps1)))
		games2 as math(cond(games1 == 0, 0.1, 1.0*games1))
		few_games_score as math(overlaps2/(games2*games2))
	}

	old_scores as var(func:ge(recent_weight, 0)) @filter(NOT uid(no_games, few_games))
}
Mutations: [
	set_json:"{\"uid\":\"uid(no_games)\",\"recent_weight\":\"100\"}"
	set_json:"{\"uid\":\"uid(few_games)\",\"recent_weight\":\"val(few_games_score)\"}"
	delete_json:"{\"uid\":\"uid(old_scores)\",\"recent_weight\":null}"
]
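
The request is sent as a single upsert, roughly like this; a sketch assuming the dgo v2 client, where dg is the connected client, query holds the block above, and the wrapper function is just for illustration:

package mytests

import (
	"context"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
)

// runUpsert sends the query and the three mutations above as one upsert
// request in a single transaction.
func runUpsert(ctx context.Context, dg *dgo.Dgraph, query string) error {
	req := &api.Request{
		Query: query,
		Mutations: []*api.Mutation{
			{SetJson: []byte(`{"uid":"uid(no_games)","recent_weight":"100"}`)},
			{SetJson: []byte(`{"uid":"uid(few_games)","recent_weight":"val(few_games_score)"}`)},
			{DeleteJson: []byte(`{"uid":"uid(old_scores)","recent_weight":null}`)},
		},
		CommitNow: true, // commit as part of the same request
	}
	_, err := dg.NewTxn().Do(ctx, req)
	return err
}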

Removing the second set_json, {"uid":"uid(few_games)","recent_weight":"val(few_games_score)"}, prevents the lock.

Note that the lock doesn't happen on that request itself but a few seconds later, even though it is that request that seems to trigger it.

I also tried splitting this upsert into separate query, set-mutation and delete-mutation operations, both in a single transaction and in individual transactions, but without success.

New information: the lock is related to the recent_weight predicate, which is defined like this:

recent_weight: float @index(float) .

And I query it with orderdesc in a big query that includes:

word_0_0 as var(func: ge(recent_weight, 0), first: 8, orderdesc: recent_weight) @filter(((NOT uid("0x51a9","0x51ad","0x3580b8","0x358527")) AND NOT uid("0x51bc","0x51af")))

If I remove the orderdesc, the lock no longer happens.
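
For completeness, the predicate is created by the initial Alter call mentioned earlier, roughly like this (a sketch assuming the dgo v2 client; the wrapper function is just for illustration):

package mytests

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
)

// applySchema defines the recent_weight predicate with a float index,
// matching the schema line above.
func applySchema(ctx context.Context, dg *dgo.Dgraph) {
	op := &api.Operation{
		Schema: `recent_weight: float @index(float) .`,
	}
	if err := dg.Alter(ctx, op); err != nil {
		log.Fatalf("alter failed: %v", err)
	}
}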

Could you provide a sequence of steps to reproduce the issue, along with an example dataset if possible? I tried your query with a simple dataset and it worked fine for me here. I'd be happy to look into the issue further then. Thanks for trying to figure it out.