After bulk load, dgraph times out during rebalance

jgoodall · July 31, 2020, 1:46pm

Report a Dgraph Bug

After bulk loading data into 3 shards on 3 servers, when we try to start dgraph, zero logs that it is trying to rebalance the Name predicate, which is associated with every node in the graph. However, that move times out each time with a context deadline exceeded error. The timeout appears to be hardcoded at 20 minutes, judging from the source in tablet.go:

predicateMoveTimeout = 20 * time.Minute
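To illustrate how a fixed deadline like this produces the error, here is a minimal sketch. It is not Dgraph's code: `simulateMove` is a hypothetical stand-in for the move, and milliseconds stand in for minutes. It only assumes the move runs under a Go context with a timeout, which is what the "context deadline exceeded" message suggests.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// simulateMove is a hypothetical stand-in for a long-running predicate move.
// It "works" for the given duration unless the context deadline fires first.
func simulateMove(timeout, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	select {
	case <-time.After(work):
		// The work finished before the deadline.
		return nil
	case <-ctx.Done():
		// The deadline fired first; ctx.Err() is context.DeadlineExceeded.
		return ctx.Err()
	}
}

func main() {
	// Scaled down: a 20 ms deadline against 200 ms of work.
	err := simulateMove(20*time.Millisecond, 200*time.Millisecond)
	fmt.Println(errors.Is(err, context.DeadlineExceeded), err)
}
```

No matter how healthy the transfer is, any move that needs longer than the deadline fails the same way, so a large predicate can never finish under a fixed 20-minute limit.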

What version of Dgraph are you using?

v20.03.4

Have you tried reproducing the issue with the latest release?

master branch also has 20m timeout.

What is the hardware spec (RAM, OS)?

CentOS Linux release 7.8.2003

Steps to reproduce the issue (command/config used to run Dgraph).

Start a dgraph cluster (3 alpha nodes, 1 zero) after bulk loading data. The cluster cannot rebalance.

Expected behaviour and actual result.

Expected behavior is to rebalance successfully.

Actual behavior is timeout after 20m and alpha nodes are unable to rebalance predicates.

Logs:

I0731 12:38:43.646333       1 tablet.go:108] Going to move predicate: [Name], size: [43 GB] from group 1 to 2
I0731 12:38:43.646608       1 tablet.go:135] Starting move: predicate:"Name" source_gid:1 dest_gid:2 txn_ts:82001
E0731 12:58:43.645868       1 tablet.go:70] while calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
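The error is logged almost exactly 20 minutes after the move starts. As a rough back-of-the-envelope check, the sketch below computes the sustained transfer rate a move would need to finish inside a given timeout. It assumes that 43 GB means 43 × 10^9 bytes and that the transfer rate is uniform; `requiredMBPerSec` is a hypothetical helper, not part of Dgraph.

```go
package main

import (
	"fmt"
	"time"
)

// requiredMBPerSec returns the sustained rate, in MB/s (1 MB = 10^6 bytes),
// needed to move sizeGB gigabytes (1 GB = 10^9 bytes) within the timeout.
func requiredMBPerSec(sizeGB float64, timeout time.Duration) float64 {
	return sizeGB * 1000 / timeout.Seconds()
}

func main() {
	for _, timeout := range []time.Duration{20 * time.Minute, 2 * time.Hour} {
		fmt.Printf("timeout %v: %.1f MB/s needed for 43 GB\n",
			timeout, requiredMBPerSec(43, timeout))
	}
}
```

Under these assumptions, the 20-minute limit needs roughly 36 MB/s sustained end to end, while a 2-hour limit would need only about 6 MB/s.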

jgoodall · August 14, 2020, 4:27pm

Any update on this issue?

jgoodall · September 15, 2020, 6:29pm

I submitted a PR changing the timeout from 20 minutes to 2 hours:

https://github.com/dgraph-io/dgraph/pull/6388

chewxy · September 15, 2020, 7:08pm

Thanks for your PR. Absolutely love it when the community steps up and helps out with PRs.

@LGalatin, @vvbalaji can you check the PR to see whether it closes ticket 2221 and whether the code is suitable to be merged?

Paras · September 15, 2020, 9:58pm

Hi @jgoodall ,
does increasing the timeout allow the rebalance to succeed?

cc @martinmr , @dmai - comments?

martinmr · September 15, 2020, 10:50pm

It seems fine. But separately we should investigate why the rebalancing after the bulk load is needed at all. The bulk loader does something similar to rebalancing so it shouldn’t be needed.

jgoodall · September 16, 2020, 12:17am

Yes, 2 hours is enough time to rebalance for our graph. I guess it could be a command line option since there are probably graphs larger than ours out there…
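A minimal sketch of what such a command line option could look like, using Go's standard `flag` package. The flag name `predicate_move_timeout` and the helper `parseMoveTimeout` are hypothetical, and this is not how Dgraph actually wires up its options; the 2-hour default is just the value from the PR.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// defaultMoveTimeout mirrors the 2-hour value proposed in the PR.
const defaultMoveTimeout = 2 * time.Hour

// parseMoveTimeout reads a hypothetical --predicate_move_timeout flag from
// args and falls back to defaultMoveTimeout when the flag is absent.
func parseMoveTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("zero", flag.ContinueOnError)
	timeout := fs.Duration("predicate_move_timeout", defaultMoveTimeout,
		"maximum time allowed for a single predicate move (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	// Reject zero or negative values, which would fail every move immediately.
	if *timeout <= 0 {
		return 0, fmt.Errorf("predicate_move_timeout must be positive, got %v", *timeout)
	}
	return *timeout, nil
}

func main() {
	d, err := parseMoveTimeout([]string{"--predicate_move_timeout=6h"})
	fmt.Println(d, err)
}
```

With something like this, operators with larger graphs could raise the limit without rebuilding, and the default would keep the existing behavior for everyone else.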

jgoodall · September 16, 2020, 12:21am

The specific predicate that it rebalances is Name, which is shared across all types, so every node in the graph has one. I don't know if that helps explain why it is trying to rebalance in the first place.

chewxy · September 16, 2020, 3:38am

Merge has been approved … will merge shortly.

Thank you once more for your PR!
