After bulk load, dgraph times out during rebalance

Report a Dgraph Bug

After bulk loading data into 3 shards on 3 servers, when we try to start Dgraph, zero logs that it is trying to rebalance the Name predicate, which is associated with every node in the graph. However, that move times out each time with a context deadline exceeded error. The timeout appears to be hardcoded at 20 minutes, judging from the source in tablet.go:

predicateMoveTimeout = 20 * time.Minute
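To illustrate the failure mode, here is a minimal sketch of how a fixed timeout on a context produces a "context deadline exceeded" error when the work takes longer than the limit. This is not Dgraph's code: movePredicate and runMove are hypothetical stand-ins, the durations are scaled down to milliseconds, and the real error in the logs is additionally wrapped by gRPC.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// movePredicate is a hypothetical stand-in for the real predicate move.
// It succeeds after `work` has elapsed, or returns ctx.Err() if the
// context's deadline fires first.
func movePredicate(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runMove wraps the move in a fixed timeout, the way a hardcoded
// constant such as predicateMoveTimeout would.
func runMove(timeout, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return movePredicate(ctx, work)
}

func main() {
	// Scaled down: a 20ms limit against 200ms of work.
	err := runMove(20*time.Millisecond, 200*time.Millisecond)
	fmt.Println(err) // prints: context deadline exceeded
}
```

If the move needs longer than the limit, no amount of retrying helps, because every attempt is cut off at the same point.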

What version of Dgraph are you using?

v20.03.4

Have you tried reproducing the issue with the latest release?

master branch also has 20m timeout.

What is the hardware spec (RAM, OS)?

CentOS Linux release 7.8.2003

Steps to reproduce the issue (command/config used to run Dgraph).

Start a Dgraph cluster (3 alpha nodes, 1 zero) after bulk loading data. The cluster cannot rebalance.

Expected behaviour and actual result.

Expected behavior is to rebalance successfully.

Actual behavior is a timeout after 20m, and the alpha nodes are unable to rebalance predicates.

Logs:

I0731 12:38:43.646333       1 tablet.go:108] Going to move predicate: [Name], size: [43 GB] from group 1 to 2
I0731 12:38:43.646608       1 tablet.go:135] Starting move: predicate:"Name" source_gid:1 dest_gid:2 txn_ts:82001
E0731 12:58:43.645868       1 tablet.go:70] while calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
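As a quick check that the error lines up with the 20 minute constant, the gap between the "Starting move" line and the error line can be computed from the timestamps above. This is a small sketch; the elapsed helper is hypothetical and only parses the time-of-day portion of the log prefix.

```go
package main

import (
	"fmt"
	"time"
)

// elapsed parses two time-of-day stamps in the HH:MM:SS.microseconds
// form used in the log prefix and returns the gap between them,
// rounded to the nearest second.
func elapsed(start, end string) (time.Duration, error) {
	const layout = "15:04:05.000000"
	s, err := time.Parse(layout, start)
	if err != nil {
		return 0, err
	}
	e, err := time.Parse(layout, end)
	if err != nil {
		return 0, err
	}
	return e.Sub(s).Round(time.Second), nil
}

func main() {
	// Timestamps copied from the "Starting move" and error lines.
	d, err := elapsed("12:38:43.646608", "12:58:43.645868")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // prints: 20m0s
}
```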

Any update on this issue?

I submitted a PR changing the timeout from 20 minutes to 2 hours:

https://github.com/dgraph-io/dgraph/pull/6388

Thanks for your PR. Absolutely love it when the community steps up and helps out with PRs.

@LGalatin, @vvbalaji can you check the PR to see whether it closes ticket 2221 and whether the code is suitable to be merged?

Hi @jgoodall ,
does increasing the timeout allow the rebalance to succeed?

cc @martinmr , @dmai - comments?

It seems fine. But separately we should investigate why the rebalancing after the bulk load is needed at all. The bulk loader does something similar to rebalancing so it shouldn’t be needed.

Yes, 2 hours is enough time to rebalance for our graph. I guess it could be a command line option since there are probably graphs larger than ours out there…
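A rough sketch of what such an option could look like, using Go's standard flag package. The flag name predicate_move_timeout and the parseMoveTimeout helper are made up for illustration, and this is not how Dgraph actually wires up its flags:

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"time"
)

// defaultPredicateMoveTimeout mirrors the 20 minute value quoted from tablet.go.
const defaultPredicateMoveTimeout = 20 * time.Minute

// parseMoveTimeout reads a hypothetical --predicate_move_timeout flag
// from args and falls back to the default when it is not given.
func parseMoveTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("zero", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	timeout := fs.Duration("predicate_move_timeout", defaultPredicateMoveTimeout,
		"maximum time allowed for a single predicate move")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *timeout <= 0 {
		return 0, fmt.Errorf("predicate_move_timeout must be positive, got %s", *timeout)
	}
	return *timeout, nil
}

func main() {
	d, err := parseMoveTimeout([]string{"--predicate_move_timeout=2h"})
	fmt.Println(d, err) // prints: 2h0m0s <nil>
}
```

Keeping 20 minutes as the default would leave existing clusters unchanged, while larger graphs could raise the limit without a rebuild.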

The specific predicate that it rebalances is Name, which is shared across all types, so every node in the graph has one. I don't know if that helps explain why it is trying to rebalance in the first place.

Merge has been approved … will merge shortly.

Thank you once more for your PR!