After bulk loading data into 3 shards on 3 servers, when we try to start Dgraph, Zero logs that it is trying to rebalance the Name predicate, which is associated with every node in the graph. However, that move times out each time with a "context deadline exceeded" error. The timeout appears to be hardcoded at 20 minutes, judging from the source in tablet.go:
predicateMoveTimeout = 20 * time.Minute
What version of Dgraph are you using?
v20.03.4
Have you tried reproducing the issue with the latest release?
The master branch also has the 20m timeout.
What is the hardware spec (RAM, OS)?
CentOS Linux release 7.8.2003
Steps to reproduce the issue (command/config used to run Dgraph).
Start a Dgraph cluster (3 Alpha nodes, 1 Zero) after bulk loading data; it cannot rebalance.
Expected behaviour and actual result.
Expected behavior is to rebalance successfully.
Actual behavior is timeout after 20m and alpha nodes are unable to rebalance predicates.
Logs:
I0731 12:38:43.646333 1 tablet.go:108] Going to move predicate: [Name], size: [43 GB] from group 1 to 2
I0731 12:38:43.646608 1 tablet.go:135] Starting move: predicate:"Name" source_gid:1 dest_gid:2 txn_ts:82001
E0731 12:58:43.645868 1 tablet.go:70] while calling MovePredicate: rpc error: code = DeadlineExceeded desc = context deadline exceeded
It seems fine. But separately we should investigate why the rebalancing after the bulk load is needed at all. The bulk loader does something similar to rebalancing so it shouldn’t be needed.
Yes, 2 hours is enough time to rebalance for our graph. I guess it could be a command line option since there are probably graphs larger than ours out there…
The specific predicate that it rebalances is Name, which is shared across all types, so every node in the graph has one. I don't know whether that helps explain why it is trying to rebalance in the first place.