Predicate mutations while moving predicates between groups

The docs on cluster setup mention that predicates can't be mutated while they're being moved to a different group.

Are there any plans to remove this limitation by allowing mutations while predicates are being moved?

Yes. On the roadmap, this is called “hot tablet moves”: Dgraph Product Roadmap 2021.


@dmai - Perfect, thanks.

Is that likely to be available in 21.07, or more likely 21.11?

@eugaia, the change is already in master: opt(predMove): hot tablet move by NamanJain8 · Pull Request #7703 · dgraph-io/dgraph · GitHub. So it should be out in 21.07 :tada:


@Naman - Awesome. Thanks!


@Naman - Looking through the above pull request, I had some thoughts.

Rather than doing just one pass of phase 1 and one of phase 2, could the downtime caused by phase 2 be minimized by running multiple cycles of phase 1 to bring the tablet up to the present (or nearly so), before doing a final phase 2 that blocks for only a few seconds at most?

What I mean is something like this (a rough sketch follows the list below):

  1. Phase 1 as it is now
  2. At the end of phase 1, check whether any commits have happened during (1); if so, restart phase 1 from the later timestamp to catch up to the current (or a very recent) timestamp (i.e. still non-blocking for writes)
  3. Repeat 1+2 as many times as necessary to bring the tablet up to the present, or nearly so
  4. Move to phase 2, which should then only need to block commits for a few seconds at most
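To make the idea concrete, here is a rough Go sketch of that loop. The `tabletMover` interface, its method names, and `maxCatchupRounds` are all invented for illustration; they are not Dgraph's actual types, nor the structure of the code in PR #7703.

```go
package tabletmove

import "fmt"

// Hypothetical interface standing in for the tablet-move machinery; the
// method names are made up purely to illustrate the control flow proposed above.
type tabletMover interface {
	// copySince streams all edges committed after sinceTs to the target group
	// without blocking writes (phase 1) and returns the timestamp it caught up to.
	copySince(sinceTs uint64) (caughtUpTs uint64, err error)
	// currentTs returns the latest commit timestamp seen on the predicate.
	currentTs() uint64
	// blockAndFinalize blocks new commits, copies the remaining delta, and
	// switches the tablet to the new group (phase 2).
	blockAndFinalize(sinceTs uint64) error
}

// moveTablet repeats the non-blocking phase 1 until the tablet is (almost)
// caught up, or a retry budget runs out, and only then runs the blocking
// phase 2, which should then have very little left to copy.
func moveTablet(m tabletMover, maxCatchupRounds int) error {
	var ts uint64
	for i := 0; i < maxCatchupRounds; i++ {
		caughtUp, err := m.copySince(ts)
		if err != nil {
			return fmt.Errorf("phase 1, round %d: %w", i+1, err)
		}
		ts = caughtUp
		if m.currentTs() <= ts {
			// No commits landed while we were copying; stop iterating.
			break
		}
	}
	// Phase 2: block writes briefly and copy whatever is still missing.
	if err := m.blockAndFinalize(ts); err != nil {
		return fmt.Errorf("phase 2: %w", err)
	}
	return nil
}
```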

@eugaia, I thought about it as well. This can be done, but there is no direct way of knowing how many updates have been made to a predicate since timestamp ts.
One option could be to iterate phase 1 a constant number of times (say 3) and then move to phase 2.
That would add some complexity, and possibly edge cases, to the solution, and I don't think it is worth it.


@Naman - with your current implementation, assuming a dataset that takes, say, 30 minutes to transfer to the new node, and frequent updates during the tablet transfer, do you have an estimate of how long the predicate will be unavailable for writes during phase 2? Are we talking seconds, or potentially several minutes?

Instead of this method, have you thought about writing the same data to more than one tablet at once while writes are coming in? What I mean is something like this:

  1. Before the tablet move, writes are made to the original tablet t1 (and its replica set) only
  2. When tablet move is triggered, copy of edges from tablet t1 to tablet t2 begins
  3. While edges are being copied from t1 to t2, any new writes on t1 are also written to t2 at the same time (while the copy is still in progress)
  4. While the edges are being copied from t1 to t2, reads are still only done from t1 (as now)
  5. When all the edges have been copied from t1 to t2, reads and writes are moved to t2 and t1 can be dropped

Assuming that writes to t2 can be processed both from the copying process and from the new writes simultaneously, that should guarantee that all new writes during the copying process show up on the new tablet, shouldn’t it?
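As a rough illustration of steps 2-4, here is a toy Go sketch. `Mutation`, `Tablet`, `DualWrite`, and `CopyEdges` are invented names, and real Dgraph mutations go through Raft proposals and posting lists rather than an in-memory map, so this only shows the idea of applying a live write to both tablets while the copy runs.

```go
package tabletmove

import "sync"

// Toy types only; real Dgraph tablets are posting lists in Badger, not maps.
type Mutation struct {
	UID   uint64
	Value []byte
}

type Tablet struct {
	mu    sync.Mutex
	edges map[uint64][]byte
}

func NewTablet() *Tablet {
	return &Tablet{edges: make(map[uint64][]byte)}
}

func (t *Tablet) Apply(m Mutation) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.edges[m.UID] = m.Value
}

// DualWrite applies a live mutation to the source tablet t1 and, while a
// move is in progress, to the destination tablet t2 as well (step 3 above).
func DualWrite(t1, t2 *Tablet, moving bool, m Mutation) {
	t1.Apply(m)
	if moving && t2 != nil {
		t2.Apply(m)
	}
}

// CopyEdges streams a snapshot of t1 into t2 (step 2 above) while DualWrite
// keeps running concurrently. Note that a value copied from the snapshot can
// overwrite a newer dual-written value for the same edge, which is exactly
// the kind of race the t3 variant below is meant to avoid.
func CopyEdges(t1, t2 *Tablet) {
	t1.mu.Lock()
	snapshot := make(map[uint64][]byte, len(t1.edges))
	for k, v := range t1.edges {
		snapshot[k] = v
	}
	t1.mu.Unlock()
	for k, v := range snapshot {
		t2.Apply(Mutation{UID: k, Value: v})
	}
}
```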

On the off chance that writing simultaneously to t1 and t2 might cause race conditions in the stored data, you could instead write new mutations to a third, temporary tablet t3 during (3), so that after copying all of t1 to t2 you also copy all of t3 to t2, overwriting the copied data with the newer writes. Then, when the switch is made to using t2 instead of t1, t3 is dropped as well.
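Continuing the same toy sketch (reusing the hypothetical `Mutation` and `Tablet` types above), the temporary tablet t3 could simply be an ordered mutation log that is replayed onto t2 after the bulk copy finishes, so that the newest writes always win:

```go
// BufferTablet stands in for the temporary tablet t3: an ordered log of the
// mutations that arrive while t1 is being copied to t2.
type BufferTablet struct {
	log []Mutation
}

func (b *BufferTablet) Append(m Mutation) {
	b.log = append(b.log, m)
}

// Replay applies the buffered mutations to t2 in arrival order after
// CopyEdges has finished, overwriting any stale values written by the bulk
// copy. Once reads and writes switch over to t2, both t1 and the buffer
// can be dropped.
func (b *BufferTablet) Replay(t2 *Tablet) {
	for _, m := range b.log {
		t2.Apply(m)
	}
}
```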

I appreciate that this adds a little more complexity, but having downtime on writes because of tablet moves is a big deal for people with high-availability use cases.

Thanks @eugaia. I discussed the changes internally.
The first approach you suggested is doable, and we can do that. The second approach is also great but would require a lot of rework of the main pipeline for how mutations are processed.
I am marking this as accepted and creating a ticket to track it further.


@Naman - Thank you. I appreciate you taking the time to consider these ideas.