My references to wall-clock time are only there to describe how we meet the abstract requirements of linearizability; the implementation itself does not use wall-clock time.
I am indeed assuming that we want to read from the group with a query that comes from outside the group, because queries can generally start outside the group. If we get linearizability by using ReadIndex from inside the group, we pay an extra network round trip: first from outside the group to a group member, then the ReadIndex exchange inside the group.
Note that with our vendored etcd version, using ReadIndex requires CheckQuorum==true in the etcd config. In that version ReadIndex is lease-based, and given the way etcd uses an external ticker to keep time, the lease might not be a reliable thing to depend on. A later etcd version removes this limitation by adding a ReadIndex that does not rely on leases.
Executing all reads on the leader would be perfectly fine if we created finer-grained raft groups. So in the short run, while we’re sharding by predicate, we might want to avoid sending every read to the leader. In the long run, though, we might prefer fine-grained sharding, and to optimistically read from the leader. Something to keep in mind.
If we update our etcd version, we can get a “safe” non-lease-based ReadIndex implementation. See https://github.com/coreos/etcd/commit/710b14ce56c4e4c32a5e38229f1325365e7d0988 in particular for what we would get. We would then have two options. The lease-based ReadIndex is the same as before: the follower sends a message to the leader, and the leader sends a response. For the “safe” ReadIndex, the follower sends a message to the leader, the leader broadcasts a heartbeat to all nodes that it’s aware of, and once a majority have responded, the leader responds to the follower.
So, when querying from inside the raft group, we have the options of 1 round trip with the unsafe lease-based ReadIndex, or 2 round trips with the safe ReadIndex implementation.
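To make the difference between the two options concrete, here is a toy model in Go. It is only a sketch: the `leader` and `follower` types and the `leaseReadIndex` and `safeReadIndex` methods are made-up names for illustration, not etcd's API, and the heartbeat round is reduced to checking a `reachable` flag.

```go
package main

import "fmt"

// follower is a hypothetical stand-in for a raft peer. reachable says
// whether it currently answers the leader's heartbeat.
type follower struct {
	id        int
	reachable bool
}

// leader is a hypothetical stand-in for the node that believes it leads.
type leader struct {
	commitIndex uint64
	peers       []follower // every other member of the group
}

// leaseReadIndex answers from the leader's own state without contacting
// anyone: one round trip for the caller, but wrong if the lease is stale.
func (l *leader) leaseReadIndex() uint64 {
	return l.commitIndex
}

// safeReadIndex records the commit index, then counts heartbeat acks.
// It reports ok only if a majority of the group (leader included) answered.
func (l *leader) safeReadIndex() (index uint64, ok bool) {
	index = l.commitIndex
	acks := 1 // the leader counts itself
	for _, p := range l.peers {
		if p.reachable {
			acks++
		}
	}
	groupSize := len(l.peers) + 1
	return index, acks > groupSize/2
}

func main() {
	healthy := &leader{commitIndex: 42, peers: []follower{{id: 1, reachable: true}, {id: 2}}}
	fmt.Println(healthy.safeReadIndex()) // 42 true

	// A partitioned ex-leader still answers lease-based reads, but the
	// safe variant refuses because no majority acknowledges it.
	isolated := &leader{commitIndex: 42, peers: []follower{{id: 1}, {id: 2}}}
	fmt.Println(isolated.leaseReadIndex()) // 42, possibly stale
	fmt.Println(isolated.safeReadIndex())  // 42 false
}
```

The extra round trip in the safe variant is the heartbeat exchange that the loop stands in for.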
When querying from outside the raft group, we could use ReadIndex in the following manner:
1. Send a message to the leader telling it to reply (1st time) with its lastIndex right away, but also to run ReadIndex (with the “safe” option), and then later reply (2nd time) confirming the value.
2. Once we get the 1st reply, perform the read (optimistically) on whatever replica we want (telling it to wait for that watermark).
3. Wait for the 2nd reply from the leader confirming that the read was valid.
This lets us use 2 round trips (typically) with the safe ReadIndex instead of 3. We could do the same from within the group, just to start the actual read operation faster.
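Roughly, the three steps could look like the Go sketch below. Everything in it is hypothetical (`leaderStub`, `replicaStub`, `startRead`, `readAt`, `optimisticRead`); the leader's heartbeat round is reduced to a goroutine that delivers a boolean, and a replica that is behind returns an error instead of waiting for the watermark.

```go
package main

import (
	"errors"
	"fmt"
)

// leaderStub is a hypothetical leader; hasQuorum stands in for the
// outcome of the "safe" ReadIndex heartbeat round.
type leaderStub struct {
	lastIndex uint64
	hasQuorum bool
}

// startRead gives the 1st reply (lastIndex) immediately and returns a
// channel on which the 2nd reply (the confirmation) arrives later.
func (l *leaderStub) startRead() (uint64, <-chan bool) {
	confirmed := make(chan bool, 1)
	ok := l.hasQuorum
	go func() { confirmed <- ok }() // stands in for the heartbeat round
	return l.lastIndex, confirmed
}

// replicaStub is a hypothetical replica that has applied entries up to
// and including applied.
type replicaStub struct {
	applied uint64
	data    map[string]string
}

// readAt serves the read if the replica has reached index. A real replica
// would wait for the watermark; this sketch just reports an error.
func (r *replicaStub) readAt(index uint64, key string) (string, error) {
	if r.applied < index {
		return "", errors.New("replica has not reached the requested index")
	}
	return r.data[key], nil
}

// optimisticRead follows the three steps: get the index, read on any
// replica, then wait for the leader to confirm the index was valid.
func optimisticRead(l *leaderStub, r *replicaStub, key string) (string, error) {
	index, confirmed := l.startRead() // step 1: 1st reply
	val, err := r.readAt(index, key)  // step 2: optimistic read
	if err != nil {
		return "", err
	}
	if ok := <-confirmed; !ok { // step 3: 2nd reply
		return "", errors.New("leader could not confirm the read index")
	}
	return val, nil
}

func main() {
	l := &leaderStub{lastIndex: 7, hasQuorum: true}
	r := &replicaStub{applied: 7, data: map[string]string{"name": "alice"}}
	fmt.Println(optimisticRead(l, r, "name")) // alice <nil>
}
```

The replica read in step 2 overlaps with the leader's heartbeat round, which is where the saved round trip comes from. If the confirmation fails, the optimistic result has to be thrown away.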
Regardless, we could just say, hey, we’re going to accept 3 round-trips from outside the group, and 2 from within.
In any case, we will also need to change the implementation of Watermark so that we can wait for a watermark without using time.Sleep.
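One way to wait without polling is a condition variable. The sketch below is a simplified, hypothetical `watermark` type (not our actual Watermark implementation, and the method names are made up) that uses `sync.Cond` from the standard library so waiters are woken exactly when the mark advances.

```go
package main

import (
	"fmt"
	"sync"
)

// watermark is a hypothetical, simplified stand-in: it tracks only the
// highest index known to be done.
type watermark struct {
	mu      sync.Mutex
	cond    *sync.Cond
	doneTil uint64
}

func newWatermark() *watermark {
	w := &watermark{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// setDoneUntil advances the watermark (never backwards) and wakes waiters.
func (w *watermark) setDoneUntil(index uint64) {
	w.mu.Lock()
	if index > w.doneTil {
		w.doneTil = index
	}
	w.mu.Unlock()
	w.cond.Broadcast()
}

// waitFor blocks until the watermark reaches index and returns the
// watermark value observed at that point. No time.Sleep involved.
func (w *watermark) waitFor(index uint64) uint64 {
	w.mu.Lock()
	defer w.mu.Unlock()
	for w.doneTil < index {
		w.cond.Wait() // releases mu while blocked, reacquires on wake-up
	}
	return w.doneTil
}

func main() {
	w := newWatermark()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		fmt.Println("read may proceed at", w.waitFor(5))
	}()
	for i := uint64(1); i <= 5; i++ {
		w.setDoneUntil(i)
	}
	wg.Wait()
}
```

A caveat with this approach: `sync.Cond` has no built-in timeout or cancellation, so if waiters need to give up (for example when a query is cancelled), a channel-based design would probably fit better.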