Q&A session with Xiang and Gyu from CoreOS

Questions

What role does hard state play in RAFT?

Persist hard state before you send out any messages to other nodes.

Could the RAFT index be used as the equivalent of a timestamp? That way, we would know which updates to ignore based on the update timestamp stored with the value.

RAFT index can be used for ordering of updates. An index is only reset in case of a complete cluster failure. Otherwise, it’s maintained.
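For illustration, here is a minimal sketch of using the RAFT index as a monotonic version number. The `lastApplied` bookkeeping and the `apply` helper are our own assumptions, not part of the raft package:

```go
import "github.com/coreos/etcd/raft/raftpb"

// lastApplied is our own bookkeeping, persisted alongside the state machine.
var lastApplied uint64

func apply(entry raftpb.Entry) {
	if entry.Index <= lastApplied {
		return // stale or replayed update; ignore it
	}
	// ... apply entry.Data to the state machine ...
	lastApplied = entry.Index
}
```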

Every time a node restarts, it somehow becomes the master. Why does that happen?

Look at the tick carefully. It’s possible that a lot of ticks are happening before the communication is established with the cluster.
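As a rough sketch of what to check (the concrete values and the use of etcd/raft's Config here are illustrative assumptions): the node starts an election once ElectionTick ticks pass without hearing from a leader, so if the tick loop runs for a while before the transport is up, a restarted node will campaign and win.

```go
import (
	"time"

	"github.com/coreos/etcd/raft"
)

func restart(storage *raft.MemoryStorage) {
	c := &raft.Config{
		ID:              0x01, // illustrative node ID
		ElectionTick:    10,   // ~10 ticks without a leader triggers an election
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}
	n := raft.RestartNode(c)

	// Drive the logical clock. If this runs long before peer connections
	// are established, the node will time out and start campaigning.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		n.Tick()
	}
}
```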

What sort of memory overhead should we expect from memory storage? How often do we need to take a snapshot and discard previous entries?

Look at how many entries are stored before compaction and the average size of each entry. Also, Go can use up to 3x that memory. ETCD sets this to 10K entries.
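A hedged sketch of snapshot-then-compact with etcd/raft's MemoryStorage. The 10K threshold mirrors the figure above; the snapshot data, ConfState and the `maybeCompact` helper are our own assumptions:

```go
import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

const snapshotCount = 10000 // entries to keep before compacting, as etcd does

func maybeCompact(storage *raft.MemoryStorage, appliedIndex, snapIndex uint64,
	cs raftpb.ConfState, data []byte) (uint64, error) {
	if appliedIndex-snapIndex <= snapshotCount {
		return snapIndex, nil
	}
	// Record a snapshot of the applied state...
	if _, err := storage.CreateSnapshot(appliedIndex, &cs, data); err != nil {
		return snapIndex, err
	}
	// ...then discard the log entries it covers to bound memory usage.
	if err := storage.Compact(appliedIndex); err != nil {
		return snapIndex, err
	}
	return appliedIndex, nil
}
```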

If we are unable to send messages due to communication breakdown, what do we do? Are they just best effort? Can we send and not worry about whether they were received or not?

Only best effort. RAFT would retry sending messages, so we could use deadline-based contexts.

“It is important that no messages be sent until the latest HardState has been persisted to disk, and all Entries from previous Ready batch” -Does this mean that we just write one HardState to disk, and then parallelize between sending network calls and the rest of the writes (Snapshot and the entries), even for followers?

For a follower, all of the hard state, snapshot and entries have to be persisted before sending messages out; only the leader can do these in parallel. Also, they must not only be persisted to disk but also saved to the memory store before sending.

Do we need to wait for network calls to finish before node.Advance()?

Nope. Network calls are done as best effort. Send and forget.
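Putting the last few answers together, here is a hedged sketch of the Ready/Advance loop on a follower with etcd/raft. `saveToDisk`, `transport`, `storage` (a *raft.MemoryStorage), `apply` and `ticker` are placeholders from the surrounding program, not library APIs:

```go
for {
	select {
	case rd := <-n.Ready():
		// 1. Persist HardState, Entries and Snapshot, and append them to
		//    the in-memory store, BEFORE sending anything out (followers).
		saveToDisk(rd.HardState, rd.Entries, rd.Snapshot) // placeholder WAL write
		if !raft.IsEmptySnap(rd.Snapshot) {
			storage.ApplySnapshot(rd.Snapshot)
		}
		storage.Append(rd.Entries)

		// 2. Send messages best effort; don't wait for replies.
		go transport.Send(rd.Messages)

		// 3. Apply committed entries to the state machine.
		for _, entry := range rd.CommittedEntries {
			apply(entry)
		}

		// 4. Tell raft this batch is done; no need to wait on the network.
		n.Advance()

	case <-ticker.C:
		n.Tick()
	}
}
```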

What’s the communication overhead in practice? How many RAFT groups should we be running per server?

Hundreds of RAFT groups are being run by the TiKV and CockroachDB folks. The network call overhead isn’t much, because a single connection is used to send messages, and they can be batched up.

“Marshalling messages is not thread-safe; it is important that you make sure that no new entries are persisted while marshalling.” We understand the first point about thread-safety, but don’t understand what that has to do with entry persistence. Also, that has to happen before sending anyway, so I reckon this is about the leader?

TODO: Xiang to get back about this.

Batching RAFT messages to reduce sync network I/O calls. How long can we wait? What should be the message batching duration?

Up to several hundred milliseconds; in fact, for ETCD, it’s 10ms. The batch duration should ideally be below a heartbeat. Note that users will see that latency in their writes, so keep it as low as possible.
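A minimal sketch of timer-based batching. The `outgoing` channel and `sendBatch` are placeholders for our own transport code; the 10ms flush interval mirrors the ETCD figure above:

```go
var pending []raftpb.Message

flush := time.NewTicker(10 * time.Millisecond)
defer flush.Stop()

for {
	select {
	case msgs := <-outgoing: // messages collected from rd.Messages
		pending = append(pending, msgs...)
	case <-flush.C:
		if len(pending) > 0 {
			sendBatch(pending) // placeholder: one call per peer over the shared connection
			pending = nil
		}
	}
}
```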

How do we keep network communication down?

Batching up would achieve that. ETCD only uses a single connection between 2 peers dedicated to RAFT.

Seeing a lot of messages even in a 3-node cluster, with 2 RAFT groups. Any ideas why?

Most likely they are heartbeats.

Which messages can we drop for optimization?

If a follower has to ACK indexes 3, 4, 5 and 6, it can drop the ACKs for 3, 4 and 5. Typically these are append-entry response messages, and sometimes heartbeat messages. These are optimizations, so do them carefully, and ideally only after we have implemented the system and feel the need to optimize.
https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L1167-L1178
TIP: Search for “ignore” in the raft package.
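For example, here is a hedged sketch of one such optimization: within a single outgoing batch, keep only the highest-index append response per recipient, since it implicitly acknowledges the earlier ones. `dedupAppResp` is our own helper, not part of the raft package:

```go
import "github.com/coreos/etcd/raft/raftpb"

func dedupAppResp(msgs []raftpb.Message) []raftpb.Message {
	best := make(map[uint64]int) // recipient ID -> position of its best MsgAppResp
	out := make([]raftpb.Message, 0, len(msgs))
	for _, m := range msgs {
		if m.Type != raftpb.MsgAppResp || m.Reject {
			out = append(out, m)
			continue
		}
		if i, ok := best[m.To]; ok {
			if m.Index > out[i].Index {
				out[i] = m // a newer ACK supersedes the older one
			}
			continue
		}
		best[m.To] = len(out)
		out = append(out, m)
	}
	return out
}
```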

How long should we wait before discarding a disconnected node?

In etcd, the timeout = 1 second. There are 2 scenarios (a sketch of both follows the list):

  1. Report to RAFT that a node is unreachable, i.e. temporarily down. So, the leader won’t send messages aggressively.
  2. Once we know it’s down for sure, remove it from the cluster. Also, you’ll need a way to track the removed nodes, so they don’t try to join the cluster again. Note that a snapshot can end up deleting the node rejection log, so the node can get confused and try to connect to the leader again and again. Having an external mechanism to check before a node starts or restarts can avoid that.
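A hedged sketch of both scenarios with etcd/raft. `transport`, `peerID`, `msgs` and `removedPeers` are placeholders from the surrounding program:

```go
// Scenario 1: transient failure -- tell raft the peer is unreachable so the
// leader backs off instead of sending aggressively.
if err := transport.Send(peerID, msgs); err != nil {
	n.ReportUnreachable(peerID)
}

// Scenario 2: the node is gone for good -- propose a conf change to remove it,
// and record the removal externally so it can't rejoin later.
cc := raftpb.ConfChange{
	Type:   raftpb.ConfChangeRemoveNode,
	NodeID: peerID,
}
if err := n.ProposeConfChange(context.TODO(), cc); err != nil {
	// handle the error
}
// Once the conf change entry is committed, apply it via n.ApplyConfChange(cc).
removedPeers[peerID] = true // placeholder external record checked before a node (re)starts
```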

How do we run read-only queries? Based on the paper, a read requires the leader talking to all the followers to ensure that it’s the leader and waits for the state machine to advance to readIndex. Ideally, we want to avoid any RAFT related network calls during a query. What’s the best way here?

TODO: Xiang to get back about this.

Side note: for peer-to-peer communication, they don’t use gRPC. gRPC should support health checking.

How do we track which group is being served by which server?

TiKV uses ETCD to do that: a placement driver tracks all the servers and figures out which server can participate in which RAFT group.
CockroachDB uses a gossip protocol.
Initially we (Dgraph) implemented a group which every server in the cluster was supposed to be a part of. But a RAFT group shouldn’t ideally have more than 5 or 7 nodes. So, now we’ll consider having this special group run only on the first 5 servers, and having the others communicate with this group for updates.

Should we distribute writes to all nodes in the cluster, or only send to master?

Distribute writes to followers for increased throughput at the cost of additional latency.
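A small hedged sketch: with etcd/raft, a write can be proposed on any node, and a follower forwards the proposal to the current leader (one extra hop, hence the added latency). `n` and `payload` are placeholders:

```go
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
if err := n.Propose(ctx, payload); err != nil {
	// the proposal was dropped or timed out; the client should retry
}
```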

Participants: @mrjn, @xiang90, @pawan, @jchiu


Ping, @xiang90! Could you please expand on the TODOs here?

See pb.Message.Marshal (and possibly others) not thread-safe · Issue #4285 · etcd-io/etcd · GitHub. This is a proto lib internal issue.

See https://ramcloud.stanford.edu/~ongaro/thesis.pdf, section 6.4. etcd/raft supports both the lease-based and the leader-based approach.
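A hedged sketch of the leader-based (ReadIndex) approach with etcd/raft; `reqID` is an arbitrary request token, and matching it against rd.ReadStates happens in the Ready loop. For the lease-based variant, set Config.ReadOnlyOption = raft.ReadOnlyLeaseBased together with CheckQuorum = true.

```go
// Issue the read request.
reqID := []byte("read-42") // illustrative unique request context
if err := n.ReadIndex(context.TODO(), reqID); err != nil {
	// handle the error
}

// Later, inside the Ready loop:
for _, rs := range rd.ReadStates {
	if bytes.Equal(rs.RequestCtx, reqID) {
		// Wait until appliedIndex >= rs.Index, then serve the read locally,
		// without any further network calls.
	}
}
```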
