Take time to watch this video, and read his paper. Jeff Dean, along with Sanjay Ghemawat, is a pioneer at Google, having led almost all of its externally popular infrastructure technologies: MapReduce, GFS, Bigtable, and now the neural network stuff.
Amazing point about sending backup requests to replicas only if the original one takes over X ms. From his benchmarks:
No backups:
- Avg: 33 ms
- Std Dev: 1524 ms
- 99.9%ile: 994 ms

Backup after 10 ms:
- Avg: 14 ms
- Std Dev: 4 ms (!!)
- 99.9%ile: 50 ms (!!!)
- Increase in request load: < 5% extra requests
This is a huge difference! And it’s selective, in the sense that you only send out requests to replicas if the original one doesn’t reply within X ms. That’s a great sweet spot between controlling the number of network calls (and hence throughput) and still improving tail latencies dramatically. Love it! We’ll have to figure out what X is for us.
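Here’s a minimal Go sketch of the idea, assuming a hypothetical sendRequest RPC helper (not from the talk); whichever replica answers first wins, and the shared context abandons the loser:

```go
// Package hedge is a minimal sketch of backup ("hedged") requests.
package hedge

import (
	"context"
	"time"
)

// Response is a placeholder for whatever your RPC layer returns.
type Response struct{ Data []byte }

// sendRequest stands in for the real RPC call; it is hypothetical.
func sendRequest(ctx context.Context, replica string) (*Response, error) {
	// ... real network call would go here ...
	return &Response{}, nil
}

// Do sends the request to the primary replica. If no reply arrives within
// hedgeDelay (the "X ms" above), it fires a backup request to a second
// replica and returns whichever response comes back first.
func Do(ctx context.Context, primary, backup string, hedgeDelay time.Duration) (*Response, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // abandon the slower call once we have a winner

	type result struct {
		resp *Response
		err  error
	}
	// Buffered so the losing goroutine can send its result and exit.
	ch := make(chan result, 2)

	go func() {
		resp, err := sendRequest(ctx, primary)
		ch <- result{resp, err}
	}()

	select {
	case r := <-ch: // primary replied within the hedge delay
		return r.resp, r.err
	case <-time.After(hedgeDelay):
		// Primary is slow; fire the backup and take whoever finishes first.
		go func() {
			resp, err := sendRequest(ctx, backup)
			ch <- result{resp, err}
		}()
		r := <-ch
		return r.resp, r.err
	}
}
```

A natural choice for the delay is somewhere around your 95th-percentile latency, so the backup fires for only a few percent of requests, which lines up with the < 5% extra load in the numbers above.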
It doesn’t open for me right now, but this magazine looks really interesting. Might be worth subscribing to. They have an article on Parallel Graph Processing in the current May edition.
Send the request to the first replica, telling it that the request is also going to be sent to a second one.
2 ms later, send the request to the second replica, telling it that the request has already been sent to the first one.
When one of them starts processing the request, it sends a cancellation directly to its peer.
If the peer hasn’t started processing the request yet, it just drops it.
Otherwise, both of them process it and overall do twice the work. This last case should be really rare, because the cancellation usually arrives while the request is still sitting in the peer’s queue. A sketch of this dance follows below.
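Here’s a toy, single-process Go model of that cancellation dance; the Server type, in-memory maps, and request IDs are all hypothetical scaffolding (in a real system Process and Cancel would be RPCs between two machines), but the control flow follows the steps above:

```go
// Toy model of tied requests with cross-server cancellation.
package main

import (
	"sync"
	"time"
)

type Server struct {
	mu      sync.Mutex
	started map[string]bool // requests this server has begun processing
	dropped map[string]bool // requests cancelled by the peer while still queued
}

func NewServer() *Server {
	return &Server{started: make(map[string]bool), dropped: make(map[string]bool)}
}

// Cancel is invoked by the peer once it starts processing the shared request.
func (s *Server) Cancel(reqID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.started[reqID] {
		s.dropped[reqID] = true // still queued here, so just drop it
	}
	// If we already started, it's too late: both servers do the work.
}

// Process dequeues the request, tells the peer to cancel, then does the work.
func (s *Server) Process(reqID string, peer *Server, work func()) {
	s.mu.Lock()
	if s.dropped[reqID] {
		s.mu.Unlock()
		return // the peer got to it first; nothing to do
	}
	s.started[reqID] = true
	s.mu.Unlock()

	peer.Cancel(reqID) // tell the other replica to drop its queued copy
	work()
}

func main() {
	a, b := NewServer(), NewServer()
	work := func() { time.Sleep(10 * time.Millisecond) }

	go a.Process("req-1", b, work)   // request reaches the first replica
	time.Sleep(2 * time.Millisecond) // the 2 ms stagger from the steps above
	b.Process("req-1", a, work)      // a has already cancelled b's copy: no-op

	time.Sleep(20 * time.Millisecond) // crude wait for the demo goroutine
}
```

Note that the race here is benign by design: if both replicas grab the request at the same instant, each Cancel arrives after the other has set started, and you land in the "both do the work" case from the last step.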
I think this is even more powerful than the first technique I pointed out above, because it’s something we can throw at any cluster in any environment (GCE, Amazon, our own datacenter) and get good results.
Technically, why the 2 ms delay matters is left as an exercise to the audience; he posed that question in the 45-minute talk at Ricon. The answer is that if the machines are just sitting idle, which happens pretty frequently, both replicas would pick the request up almost simultaneously, and you’d hit the last case, where both process it and overall do twice the work. The small stagger gives the first replica a head start, so its cancellation usually lands before the second replica begins.
Update: We ended up switching to Protocol Buffers after Flatbuffers proved to be slower for our data. The Flatbuffers API is also complicated and ugly, so once the benchmarks were in, moving to PB was an easy decision.