Monday, September 28, 2009

Safe and Effective Fine-grained TCP Retransmission for Datacenter Communication

Summary
This paper looks at the problem of TCP Incast. The paper states this occurs when three conditions are met in a network system:
  1. High-bandwidth, low-latency networks with switches that have small buffers
  2. Clients that have barrier-synchronized requests (that is, requests that need to wait for all participants to respond to make progress)
  3. Responses to those requests that each return only a small amount of data
What happens in these situations is that all the participants respond simultaneously and overflow the (small) switch buffers. These losses force the senders to wait at least the TCP minimum retransmission timeout (RTOmin) before retransmitting. Since the client is waiting for all responses, and the network is much faster than RTOmin, the link goes idle, often achieving only a few percent of its capacity.
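To make the collapse concrete, here is a back-of-envelope sketch of my own (not a calculation from the paper), assuming a 1 Gbps link, the 1 MB block size used in the paper's workload, and Linux's default 200 ms RTOmin, with a single timeout stalling the whole barrier:

```python
# Back-of-envelope incast goodput estimate (illustrative numbers only).
LINK_GBPS = 1.0          # assumed datacenter link speed
BLOCK_BYTES = 1 << 20    # 1 MB block, as in the paper's workload
RTO_MIN_S = 0.200        # Linux's default RTOmin of 200 ms

transfer_s = BLOCK_BYTES * 8 / (LINK_GBPS * 1e9)   # ~8.4 ms to move the block
stalled_s = transfer_s + RTO_MIN_S                 # one timeout stalls the barrier

goodput_fraction = transfer_s / stalled_s
print(f"transfer time: {transfer_s * 1e3:.1f} ms")
print(f"goodput with one 200 ms timeout: {goodput_fraction:.1%} of link capacity")
```

Under these assumptions a single coarse-grained timeout per block already pushes the link down to roughly 4% utilization, which is consistent with the "few percent" collapse described above.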

The paper runs a simulation of a system that fetches 1MB blocks, as well as testing it on a real cluster. They show that the simulation matches reality fairly closely, modulo some real-world jitter not present in the simulation.

The approach taken to solving the problem is to reduce RTOmin to a much smaller value. The current Linux TCP stack uses jiffy timers, which means the lowest practical RTOmin is 5ms. It turns out that much lower values, on the order of 5µs, are preferable.
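To illustrate why the timer's clock granularity matters, here is a rough sketch of the textbook RTO computation (roughly the RFC 6298 formula, not the paper's kernel code), with the tick size and RTOmin floor applied. The RTT numbers are assumed datacenter-scale values, not measurements from the paper:

```python
# Sketch of the standard TCP RTO calculation, showing how the clock tick size
# and the RTOmin floor bound the result. Illustrative only.
import math

def rto(srtt_s, rttvar_s, granularity_s, rto_min_s):
    raw = srtt_s + max(granularity_s, 4 * rttvar_s)
    quantized = math.ceil(raw / granularity_s) * granularity_s  # round up to ticks
    return max(quantized, rto_min_s)

# Assumed datacenter-scale RTTs: ~100 us with ~20 us of variance.
srtt, rttvar = 100e-6, 20e-6

print("jiffy clock (1 ms), 200 ms floor:", rto(srtt, rttvar, 1e-3, 200e-3))
print("jiffy clock (1 ms), no floor:    ", rto(srtt, rttvar, 1e-3, 0))
print("microsecond clock, 200 us floor: ", rto(srtt, rttvar, 1e-6, 200e-6))
```

Even with the 200 ms floor removed, the 1 ms jiffy tick alone keeps the RTO an order of magnitude above a ~100 µs datacenter RTT; a microsecond-resolution clock lets the RTO shrink to RTT scale.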

To make very short timeouts possible, the authors modify the TCP stack to use high-resolution hardware timers and show that the incast problem is greatly reduced with such short timeouts. In fact, goodput remains fairly constant, even with 45 clients, when RTOmin is allowed to go as small as possible.

Finally, the authors look at the effects reducing RTOmin could have in the wide area and conclude that it should not affect performance significantly. They also note that there is not much overhead in maintaining such short timers, as they only fire when packets are actually lost.
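The reason the overhead stays low is that the retransmission timer is re-armed on every send and pushed back (or cancelled) on every ACK, so in the common no-loss case it never actually expires. Here is a toy model of that timer management, my own conceptual sketch rather than kernel code:

```python
# Toy model of TCP retransmission-timer management: the timer is armed on each
# send, pushed back or cancelled on each ACK, and only *fires* when no ACK
# arrives within the RTO. Conceptual sketch, not kernel code.

class RetransmitTimer:
    def __init__(self, rto_s):
        self.rto_s = rto_s
        self.deadline = None   # absolute time at which we would retransmit
        self.fired = 0

    def on_send(self, now_s):
        self.deadline = now_s + self.rto_s      # arm the timer

    def on_ack(self, now_s, more_outstanding):
        # Common case: data is ACKed in time, timer is pushed back or cancelled.
        self.deadline = now_s + self.rto_s if more_outstanding else None

    def on_tick(self, now_s):
        if self.deadline is not None and now_s >= self.deadline:
            self.fired += 1                      # loss: retransmit and back off
            self.deadline = now_s + 2 * self.rto_s

timer = RetransmitTimer(rto_s=200e-6)
t = 0.0
for _ in range(1000):          # 1000 segments, every one ACKed in time
    timer.on_send(t)
    t += 100e-6                # ACK arrives well within the 200 us RTO
    timer.on_ack(t, more_outstanding=False)
    timer.on_tick(t)
print("timer fired", timer.fired, "times")   # 0: no loss, so no extra work
```

With no losses the timer never fires, so even a very aggressive RTO costs little more than arming and cancelling it.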

Comments
I thought this was a good paper, thorough in its description and analysis of the problem. The actual technical change is small, but that is sort of the point: this is a real problem, and it turns out not to be that hard to solve. I liked that they ran simulations, but also real experiments to validate them. The discussion of the wider impact of short timers was also good, as this is a real concern when deploying their solution.

It would have been interesting, however, to see the experiments run on a wider variety of workloads, as it's not clear the same patterns would be seen. Also, how serious is incast if you have barrier synchronization but only need to wait for (say) half of your nodes to respond, instead of all of them?
