Monday, August 31, 2009

End-to-End Arguments in System Design

This paper argues that one should not focus on pushing lots of features into the network layer, since many or most of those features will need to be re-implemented by the applications using them anyway. A number of examples, both theoretical and actual, are given to bolster the claim.

A good amount of time is spent on a 'careful file transfer' application that wants to transfer files reliably. The paper points out that, given possible hardware and software failures at all layers of the stack, this application will have to do a full check of the received file regardless of the guarantees provided by the network layer, so the network layer shouldn't bother to try too hard. The paper does acknowledge, however, that a network that is too unreliable will be a performance problem.
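To make the careful file transfer idea concrete, here is a minimal sketch of an end-to-end check in Python. It is not the paper's code: `unreliable_transfer`, `careful_transfer`, and the `corrupt_prob` parameter are hypothetical names invented for illustration, and the "network" is simulated by randomly flipping a byte. The point is only that the sender's checksum is compared against the data as finally received, and the whole transfer is retried on a mismatch, whatever the lower layers promise.

```python
import hashlib
import random


def checksum(data: bytes) -> str:
    """End-to-end checksum computed by the application itself."""
    return hashlib.sha256(data).hexdigest()


def unreliable_transfer(data: bytes, corrupt_prob: float, rng: random.Random) -> bytes:
    """Hypothetical stand-in for the whole path (disk, host software, network).

    With probability corrupt_prob, one byte is flipped in transit.
    """
    if data and rng.random() < corrupt_prob:
        corrupted = bytearray(data)
        corrupted[rng.randrange(len(corrupted))] ^= 0xFF
        return bytes(corrupted)
    return data


def careful_transfer(data: bytes, corrupt_prob: float = 0.3,
                     max_attempts: int = 10, seed: int = 0):
    """Transfer data, verify it end to end, and retry on a mismatch.

    Returns (received_bytes, attempts_used).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    expected = checksum(data)  # computed at the source endpoint
    for attempt in range(1, max_attempts + 1):
        received = unreliable_transfer(data, corrupt_prob, rng)
        # The check happens at the destination endpoint, after every
        # lower layer has already done whatever it is going to do.
        if checksum(received) == expected:
            return received, attempt
    raise IOError(f"transfer failed after {max_attempts} attempts")


if __name__ == "__main__":
    payload = b"careful file transfer " * 100
    received, attempts = careful_transfer(payload)
    print(f"verified {len(received)} bytes after {attempts} attempt(s)")
```

Raising `corrupt_prob` in this sketch drives up the number of whole-file retries, which is the performance concern the paper concedes: lower-layer reliability is worth having as an optimization, just not as a substitute for the final check.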

Many systems that do replication run into similar problems and come to similar conclusions. The answer seems to be: leave it up to the application, since it's the only thing that understands the semantics of the data. Coda, for example, punts resolution to the user/application level if a file is modified offline in conflicting ways.

While I agree with the major points this paper makes, one must remember that error recovery code is very difficult to write, and that there is much to be gained by pushing difficult code into lower levels, allowing it to be written by 'expert' programmers, easing application development. Many developers are willing to give up a bit of performance *and* correctness in exchange for it being easy to write their program. Thus TCP remains far more popular than UDP, even though the arguments in this paper suggest most applications should use UDP.

Being a systems person, I found myself wanting some sort of performance test showing just how much more we can squeeze out of a system by doing only end-to-end error checking. The paper felt a bit hand-wavy as is, even though the arguments are compelling.

The recognition that application-specific semantics need to be involved in error recovery is a very useful one, however, and its application to the network domain is interesting. As such, I would argue that this paper should be kept in the syllabus.