Monday, August 31, 2009

End-to-End Arguments in System Design

This paper argues that one should not push lots of features into the network layer, since many or most of those features will need to be re-implemented by the applications using it anyway. A number of examples, both theoretical and actual, are given to bolster the claim.

A good amount of time is spent on a 'careful file transfer' application that wants to move files reliably. The paper points out that, given possible hardware and software failures at every layer of the stack, this application will have to do a full check of the received file regardless of any guarantees the network layer provides, so the network layer shouldn't bother trying too hard. The paper does acknowledge, however, that a network that is too unreliable becomes a performance problem.
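To make that concrete, here is a minimal sketch (my own, not from the paper) of what the end-to-end check has to look like: the receiver re-reads the file it actually wrote to disk and compares a whole-file digest against one supplied by the sender, retrying the transfer on a mismatch. The function names are illustrative only.

```python
# Minimal sketch of the end-to-end check in 'careful file transfer'.
# The transfer mechanism itself is abstracted away as `transfer_once`;
# all names here are hypothetical, not from the paper.
import hashlib

def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file as it exists on disk, so disk/bus/OS errors are caught too."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def careful_receive(transfer_once, dest_path: str, sender_digest: str,
                    max_attempts: int = 3) -> bool:
    """Repeat the transfer until the end-to-end check passes (or give up)."""
    for _ in range(max_attempts):
        transfer_once(dest_path)      # any transport will do: reliable or not
        if file_digest(dest_path) == sender_digest:
            return True               # end-to-end check succeeded
    return False                      # persistent failure: report to the user
```

The point is that this check sits at the endpoints no matter what the network promises; a perfectly reliable network only changes how often the retry loop runs.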

Many systems that do replication run into similar problems and come to similar conclusions: leave it up to the application, since it is the only layer that understands the semantics of the data. Coda, for example, punts resolution to the user/application level when a file is modified offline in conflicting ways.
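A tiny, made-up sketch of why only the application can do this: given two diverged replicas, whether they can be merged automatically depends entirely on what the data means, which the storage or network layer cannot know.

```python
# Hedged illustration (not Coda's actual interface): conflict resolution
# needs data semantics, so it lives at the application level.
class ConflictError(Exception):
    """Raised when the system can detect a conflict but not resolve it."""

def resolve(kind: str, mine, theirs):
    if kind == "appointment_set":
        return mine | theirs          # sets of appointments: union is safe
    if kind == "hit_counter":
        return mine + theirs          # concurrent increments just add up
    # Unknown semantics: all the system can do is surface both versions
    # to the user, which is essentially what Coda does.
    raise ConflictError((mine, theirs))
```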

While I agree with the paper's major points, one must remember that error-recovery code is very difficult to write, and that there is much to be gained by pushing difficult code into lower layers, where it can be written by 'expert' programmers, easing application development. Many developers will happily give up a bit of performance *and* correctness in exchange for a program that is easier to write. Thus TCP remains far more popular than UDP, even though the arguments in this paper suggest most applications should use UDP.
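To be fair about how much work TCP is saving, here is a rough, made-up sketch of the bookkeeping even a toy do-it-yourself reliability scheme over UDP needs (stop-and-wait, with placeholder addresses and sizes): sequence numbers, timeouts, and retransmission, all things the application author otherwise gets for free.

```python
# Toy stop-and-wait sender over UDP; everything here is illustrative.
import socket
import struct

def send_reliable(data: bytes, addr=("127.0.0.1", 9000),
                  timeout=0.5, retries=5, chunk=1024):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    seq = 0
    for offset in range(0, len(data), chunk):
        packet = struct.pack("!I", seq) + data[offset:offset + chunk]
        for _ in range(retries):
            sock.sendto(packet, addr)
            try:
                ack, _ = sock.recvfrom(4)
                if struct.unpack("!I", ack)[0] == seq:
                    break             # chunk acknowledged, move to the next one
            except socket.timeout:
                continue              # packet or ack lost: retransmit
        else:
            raise IOError("peer unreachable after %d retries" % retries)
        seq += 1
```

And, per the paper, the sender would still want the whole-file digest check on top of this.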

Being a systems person, I found myself wanting some sort of performance test showing just how much more we can squeeze out of a system by doing only end-to-end error checking. The paper felt a bit hand-wavy as is, even though the arguments are compelling.

The recognition that application-specific semantics need to be involved in error recovery is a very useful one, however, and its application to the network domain is interesting. As such, I would argue that this paper should be kept in the syllabus.

2 comments:

  1. I agree that reduced complexity is one reason why developers might appreciate more intelligence from the network layer. Perhaps the way to solve the complexity problem is by further layering -- that is, by introducing libraries and higher-level protocols that run on top of the network layer. Given that, I think the end-to-end principle is really about how to decide in which layer to place a given piece of functionality: toward the top and closer to the application, or toward the bottom and closer to the hardware?

    I think the paper is definitely not arguing that "most applications should use UDP" rather than TCP. I think the E2E principle would say that there are many different kinds of delivery guarantees; the exact requirements depend on the application. In some (most) cases, the application's reliable delivery requirements can be implemented by layering on top of TCP -- but that isn't always the case. Therefore the network shouldn't require TCP-like reliable delivery behavior as a basic primitive, because that would constitute prematurely optimizing the network layer for a certain class of applications.

  2. Cool! Someone commented on your posting!

    Both of today's papers are pretty philosophical and non-quantitative. It is an interesting question whether these papers would ever be published today -- and the answer is quite likely no. Nevertheless, it is interesting, from the vantage point of 2009, to look back at the philosophies and design tradeoffs as they were articulated at the time.
