Wednesday, September 30, 2009

MACAW: A Media Access Protocol for Wireless LAN's

Summary
This paper develops a new link-layer access protocol for mobile wireless devices and evaluates its performance in a number of different configurations. The devices in this case are pads, and they communicate with base stations. Because the pads are mobile, and the range of each base station is quite small, this is a complex problem.

The authors first show why a carrier-sense approach fails. Specifically, if two pads are in range of a base station, but not of each other, they will both sense the medium, find nothing, and send, and their transmissions will collide at the base station (the classic hidden terminal problem). This indicates that in wireless networks conflicts occur at the receiver, not the sender.

The MACA protocol attempts to deal with this problem by using a Request To Send (RTS) packet, which a receiver replies to with a Clear To Send (CTS) packet if there is no congestion at the receiver. By snooping on this traffic, nodes can coordinate and not step on each other, and congestion control is performed at the receiver, as desired.
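To make the exchange concrete, here is a minimal sketch of the MACA logic in Python. The `medium` send/receive interface and the `defer` callback are hypothetical stand-ins for the radio layer, not anything from the paper.

```python
def sender_try_send(medium, dest, data):
    """Sender side of MACA: ask permission before transmitting."""
    medium.send(("RTS", dest, len(data)))      # announce intent and length
    reply = medium.receive(timeout=1.0)
    if reply == ("CTS", dest, len(data)):      # receiver sees no local congestion
        medium.send(("DATA", dest, data))
        return True
    return False                               # no CTS: back off and retry later

def bystander_on_overhear(packet, defer):
    """Nodes overhearing control traffic defer, which is how contention is
    resolved at the receiver rather than the sender."""
    kind, _, length = packet
    if kind == "CTS":    # a nearby receiver is about to receive `length` bytes
        defer(length)    # stay quiet for the whole data transmission
    elif kind == "RTS":  # a nearby sender asked permission; stay quiet just
        defer(1)         # long enough for the CTS it is waiting for
```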

MACAW improves on MACA in a number of key ways:
  • They make the backoff algorithm less aggressive when there is a loss. In addition, to improve fairness, they share backoff information between hosts (a sketch of this scheme follows the list). In MACA this was a problem, as one host could end up dominating a link while the others all backed off.
  • They introduce a fairer contention model that tries to allocate bandwidth equally over each stream of data, rather than each sender (so a base station sending to two pads will get half the available bandwidth instead of one third).
  • ACKs are added for faster retransmit since TCP's timeout is more geared to the wide area and therefore has timeouts that are too long.
  • They add a Data Sending (DS) packet that gets sent after a CTS message. This prevents the case where a host hears the RTS but not the CTS. This is a bit counter-intuitive, as it would seem safe for a host to send in this case. However, with the bi-directional RTS-CTS-DATA exchange congestion can occur at both ends of the link, so the DS packet is necessary.
  • Finally, they fix the situation in which two pads are communicating with different base stations, both want to receive data, and they are close enough to interfere with each other. In this case one pad will likely dominate the channel, as the other can only send a CTS if it gets an RTS right after the first pad finishes a data transmission, which is a small window. To fix this the authors add a Request-for-Request-to-Send (RRTS) packet, which can prompt a sender to re-send its RTS. This doesn't deal with the case where one pad is totally blocking the channel, however, since then the second host will never even hear the initial RTS packet.
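The shared-backoff idea is easy to express in code. Below is my sketch of a MACAW-style counter with multiplicative increase and linear decrease, plus per-packet copying of the counter for fairness; the constants are illustrative, not taken from the paper's simulator.

```python
BO_MIN, BO_MAX = 2, 64   # assumed bounds on the backoff counter

class Backoff:
    def __init__(self):
        self.counter = BO_MIN

    def on_collision(self):
        # Multiplicative increase, gentler than binary exponential backoff.
        self.counter = min(int(self.counter * 1.5), BO_MAX)

    def on_success(self):
        # Linear decrease, so one success doesn't collapse the counter.
        self.counter = max(self.counter - 1, BO_MIN)

    def copy_from_header(self, advertised):
        # Every packet carries the sender's current counter and hearers
        # adopt it, so all contenders converge on one shared backoff value
        # instead of one lucky host dominating the link.
        self.counter = advertised
```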

Comments
This was an interesting paper, and they propose a number of interesting and clever solutions to deal with the problems one encounters in a wireless environment. The paper does feel a bit like they thought up some situations and picked specific solutions for them, however, and since their results are all simulated versions of those same situations, it's no surprise they do better. Their approach is also slower than standard MACA on an uncontended link (due to the extra RTS/CTS/ACK traffic), so it's not clear MACAW is always better.

I would like to have seen more results from a general deployment, or an attempt to generalize their solutions and show why other methods wouldn't work, and/or why the situations they consider in the paper cover the majority of situations mobile clients will run into in the real world.

An interesting paper nonetheless, and one I would keep on the reading list.

Tuesday, September 29, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

Summary
This paper tackles the same issue as the previous CMU one, but follows a rather different methodology and comes up with some different results.

Firstly, their workload uses a fixed fragment size, rather than a fixed block size. In the CMU paper the fragment size decreases as the # of senders increases, to simulate a fixed block size. Both of these situations seem quite possible in practice.

They also look at changing parameters other than the RTO, like randomizing the initial RTO value. They found that none of these approaches really seemed to help the incast problem. The hypothesis in the paper is that the switch buffers and the request pattern "resynchronize" the machines, since a buffer overflow affects everyone at the same time.

This paper does not find that a 200µs RTO minimum increases performance as much as the CMU paper did. They argue that this is caused by delayed ACKs: since the network RTT is 2ms, the very short RTO causes unnecessary retransmissions of data.

Another finding of this paper is that disabling delayed ACKs actually reduces performance as it makes TCP overflow its window, causing congestion in the network.

Finally the paper proposes a model to explain the results they see and to predict goodput values for particular configurations. The model explained some of their results, although it isn't clear it will generalize to other workloads.

Comments
This was an interesting paper in that it challenges some of the results from the CMU paper. In particular the CMU paper seemed to argue that reducing the RTO would be a silver bullet and, in large part, solve the incast problem. This paper suggests this might not be the case. This is a very interesting result and shows that this area needs more study.

The model struck me as a bit too simple, as I did not see how it would adjust to changing workloads, but it's a good start and, if made richer, could be a very useful tool in studying this area.

Monday, September 28, 2009

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

Summary
This paper looks at the problem of TCP Incast. The paper states this occurs when three conditions are met in a network system:
  1. High-bandwidth, low-latency networks with switches that have small buffers
  2. Clients that have barrier-synchronized requests (that is, requests that need to wait for all participants to respond to make progress)
  3. Responses to those requests that don't return very much data.
What happens in these situations is that all the participants respond simultaneously and overflow the (small) switch buffers. This causes losses, and the sender waits at least the TCP minimum retransmission timeout (RTOmin) before retransmitting. Since the client is waiting for all responses, and the network is much faster than RTOmin, the link goes idle, often achieving only a few percent of its capacity.
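A little arithmetic makes the collapse vivid. The numbers below are made up but plausible (a 1 Gbps link, a 256 KB block, and stock TCP's 200 ms RTOmin); they are not measurements from the paper.

```python
link_bps   = 1e9               # 1 Gbps link
block_bits = 256 * 1024 * 8    # one barrier-synchronized block
rto_min    = 0.200             # stock TCP minimum RTO, in seconds

transfer_time = block_bits / link_bps        # ~2 ms to move the block
# One lost response stalls the whole barrier for a full RTOmin:
goodput = block_bits / (transfer_time + rto_min)
print(f"{goodput / 1e6:.1f} Mbps, i.e. {100 * goodput / link_bps:.1f}% of capacity")
# -> roughly 1% of the link: it sits idle while everyone waits on the timeout.
```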

The paper runs a simulation of a system that fetches 1MB blocks as well as testing it on a real cluster. They show that the simulation fairly closely matches reality, modulo some real world jitter not present in the simulation.

The approach taken to solving the problem is to reduce the RTOmin to a much smaller value. The current Linux TCP stack uses jiffy timers, which turns out to mean the lowest practical RTOmin is 5ms. It turns out that much lower timeouts, on the order of 5µs, are preferable.

To make very short timeouts possible the authors modify the TCP stack to use high resolution hardware timers and show that the incast problem is greatly reduced by using such short timeouts. In fact, goodput remains fairly constant, even with 45 clients, when the RTOmin is allowed to go as small as possible.
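A rough way to see the effect of timer granularity and the configured floor on the achievable timeout (the numbers are assumed for illustration, not from the paper):

```python
def effective_rto(rtt_estimate, timer_granularity, rto_min):
    # A stack can never time out faster than its clock ticks or its
    # configured minimum, whichever is larger.
    return max(rtt_estimate, timer_granularity, rto_min)

dc_rtt = 100e-6                                  # ~100 us datacenter RTT
print(effective_rto(dc_rtt, 4e-3, 5e-3))         # jiffy timers: stuck at 5 ms
print(effective_rto(dc_rtt, 1e-6, 200e-6))       # hi-res timers: e.g. a 200 us floor
```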

Finally the authors look at the effects reducing RTOmin could have in the wide area and conclude that it should not affect performance significantly. They also note that there is not much overhead in maintaining such short timers, as they only get triggered when packets are actually lost.

Comments
I thought this was a good paper, thorough in its description and analysis of the problem. The actual technical work was not too much, but that is sort of the point: we have a real problem here, and it's actually not that hard to solve. I liked that they ran simulations, but also real experiments to validate them. The discussion of the wider impact of short timers was also good, as this is a real concern in implementing their solution.

It would have been interesting to see the experiments run on more varied workloads, however, as it's not clear the same patterns would be seen. Also, how serious is incast if you have barrier synchronization but only need to wait for (say) half of your nodes to respond, instead of all of them?

Tuesday, September 22, 2009

VL2: A Scalable and Flexible Data Center Network

Summary
VL2 is another approach to solving Ethernet's scaling problems. The authors set out three goals for the design of their new system:
  • Always achieve the max server-to-server capacity, limited only by the NICs on the sender and receiver
  • Don't allow one service to degrade the performance of another
  • Allow addresses to move to anywhere in the data center (i.e. a flat address space)

To achieve these goals VL2 plays a trick similar to PortLand's, in which a server's address encodes its location. Location-specific Addresses (LAs) are IP addresses, and the system can use an IP-based link-state routing protocol to do efficient routing for those addresses. Application-specific Addresses (AAs) are what end nodes have, and these are hidden from the switches except on the last hop.

The key point is that servers think they are all on the same IP subnet, so they can move anywhere they want and maintain the same AA, which will be turned into a different LA to keep routing correct and efficient.
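A toy sketch of that indirection, with a plain dict standing in for VL2's directory service (the names and shapes here are mine, not the paper's):

```python
# AA (the address applications see) -> LA (where that server currently lives)
directory = {"10.0.0.5": "20.1.3.7"}

def encapsulate(dst_aa):
    # The source side looks up where the AA currently lives and tunnels
    # the packet to that LA. When a server moves, only the directory entry
    # changes; the AA, and every connection addressed to it, stays the same.
    la = directory[dst_aa]
    return {"outer_dst": la, "inner_dst": dst_aa}

directory["10.0.0.5"] = "20.2.9.1"   # server migrated; no host reconfiguration
print(encapsulate("10.0.0.5"))
```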

In order to minimize hotspots and ensure fairness, each flow takes a random path through the network (Valiant load balancing), with ECMP used to spread flows across the equal-cost paths. ARP queries (as in SEATTLE and PortLand) are intercepted and forwarded to a directory service to minimize flooding in the network.
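Per-flow randomization is usually done by hashing the flow identifier, so one flow stays on one path (no reordering) while different flows spread out. A sketch under that assumption; the switch names are hypothetical:

```python
import hashlib

INTERMEDIATES = ["int-1", "int-2", "int-3", "int-4"]  # hypothetical core switches

def pick_intermediate(src, dst, sport, dport):
    # Hash the flow's identity to pick a random-but-stable intermediate
    # switch to bounce the flow through.
    key = f"{src}:{dst}:{sport}:{dport}".encode()
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return INTERMEDIATES[h % len(INTERMEDIATES)]

print(pick_intermediate("20.1.3.7", "20.2.9.1", 4242, 80))
```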

Comments
This paper takes a common industry tack of, 'What do we have, what do we want, and how can we change as little as possible to get there?'. While this leads to very practical designs, one must ask whether these are really the 'best' solutions.

The evaluation was quite good in this paper (wow, real hardware). They show that VL2 performs well under load, and provides fair allocation to services in the face of other services changing their usage patterns.

A major concern I have is that they try to keep their directory servers strongly consistent. While this could prevent odd (and hard to debug) situations in which servers are out of sync, it means that a network partition could make the entire service un-writable, effectively preventing new nodes from entering the network. While they hand-wave a bit about being able to sacrifice response time for reliability and availability (by timing out on one server and then writing to another), this seems to conflict with their statement that the replicated state machine (the actual service that maintains the mappings) 'reliably replicates the update to every directory server'.

Monday, September 21, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

Summary
PortLand is an attempt (much like SEATTLE) to make Ethernet more scalable. The motivations are much the same as the SEATTLE paper (discussed below), although the approach they take is quite different.

PortLand sacrifices some flexibility in exchange for better routing. They do not support arbitrary topologies, but rather assume the network will be organized as a "fat tree". That is, end-nodes connected to some number of levels of aggregation switches, and finally connected to 'core' switches before going out to the wide area. By making this assumption about topology they are able to maintain much less routing information at each switch, and avoid many of the broadcasts that are necessary to establish a topology in traditional Ethernet.

Switches go through a quite clever algorithm in which they can determine if they are 'edge' switches (switches connected to end-nodes), 'aggregate' switches (switches that connect to edge switches and/or aggregate switches below them, but also connect to some number of switches above them), or 'core' switches (only down links internally, and connections to the wide area). This greatly reduces the administration overhead, although it is unclear how well this algorithm will deal with hardware misconfiguration (i.e. things plugged into the wrong place).

End-nodes are assigned a 'Pseudo MAC' (PMAC) which contains not only enough information to specify the target host, but also to locate it. Switches can use a prefix of the PMAC to route packets in the fat tree and can therefore maintain much less routing state.
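If I remember the paper's layout right, the PMAC packs pod.position.port.vmid into the 48 MAC bits (16:8:8:16); treat the exact widths here as an assumption. The point is that prefix routing falls out of bit arithmetic:

```python
def make_pmac(pod, position, port, vmid):
    # pod:16 | position:8 | port:8 | vmid:16 -- 48 bits total
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def pod_of(pmac):
    # A core switch routes on the pod prefix alone, so it needs one
    # table entry per pod rather than one per host.
    return pmac >> 32

pmac = make_pmac(pod=3, position=1, port=2, vmid=7)
assert pod_of(pmac) == 3
```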

There is also a 'fabric manager' which is responsible for handling ARP requests. Edge switches intercept ARP broadcasts, forward them to the fabric manager, which looks up the PMAC for the request, and returns it to the switch, which turns it back into an ARP response. The fabric manager is replicated and uses only soft-state.

Piggybacked on top of the auto-configuration protocol is a keep-alive system. If a location discovery message (they also call it a keepalive) is not received from a switch in a certain amount of time it is assumed to be down. The fabric manager aids in recovering from faults by receiving notifications of failures and relaying those failures to the relevant switches, which can then update their routing tables.

Comments
I quite liked the model this paper presented. The fat tree architecture is what many data-centers already use, and it's a clever way to exploit the topology. I especially thought the auto-configuration of switch positions was nice as it removes a large amount of the configuration overhead for the network. It was unclear, however, how switches were going to know they were connected to the wide area network correctly, specifically, what their default route out should be.

The evaluation section was quite disappointing in this paper. They did not compare themselves to any other systems, nor did I feel they gave a very convincing demonstration of their own scalability. They examine link failures, but don't look at failure of the fabric manager at all. The claim is that the fact that it is replicated will solve the problem, but it's also unclear how inconsistent fabric managers would impact the ability of PortLand to route packets.

I did like their thoughts on migrated VMs, and their ability to correctly migrate TCP connections. This is going to be a useful feature going forward in my opinion.

Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks

Summary
This paper identifies a major problem with existing diagnostic tools: they identify faulty *hosts*, but operators almost always already know which host is faulty, and are instead facing some application-specific problem. Isolating these application faults is what operators really need. Thus the authors identify a conundrum: we want our diagnostic tool to be general (application agnostic) but to give application-specific information.

As a resolution the paper takes an inference-based reasoning approach, looking at feature vectors of various generic and application-specific metrics and comparing their values now to historical data. A dependency graph is generated between all hosts in the system, and directed edges are introduced between components that can directly affect each other. While doing analysis the algorithm assigns high weights to edges that have correlated state. That is, if there is an edge from node A to node B, and the historical states of A and B are similar to their current states, it is likely that something in A's state is causing B's state.
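My rough reading of that weighting, expressed in code; the similarity measure and the min/max composition are my guesses at the shape of the scheme, not the paper's exact formula.

```python
import math

def similarity(now, then):
    # Cosine similarity between a component's current state vector and a
    # historical one (state = the feature vector of its metrics).
    dot = sum(a * b for a, b in zip(now, then))
    norm = (math.sqrt(sum(a * a for a in now)) *
            math.sqrt(sum(b * b for b in then)))
    return dot / norm if norm else 0.0

def edge_weight(history, a_now, b_now):
    # Find past moments when A looked like it does today; if B also looked
    # like it does today at those moments, A plausibly drives B's state.
    scores = [min(similarity(a_now, a_then), similarity(b_now, b_then))
              for a_then, b_then in history]
    return max(scores, default=0.0)
```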

Using this weighted graph the system can identify probable culprits in an anomaly. A study of a number of reports sent to Microsoft showed that they could identify the correct culprit 80% of the time, and almost always have the culprit in their top 5 list.


Comments
This paper reads a bit differently being an industry paper, and can feel a bit like a marketing pitch at times. Nevertheless the technique was quite interesting and it clearly worked since I remember NetMedic being quite popular.

I also found the overall approach interesting. I got the sense that they began by looking at problem reports, figured out what a good solution would look like, and then set about finding a method to produce that solution.

One concern I have with this approach is that it requires you to run NetMedic on all hosts in your network, and statistics gathering software is prone to making systems unstable. Norton was/is notorious for this and one worries that the cure becomes worse than the problem.

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

Summary
This paper describes a new network architecture designed to ease the scalability and manageability problems that Ethernet suffers from. They present a nice overview of how Ethernet works and its limitations, then describe their own architecture and show how it addresses Ethernet's failings.

The authors point out a few key reasons Ethernet runs into trouble in large networks:
  1. Bridge forwarding tables can grow very large because Ethernet uses a flat addressing scheme
  2. Host information needs to be flooded, which is very expensive on a large network
  3. Traffic must be routed on a single spanning tree (to avoid flood loops), but this means it can not always take the shortest path, and puts a high load on nodes near the root of the tree
  4. Using broadcast for DHCP/ARP is expensive
The authors then consider using a number of small Ethernet LANs connected together with IP, which deals with a number of the scalability problems. This approach works on the efficiency front, but is a manageability nightmare since all LANs need to be kept in sync. In addition mobility is a problem, since hosts might move between subnets.

To address these challenges the authors propose two major innovations in SEATTLE. Firstly, link-state information is replicated at every switch using a link-state protocol, which means shortest-path routing can be used. Since this information is fairly stable the approach is reasonable. Secondly, the mappings of IP address to end-host MAC address are stored in a one-hop DHT so they can be looked up efficiently, even with many hosts. This also improves mobility, since a new mapping can be published into the DHT and only needs to be changed at the node(s) responsible for storing that mapping.

There are some tricks the authors use on the DHT front. Consistent hashing is used for the keys so there will be minimal data churn when nodes enter and leave the network. Virtual hosts are used to load balance over machines of different capacities.
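A minimal consistent-hashing ring of the kind this implies, with virtual nodes for capacity-aware balancing; a sketch, not SEATTLE's actual implementation:

```python
import bisect, hashlib

def _h(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, switches, vnodes=10):
        # Each physical switch appears `vnodes` times on the ring; a
        # beefier switch can simply be given more virtual nodes.
        self.points = sorted((_h(f"{s}#{i}"), s)
                             for s in switches for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def owner(self, key):
        # The first point clockwise from the key's hash owns the mapping,
        # so a switch joining or leaving only moves the keys adjacent to it.
        i = bisect.bisect(self.keys, _h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["sw-a", "sw-b", "sw-c"])
print(ring.owner("192.168.1.77"))   # which switch stores this host's mapping
```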

Finally the authors show how existing end hosts would not need to be changed at all to plug into a SEATTLE network. This is quite important for adoption in my mind.

Comments
I'm always happy to see a paper about an application of a DHT. They are elegant answers in search of problems. This was quite a nice use of the technology. The paper also does a very nice job of both describing the current state of the world and highlighting its issues.

A couple of issues I have with the paper:
- The architecture seems quite susceptible to malicious nodes. A node returning incorrect mappings could wreak havoc with a subset of end-hosts. Also, any churn in the network could make routing very unstable. Perhaps this is less of a concern given that switches are more managed pieces of hardware than end-hosts.
- The problem of a host containing the mapping I want being far away from me is a tricky one. On the one hand, SEATTLE caches aggressively, so this might be less of an issue than I think, but their solution of multi-layer DHTs seemed a bit ad-hoc and also a management problem.

Still, an interesting solution to a real problem, and a good read. As this is a very recent paper it will be interesting to see if there is any adoption of SEATTLE, or something similar.

Monday, September 14, 2009

Congestion Control for High Bandwidth-Delay Product Networks

Summary
The title of this paper is a bit misleading. What this paper really does is propose a new protocol for congestion avoidance in the network, one that would replace TCP. The authors argue that TCP performs poorly in networks with a high bandwidth-delay product (hence the title), and argue that these types of networks are becoming more common in today's Internet.

The primary change that the eXplicit Control Protocol (XCP) makes to TCP is that it adds a congestion header that provides explicit notification of congestion to endpoints, rather than relying on dropped packets to tacitly signal congestion like TCP does. This modification allows routers to be more clever about their policies, and interestingly allows routing with almost no dropped packets (in their simulations, where everyone is well behaved).

The protocol (at the endpoints) works roughly as follows:
Sender: put my current congestion window size and round-trip-time estimate in the header of each packet. The ACK for each packet comes back with a requested delta to my congestion window, computed along the path; adjust accordingly.
Receiver: simply copy the delta that has been computed along the way from the received packet into the ACK header.

Routers along the packet's path compute the delta. Each router can overwrite the header, so the packet should end up carrying the bottleneck's delta. Fairness and utilization deltas are computed separately, allowing different policies to be used for each one. The authors test multiplicative increase, multiplicative decrease for utilization with additive increase, multiplicative decrease for fairness.
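A much-condensed sketch of the router side, keeping only the shape of the computation (ALPHA and BETA are the stability constants the paper derives; everything else here is simplified):

```python
ALPHA, BETA = 0.4, 0.226   # stability constants from the paper

def aggregate_feedback(spare_bw, avg_rtt, queue):
    # phi = alpha * d * S - beta * Q: positive when the link has headroom
    # (window to hand out), negative when a persistent queue has built up.
    return ALPHA * avg_rtt * spare_bw - BETA * queue

def stamp(delta_in_header, my_delta):
    # Routers only ever reduce the allowance, so the most congested hop
    # on the path determines the sender's window change.
    return min(delta_in_header, my_delta)
```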

There are a number of simulations that show that XCP outperforms TCP in a variety of environments.

XCP does not deal with misbehaving senders any better than TCP; however, it is much easier to identify misbehaving senders with XCP. An edge router can send a delta ACK asking for a congestion window decrease. If the sender does not respond by decreasing to the new window, it is misbehaving and can be punished appropriately.

Thoughts

This was an interesting paper. Having to use dropped packets as a signal for congestion really limits the performance of TCP, especially in situations where the packet wasn't really dropped but was only assumed dropped, as can be the case in high bandwidth-delay product networks. This will also be true in things like wireless networks, where packets really are dropped, but due to a crummy link rather than congestion.

XCP seems to solve many of TCPs weaknesses quite well. There are a few concerns I have however.

Firstly, how to deal with a misbehaving router along the path of a packet is unclear. It seems this router could overwrite the congestion header and upset the whole algorithm. TCP will have problems with misbehaving routers as well, but XCP seems like it could suffer from more insidious problems here where, say, only one source has its congestion headers bumped up, leading to unfairness.

Secondly, per flow state is still needed for identification of misbehaving receivers. This may be unavoidable, but it is still unfortunate. Moving this computation to edge routers could alleviate the problem somewhat, as will increasing raw power in routers.

Finally, adoption seems difficult here. Although the authors suggest a way to have XCP and TCP co-exist fairly, it still requires software upgrades in a huge number of points, and only one hop on the path not supporting XCP means the whole connection needs to fall back to TCP. This path seems difficult to go down, but perhaps as bandwidth-delay products keep rising the switch will become essential.


Sunday, September 13, 2009

Random Early Detection Gateways for Congestion Avoidance

Summary
This paper describes a method for controlling congestion in a network. The technique is used on inner nodes or gateways and is designed to be easy to implement, require no per-connection state, and to be somewhat fair, in that it drops packets in a manner that is roughly proportional to a connection's share of the total bandwidth at the gateway.

The basic technique is to maintain an average queue length (an exponentially weighted moving average) and to operate in one of three modes based on two thresholds. If the average is below a min threshold then all packets are forwarded. If it is above a max threshold all packets are marked for congestion and/or dropped. In between the two thresholds the router randomly drops packets with a probability proportional to the average queue size.
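The whole mechanism fits in a few lines. A compact sketch of the drop decision; the parameter values are typical illustrative choices, not the paper's recommendations:

```python
import random

W_Q, MIN_TH, MAX_TH, MAX_P = 0.002, 5, 15, 0.02   # illustrative parameters

avg = 0.0
def on_arrival(queue_len):
    global avg
    avg = (1 - W_Q) * avg + W_Q * queue_len   # slow-moving average absorbs bursts
    if avg < MIN_TH:
        return "forward"
    if avg >= MAX_TH:
        return "drop"                          # or mark for congestion
    # Between the thresholds, drop probability rises with the average.
    p = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH)
    return "drop" if random.random() < p else "forward"
```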

This works because the queue will be aggressively kept below the max, but since the average moves slowly the gateway can handle short bursts of traffic. In addition, connections sending lots of packets will be more likely to have their packets marked/dropped, so the gateway is in some sense 'fair'.

The paper presents a number of simulations showing that RED does indeed control queue size, and that usage of the network is still quite high, even in the face of congestion. They discuss a number of parameters that can be set in the algorithm and their effects on the performance of the gateway. Some optimal values are suggested.

Discussion
I enjoyed this paper. The authors make a good case for the practicality of their approach (showing it's cheap to implement) and show that the algorithm works well. I think the number of parameters, and the sensitivity of the model to those parameters could be a barrier to adoption. Admins might be worried that if they tweak something a little wrong performance could become very poor.

The approach also doesn't explicitly deal with badly misbehaving clients. While those clients will be more likely to have their packets dropped, they don't get punished any extra for misbehaving.

Still, the elegance and simplicity of the approach make it quite appealing and I found this a worthwhile paper.

Wednesday, September 9, 2009

Core-Stateless Fair Queueing

Summary
This paper describes a way of approximating fair queuing (FQ) that does not require routers in the middle of the network to classify packets into a particular flow. It is called Core-Stateless Fair Queuing (CSFQ) as routers in the core of the network need to maintain no per-flow state. This reduces the processing requirements on these core routers, while still providing more fair service than first-come-first-served and coming close to the performance of the much more expensive FQ.

Packets must still be classified, but this is only done at the edges of the network, where we can (maybe) assume flows will be slower and smaller. In addition the arrival rate of the flow is computed at the edge using exponential averaging. Packets are annotated with the results of these computations and with an estimate of the link's fair rate.

At core (or inner) routers this information is used in a constant-time algorithm that probabilistically drops packets, based on the annotations of the arrival rate and the fair share rate. Intuitively, the more a flow's arrival rate exceeds its fair share rate, the more likely it is to have a packet dropped. The authors bound how far from optimal their strategy can diverge.
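The two computations are short enough to write out. The drop rule is, as far as I recall, exactly the paper's; the rate estimator below follows its exponential-averaging form, with K as the averaging constant:

```python
import math

def edge_rate(prev_rate, pkt_bits, gap, K=0.1):
    # Exponential average of a flow's rate, computed at the edge: recent
    # inter-arrival gaps dominate, old samples fade out.
    w = math.exp(-gap / K)
    return (1 - w) * (pkt_bits / gap) + w * prev_rate

def core_drop_prob(labeled_rate, fair_rate):
    # A core router drops with probability 1 - fair/arrival, using only
    # the label in the packet -- no per-flow state needed.
    return max(0.0, 1.0 - fair_rate / labeled_rate)
```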

The authors run a number of simulations with various queueing algorithms and show that CSFQ is much better than most other algorithms, and that it comes very close to deficit round robin, an efficient (but more expensive than CSFQ) FQ algorithm.

The paper also contains a fairly sizable discussion of what to do about misbehaving senders. They argue that TCP-unfriendly flows should be managed, by anything from simply restricting them all the way up to dropping all their packets. The argument for dropping unresponsive (to dropped packets) flows is that there exist applications that don't care how many packets they lose (so-called fire-hose applications) that will simply flood the network no matter what scheme is in use, unless that scheme actively punishes them for their behaviour.

Thoughts
This was a very nicely argued paper. The authors clearly state their assumptions (even admitting that FQ might not be necessary at all, but asserting that if it is, their work is vital) and work forward from there. I didn't check the details of the math in the paper, but I assume it is correct, and the results they derive seem reasonable.

It is a pity this scheme requires partitioning the routers in the network into edge and core. I suspect this would be difficult. Furthermore, any core would have to disallow any edge router that was not annotating, meaning a gradual upgrade to this scheme would be difficult if not impossible.

I also liked the discussion on misbehaving senders. This seems a real problem and I think active dissuasion is a good model for routers lest we end up with a million TCP stacks out there purporting to provide faster network access.

This paper was a bit heavy going at times, but a valuable read and one I would keep on the reading list.

Analysis and Simulation of a Fair Queueing Algorithm

Summary
This paper tackles the problem of building gateways that provide fair service in the face of congestion and possibly malicious end-points. Unlike the previous papers we read, this tackles the problem in the middle of the network, not at the end-points. That is not to say that end-point flow control shouldn't also be used; the two can be used in conjunction.

The authors argue that first-come-first-served (FCFS) is not an adequate algorithm for routers since a misbehaving source that can send packets fast enough can use an arbitrarily high amount of the available bandwidth. They propose fair queuing (FQ) as an alternative. Zhang's work on the Virtual Clock gateway queueing algorithm is cited as a similar approach.

Fair Queuing is basically the idea that we track the amount of data used by each "flow" (in this paper the authors choose to consider source-destination pairs as the unit of a flow) and split the available bandwidth evenly between each flow. This requires a per flow queue, such that if a particular flow sends another unit of data before its previous one has been sent it only adds to its own queue, causing drops during congestion.

The "most fair" FQ algorithm is a bit-by-bit round-robin (BR) approach in which one bit is sent from each flow each round such that each flow gets exactly the same amount of bandwidth. Since this is clearly infeasible the authors look at doing packet level queuing. The original propsal from Nagle did not take packet size into account, allowing users to game the system by sending large packets. The authors instead basically order packets by their finishing times (always picking the packet with the soonest finishing time to be sent next) which closely approximates BR.

A simple optimization is done for 'telnet' style connections. If a connection is inactive (as an interactive connection often will be during wait times) it is given a priority boost, providing better performance for interactive sessions.

The authors evaluate FQ vs. FCFS with a variety of flow-control algorithms and show that FQ provides better service and avoids congestion.

Thoughts
FQ deals nicely with the problem of misbehaving senders. They will simply overflow their own queues without affecting anyone else. It also clearly provides fairer allocations even when all senders are behaving, since there may be various different flow-control algorithms operating simultaneously and they all might send at different rates.

I had two main concerns while reading this paper. First is the added complexity it adds to routers, that would have to classify packets as belonging to a particular flow. This is non-trivial and would require considerably more powerful routers. Second is the (somewhat) arbitrary choice of source-destination pairs as the unit of flow. This can be gamed by opening connections to lots of receivers, or could unintentionally punish powerful machines that just happen to be hosting many users connecting to the same service. Still, I don't have a better solution.

Monday, September 7, 2009

Congestion Avoidance and Control

This paper describes changes to the BSD TCP stack made to improve its response to congestion in the network. The algorithm tries to maintain equilibrium, by which the authors mean a conservation of packets: no new packet is put into the network until an old one is taken out. Getting to this state without global knowledge is the focus of this paper.

There are a number of techniques used to achieve equilibrium. First is slow-start, in which a second "congestion window" (cwnd) is used to limit the number of packets that are sent initially. When sending, the min of the receiver's advertised window and cwnd is used. cwnd starts at one and is increased by one for each successful ACK. This increases the window size quickly, but avoids the problem where two hosts on fast networks have a slow link between them which would be overloaded by sending the full advertised window.
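In sketch form (segment-counted, with the receiver's window assumed fixed):

```python
cwnd, awnd = 1, 64          # segments; awnd is the receiver's advertised window

def usable_window():
    return min(cwnd, awnd)  # never exceed what either side allows

def on_ack():
    global cwnd
    cwnd += 1               # one per ACK: 1 -> 2 -> 4 -> 8 per round trip,
                            # quick growth, but it starts from a trickle
```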

The second technique is estimating the variance in the round-trip time of packets (gleaned from ACKs). This is necessary since, as the network becomes congested, variance will increase. If the sender responds to packets not getting ACKed fast enough by retransmitting, it will only overload the network more. The authors develop a cheap method for estimating variance, and a retransmit timer based on that estimate avoids unnecessary retransmits.
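The estimator is the now-familiar pair of fixed-gain averages (mean and mean deviation), with the retransmit timer taking both into account so it tracks variance rather than just the mean:

```python
def rtt_update(srtt, rttvar, sample):
    # Cheap EWMAs: gain 1/8 on the mean, 1/4 on the deviation; these
    # gains reduce to shift-and-add in a kernel implementation.
    err = sample - srtt
    srtt += err / 8
    rttvar += (abs(err) - rttvar) / 4
    rto = srtt + 4 * rttvar   # headroom grows with observed variance
    return srtt, rttvar, rto
```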

Finally the authors take the results from the increase/decrease algorithms paper and argue for additive increase and multiplicative decrease in window sizes. They solve the problem of delayed feedback by using dropped packets as an indicator of congestion, rather than an explicit bit.

This was a very well argued paper, and it has the additional benefit of being an actual implementation. The reasons for the decisions made are clearly set forth and only the assumptions the authors make can really be argued with.

Nevertheless I still have some issues with this paper. The paper does not deal with misbehaving nodes, which are always a problem in a heterogeneous network. Also, dropped packets are not always an indication of congestion. I think wireless was probably much less used when this paper was written than it is now, but I imagine there must be higher dropped-packet rates on a wireless network due to corruption than on a wired network, and as such the slow retransmit could mean that wireless networks are not used at their full capacity with this scheme. This paper also doesn't consider more ad-hoc networks in which each stage in a message transmission might be ACKed but never the whole path (i.e. in a sensor network). How well the RTT estimation would work in such a network is not clear.

Still, this seems a very fundamental paper and should be kept on the reading list.

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

This paper sets out a formal analysis of various strategies for avoiding congestion in a network. In particular they define the point where the network is operating at its most efficient (the 'knee' of the throughput vs. load curve) and try to find algorithms that will utilize the network at that point. They also have some additional criteria, namely that all users should be using any bottleneck in the network equally (i.e. the system should be fair), no global state should be needed (the system is distributed) and that the control policy should converge to the optimal state as quickly as possible (responsiveness) and once there, should oscillate around the optimal point as little as possible (smoothness).

The model that the authors consider is one of binary feedback. That is, for each unit of time a client is informed whether the network is underloaded (indicated by a congestion bit set to zero) or overloaded (congestion bit set to one). It is also assumed that this feedback will be timely and not delayed, the implications of which are not considered in the paper. Given the feedback, clients can either increase or decrease their utilization. The paper focuses on linear increase and decrease (i.e. adding a constant or multiplying by a constant), but does address the usefulness of non-linear approaches. Binary feedback is defended as being a simple approach that is actually possible to implement and that gives quite good results.

Through careful algebraic and graphical analysis the authors prove that the best policy should have an additive increase and multiplicative decrease, that responsiveness and smoothness are inversely related, and that non-linear approaches are not viable as they are too sensitive to system parameters.
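The fairness half of the result can be seen in a ten-line toy: two users start with very unequal windows, and AIMD against a shared capacity signal pulls them together (the capacity and constants below are arbitrary).

```python
x, y = 10.0, 1.0                      # two users sharing a bottleneck, unfairly
for _ in range(40):
    congested = (x + y) > 12          # toy binary feedback signal
    x = x * 0.5 if congested else x + 1
    y = y * 0.5 if congested else y + 1
print(round(x, 2), round(y, 2))       # nearly equal: multiplicative decrease
                                      # shrinks the gap, additive increase
                                      # never widens it
```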

The authors cite a number of related works on information flows and algorithms for dealing with them. Also relevant would be min/max flow algorithms in graph theory for determining the bottleneck, a problem the paper doesn't tackle.

I enjoyed this paper quite a lot. It gives a very intuitive idea of why the various parameters should be set as they are, as well as a rigorous mathematical proof of the results. The vector diagrams help to visualize the progress of the algorithms as well.

My primary concern was that the paper assumes all clients in the system will behave correctly. It seems that a client that simply floods the network would completely break this system and in a wide area network like the Internet malicious clients must be assumed to exist. There is no analysis of what effect such a client would have on the overall fairness and stability of the approaches in the paper.

It is also worrisome that prompt delivery of congestion status is assumed. In a congested network (the time when these algorithms are most needed) it seems very possible that notifications would be delayed and that these algorithms would fail. TCP's use of dropped packets as an indicator of congestion seems like a good way around this problem.

Overall a strong paper however, and one I would keep on the syllabus.

Tuesday, September 1, 2009

The Design Philosophy of the DARPA Internet Protocols

I found this a very interesting paper, primarily because it focused on a ranked list of the design goals for the IP protocol. Remembering the motivating factors in a design is an important part of evaluating the success of the design, and in deciding whether a new design is warranted or not.

The paper lays out the design goals of IP, whose primary purpose was to unite the existing networks at the time. There could have been a whole new network designed, but it was felt that leveraging the existing networks was a better choice; thus packet switching was selected as the best architecture. The paper also emphasizes survival in the face of failure as the primary goal, and from there justifies the fate-sharing decision that was made, in which all connection state is kept at the endpoints of a particular connection. Thus if an endpoint dies, the connection state is lost.

Secondly, the paper addresses the various protocols built on top of IP. TCP was originally thought to be The One protocol, but with voice as a motivating example we see that it isn't possible to provide all desired types of service within TCP. This led to the birth of UDP, and really history seems to have proved that between TCP and UDP most applications are covered.

Given the more philosophical and 'history of the Internet' slant of this paper I'm not sure what would count as related work or background material. I suppose an understanding of how IP actually works would be useful as this isn't really covered in the paper. Work on circuit switching would be useful for understanding why packet switching was a better choice given the goals set out.

I thoroughly enjoyed this paper and felt it gave an excellent look at why IP is the way it is. The fate sharing model makes a lot of sense, especially given the fact that if an endpoint dies the connection is probably useless anyway.

I do think some of the assumptions made at the time IP was designed could be questioned, given the recent rise of distributed systems, especially those running in totally controlled environments. What I mean is, in a distributed system we have a number of nodes, often nodes that are replicas of each other (i.e. provide the same functionality). It would be useful to be able to inform the transport layer (just in the controlled network) of replicas and have the layer route requests to any available node that could service the request. As it is, this needs to be done at a higher layer, which adds a great deal of complexity to any distributed system.

Overall a good paper, and one I would keep on the reading list.