Tuesday, November 24, 2009

BotGraph: Large Scale Spamming Botnet Detection

Summary
A method for sending spam that has become quite prevalent of late is having a bot running on an end host sign up for a web-based email account (Hotmail, Gmail etc) and then using that account to send spam from the end host. The traditional CAPTCHA protection for new account signups is being subverted by either routing the CAPTCHA to some other 'sketchy' site (e.g. asking users to solve a CAPTCHA to download pirated software) or simply paying people a trivial amount to solve them.

This paper looks at a way of identifying these bot accounts using the signup and access patterns gleaned from the Hotmail logs. They identify two main activities they consider suspicious (plus a third signal used only for evaluation):

  1. A sudden increase in the number of signups from a particular IP. The authors use an exponentially-weighted moving average (EWMA) to detect these sudden increases and limit the sign-up rate making it harder for spammers to obtain new accounts.
  2. The authors assume that there will be more accounts than machines available. Thus if multiple accounts are accessed from the same IP they stand a higher chance of being bot-created accounts. To reduce false positives (i.e. from machines changing IPs due to DHCP) they actually require the IPs to be from different ASes to count.
  3. Finally they look at the number of messages sent by each account. While this isn't used in their detection, they use it as part of the evaluation. The observation is that human accounts don't often send more than 3 emails a day (although most people I know send far more than that), and so accounts that send more are candidates for bot accounts.
The paper focuses on figuring out accounts that are accessed from IPs on different ASes. This is a huge computation and so they resort to distributed processing a la MapReduce to solve it. The idea is that accounts that are accessed with bot patterns will form large connected components in the user-user (vertex for each user) graph. Edge weights between accounts are assigned based on the number of shared logins across ASes. To begin they look at connected components with edge weight at least 2 and then keep increasing the threshold by 1 until no component has more than 100 members. Large connected components are considered likely to be groups of bot accounts.
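
To make the grouping step concrete, here is a tiny centralized sketch of it; the input format, the use of networkx, and the edge weight (simply the number of ASes two accounts share logins from) are my own simplifications, not the paper's distributed implementation.

```python
# A minimal, centralized sketch of the grouping step, assuming `logins` is an
# iterable of (account, asn) pairs gathered from login records. The pairwise
# loop below is exactly the quadratic work the paper distributes.
from collections import defaultdict
from itertools import combinations
import networkx as nx

def shared_login_graph(logins):
    ases = defaultdict(set)                      # account -> set of ASes it logged in from
    for account, asn in logins:
        ases[account].add(asn)
    g = nx.Graph()
    for (a, as_a), (b, as_b) in combinations(ases.items(), 2):
        weight = len(as_a & as_b)                # shared logins counted across ASes
        if weight >= 2:                          # weight-one edges are never useful
            g.add_edge(a, b, weight=weight)
    return g

def suspicious_groups(g, max_size=100):
    threshold = 2
    while True:
        pruned = nx.Graph()
        pruned.add_edges_from((u, v, d) for u, v, d in g.edges(data=True)
                              if d["weight"] >= threshold)
        groups = list(nx.connected_components(pruned))
        if all(len(c) <= max_size for c in groups):
            return groups                        # likely bot-account groups
        threshold += 1                           # tighten until no giant component remains
```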

The authors compare two methods for computing the above. The first partitions on client IP and looks for accounts logged into by that IP. This unfortunately requires sending lots of weight-one edges over the network, which is inefficient since those edges are discarded in the end anyway. The second partitions on account names and looks for which IPs logged into each account. This allows the weight-one edges to be pruned early, which makes it more efficient.

Comments
I thought this was an interesting approach, but I'm not sure how difficult it would be to subvert if spammers knew it was running. For example, it would be easy to throttle the number of account signups and slowly increase them, to defeat the EWMA detection. Secondly, it seems fairly trivial to tie each account to the AS that created it and only ever access it from there, thereby defeating the other line of defence. Obviously spammers don't do this at the moment as this system isn't deployed, but were it to be I'm sure they would adapt quickly. Perhaps a system based on IP prefixes (to not bias against DHCP) would be more effective. There is also the problem of many computers being behind a single IP, which this analysis will miss.

Still, I thought this was an interesting approach, and a nice use of a distributed framework for graph processing (which is basically what MR was originally designed for). Perhaps more subtle patterns and metrics could be used for a more difficult to circumvent system. I would keep this paper on the syllabus.

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

Summary
This paper tackles the problem of separating user generated activity (emails, web requests and clicks) from the same activities being performed by a bot infesting the computer. The authors leverage the Trusted Platform Module (TPM) to start verified and trusted code to perform the necessary operations on the end hosts for their system to work.

The basic idea of the system is that actual human users can't generate all that many events over time. For example, it is very unlikely that a human will want to send more than one email per second (although emails with huge CC lists seem like a problem for the system). With this observation, the Not-A-Bot system proposes to have software request an attestation (a token signed with a private key from the TPM) when it wants to perform an action (like send an email). The attester checks whether an action from an input device (keyboard or mouse) has happened within a threshold number of seconds (1 in the paper), and if so, grants the attestation, which can be included in a header of the request. A verifier on the receiving end can use this attestation to be more confident that the message or request is from a human and not a bot.

For added security, the content of the message is part of the attestation, so it is impossible for an attestation to be re-used by a bot to send multiple messages. Further, even though bots could simulate input actions, they could still only send at the reduced threshold rate.
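
To make the mechanism concrete, here is a toy sketch of an attester and verifier as I understand them; the class, the HMAC key standing in for the TPM-protected signing key, and the constants are all my own, not NAB's actual code.

```python
# Toy attester: only sign if keyboard/mouse input was seen within the last
# second, and bind the attestation to a hash of the content so it cannot be
# replayed across many messages.
import hashlib, hmac, time

GUARD_SECS = 1.0   # illustrative; the paper uses a ~1 second threshold

class Attester:
    def __init__(self, tpm_key: bytes):
        self._key = tpm_key              # in NAB this key never leaves the trusted attester
        self._last_input = 0.0

    def on_input_event(self):            # called from the (trusted) input path
        self._last_input = time.monotonic()

    def attest(self, content: bytes):
        if time.monotonic() - self._last_input > GUARD_SECS:
            return None                  # no recent human activity: refuse
        digest = hashlib.sha256(content).digest()
        return hmac.new(self._key, digest, hashlib.sha256).hexdigest()

def verify(tpm_key: bytes, content: bytes, attestation) -> bool:
    expected = hmac.new(tpm_key, hashlib.sha256(content).digest(),
                        hashlib.sha256).hexdigest()
    return attestation is not None and hmac.compare_digest(expected, attestation)
```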

The authors implement NAB using the Xen hypervisor to prevent access from the untrusted OS into the attester's address space. They strip down Xen to reduce the size of the codebase that needs to be trusted. The paper also covers (in somewhat laborious detail) the usage of the TPM to verify the code size of the attester and to obtain the private key.

To evaluate their system the authors collected activity data from a number of users at Intel and overlay honeypot-collected spam data. They show that NAB can suppress 92% of spam email, 89% of bot-generated DDoS requests, and 87% of fraudulent click-throughs. This seems to assume verifiers running everywhere.

Comments
I thought this paper took an interesting approach to solving the bot problem, but in the end I remain somewhat unconvinced of their methodology.

Firstly, even if bots could only send at (say) 1 email per second, there would still be a huge amount of bot generated email flying around given the sheer number of computers infected by bots.

Secondly, incremental deployment of this system seems difficult. If only a few end hosts are using NAB then all traffic needs to be treated as 'normal' anyway, since biasing against unsigned requests would certainly have very high false positive rates. If low numbers of servers are using verifiers then there is little gain to be had by running NAB on an end host and paying the extra costs.

Finally, the whole architecture seems a bit hard to swallow. It seems a bit much to ask users to run a hypervisor under their OS for this purpose, and without the hypervisor the system is open to attack. The attester is verified at launch, before the host OS, but if the host OS can get at the address space of the attester it could easily be replaced with something that always grants attestations, making the system useless.

This wasn't a bad paper, but given that the system seems a bit weak, and the lack of a serious networking component to the paper, I would probably remove this paper.

Thursday, November 19, 2009

Cutting the Electric Bill for Internet-Scale Systems

This was an interesting paper, and a bit of a shift from the types of papers we have read so far in the class. There are two main motivators for the paper:
  1. Internet-Scale systems are already widely replicated. This means major providers (the googles and amazons and so on) already have data centers in varied geographic locations and can route traffic for a particular application to any of a number of those data centers. In short, a request can be serviced (almost) equally by any of a number of data centers.
  2. Power does not cost the same amount in disparate geographic locations. The paper presents a lot of data on this front and determines that there is a large hour-by-hour pricing differential between energy markets.
Given the above motivators, it makes sense for an Internet-scale system to route requests to data centers where power is as cheap as possible. This of course assumes that keeping data centers at low load allows one to save power. This is called energy elasticity in the paper, and the authors claim that it is about 60% at the moment (that is, systems draw about 60% of their loaded power even when they are idle). This is not a great number, but this is an active area of research and one can probably safely assume that it will only get better. Systems that can actually turn off machines under low load could help to achieve perfect (0%) elasticity.
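
A back-of-the-envelope sketch of the routing idea, with invented prices and capacities (nothing from the paper's Akamai-driven simulation):

```python
# Send each unit of load to the data center with the cheapest current power,
# subject to capacity; the cost function shows why imperfect elasticity blunts
# the savings (idle machines still draw ~60% of peak).
IDLE_FRACTION = 0.60   # power drawn at zero load, as a fraction of peak

def hourly_energy_cost(price_per_kwh, peak_kw, utilization):
    return price_per_kwh * peak_kw * (IDLE_FRACTION + (1 - IDLE_FRACTION) * utilization)

def route(load_units, sites):
    """sites: list of dicts with 'name', 'price' ($/kWh), 'capacity' (load units)."""
    placement = {}
    for site in sorted(sites, key=lambda s: s["price"]):   # cheapest power first
        take = min(load_units, site["capacity"])
        placement[site["name"]] = take
        load_units -= take
        if load_units == 0:
            break
    return placement

sites = [{"name": "virginia", "price": 0.04, "capacity": 50},
         {"name": "california", "price": 0.09, "capacity": 80}]
print(route(70, sites))   # {'virginia': 50, 'california': 20}
```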

The authors use numbers from Akamai to run some simulations and show that significant savings are possible. One roadblock is the 95/5 billing model, where traffic is divided into 5 minute intervals and the 95th percentile is used for billing. This inhibits the amount of work the system can actually offload. By relaxing this the authors can do quite a bit better.

I thought this was an interesting paper and one I would keep on the syllabus.

Scalable Application Layer Multicast

The (at least currently correct) assumption in this paper is that we are not going to have in-network multicast anytime soon. Given this, the authors develop an application-level multicast scheme that tries to get as close as possible to 'ideal' in-network performance.

The approach this paper takes is to build an overlay network out of participating applications and route multicast traffic using this overlay. The protocol they develop for this purpose is called NICE.

In NICE nodes are arranged in layers. Each layer is partitioned into groups of nodes that are 'near' each other (distance defined by network distance and latency), and each group is assigned a 'leader', which is the graph-theoretic center of the nodes in the group. This leader also participates in the next layer up with all the leaders of the groups below, and a leader of that group is picked in the same manner. In this recursive fashion a tree is built which can be used to forward packets. To multicast a packet the leader of a group simply forwards it to everyone else in the group. If any of those nodes are leaders of groups below them they take the same action in those groups until everyone has the packet.
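
A rough sketch of that recursive fan-out, following the description above rather than the protocol's exact rules; the data structures and names are invented for illustration.

```python
# clusters maps a leader to the members of the cluster it leads at the layer
# below (the leader itself included). A packet handed to the top leader fans
# out layer by layer until every member has it.
def forward(node, pkt, clusters, already=None):
    already = already if already is not None else set()
    for member in clusters.get(node, []):
        if member == node or member in already:
            continue
        already.add(member)
        deliver(pkt, member)                   # stand-in for a unicast send
        if member in clusters:                 # member leads a lower cluster
            forward(member, pkt, clusters, already)

def deliver(pkt, member):
    print(f"deliver {pkt!r} to {member}")

clusters = {"L0": ["L0", "L1", "L2"],          # top-layer cluster led by L0
            "L1": ["L1", "a", "b"],            # L1 also leads a lower cluster
            "L2": ["L2", "c"]}
forward("L0", "pkt-1", clusters)
```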

The authors evaluate the stress (how many extra packets need to go over a link compared to an in-network scheme) and stretch (how many more hops packets take vs. in-network) and also compare themselves to Narada, another application-level multicast scheme. They come out ahead of Narada, but do impose significant overhead vs. in-network schemes. The stress of roughly 1.6 in steady state would be a non-trivial performance hit. Even worse, the path length is much longer in NICE than in IP multicast (about 23 hops vs. 15), which means many more chances to hit bottleneck links in the wide area.

I was also concerned about the join protocol. When a new node joins it finds a cluster close to it to join. If it is the center of that cluster it becomes the new leader. In the face of churn, many new leaders could generate a large amount of control traffic and an unstable tree. There was not really any evaluation of the robustness of NICE in the face of churn.

Wednesday, November 18, 2009

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

The reliable multicast framework paper is interesting in how it develops its model. The authors begin with an application (wb, a multiuser whiteboard) and from there attempt to factor out the traits they deem common to many applications that need multicast. Their overall approach is to keep as much functionality as they can in the client (as dictated by the end-to-end argument), doing only best-effort delivery and requiring the clients to implement reliability and packet-ordering enforcement.

The key contribution of this paper seems to be their mechanism for retransmitting lost packets. These losses are indicated by the clients, and the idea is to minimize the amount of extra data that needs to be sent. Clients detect a lost packet when they receive a packet with a higher sequence number than expected. If this happens, a timer is set based on the distance of the client from the source (further away = longer timer). When this timer expires a request for retransmission is sent. If the packet is received while waiting, the timer is canceled. In this way it is expected that the first host after a lossy link will request retransmission and that it will get a prompt reply from the host just upstream of the loss. A small amount of randomness is added to the timer to try and still only use one request when two hosts equidistant from the source both miss a packet. In theory the randomness will allow one host to request retransmission first, and since the response is multicast the other host will see it as well and cancel its timer.
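
The timer itself is simple enough to sketch; the constants below are illustrative stand-ins, not the paper's tuned values.

```python
# SRM-style repair request timer: back off proportionally to the estimated
# distance from the source, plus some randomness, and cancel the timer if a
# matching request or the repair itself is heard first.
import random

def repair_request_delay(distance_to_source_rtt, c1=1.0, c2=1.0):
    """Seconds to wait before multicasting a NACK for a missing packet."""
    return random.uniform(c1 * distance_to_source_rtt,
                          (c1 + c2) * distance_to_source_rtt)

# Usage in an event loop (pseudo): set a timer for repair_request_delay(d);
# if someone else's request or the retransmission arrives first, cancel it,
# otherwise multicast the request when the timer fires.
```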

This means hosts need to cache old data so they can respond to retransmission requests. The amount of state needed to make this all work is not completely clear, nor is the issue addressed in the paper. Still, it's an interesting approach, and the authors run a number of simulations showing that if all state is kept their protocol performs well.

Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework

Summary
The X-Trace paper presents a system for tracing causal paths through multi-host network architectures to help with debugging distributed applications. The authors identify the problem that modern systems are often made up of many, heterogeneous hosts and that a request must traverse a number of these hosts in its lifetime. Finding out why a given request is not performing as expected requires investigating all the hosts along that path, a difficult and time-consuming task.

X-Trace proposes to deal with this problem by adding a small amount of metadata in-band with each request. Each participating host must be modified to append its own data to the request; for example, an HTTP proxy adds some metadata, then pushes the request to the appropriate host, which adds its own data, and so on. The recursive nature of this propagation assumes no particular architecture a priori. Once the request is finished, the total metadata can be examined and used to build the resulting task tree. This tree can be used to discover how much time is being taken at each component, and/or to identify any failures along the path at a particular component.
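
A toy sketch of the propagation (not the real X-Trace library): each hop keeps the task id constant, mints a new operation id, and records the operation it came from, which is enough to stitch the reports into a task tree offline. All names here are my own.

```python
import uuid

class XTraceMetadata:
    def __init__(self, task_id=None, parent_op=None):
        self.task_id = task_id or uuid.uuid4().hex   # constant for the whole request
        self.parent_op = parent_op                   # the operation we arrived from

def push_next(md, component, reports):
    """Called at each participating component before handing the request on."""
    op_id = uuid.uuid4().hex
    reports.append({"task": md.task_id, "op": op_id,
                    "parent": md.parent_op, "component": component})
    return XTraceMetadata(md.task_id, parent_op=op_id)

reports = []
md = push_next(XTraceMetadata(), "http-proxy", reports)
md = push_next(md, "web-server", reports)
md = push_next(md, "db", reports)   # reports now encode the proxy -> server -> db chain
```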

The paper then evaluates X-Trace in three example situations
  1. A web request with recursive DNS queries. Using X-Trace the authors are able to follow the recursive DNS query and isolate faults along the way. This also allowed them to see which cache a result was coming from, and so understand why out-of-date addresses were still being returned.
  2. A web hosting provider. Using X-Trace allowed for determining that a request was failing, say, in a PHP script, not in a DB query.
  3. An overlay network. Probably the most complex scenario: X-Trace allows for distinguishing between a number of possible faults in Chord, including the receiving node failing or a middlebox failure, whether a particular process crashing or the whole host going down.
Comments
I like this paper a lot. It gives a clear outline of the problem and proposes a useful and practical solution. I've used X-Trace in a couple of projects and it most certainly helps the debugging of a distributed application. In one case we were confused about why a request was taking so long. X-Trace allowed us to pinpoint the location in some client code that was doing a slow type conversion. There were at least 4 other systems involved in the request and each one would have to have been examined without the X-Trace data.

There is some concern that requiring modification to the client will hurt adoption. While this might be true, it's hard to see how something like X-Trace could work without such modification. I also think inter-node communication libraries like Thrift might help the situation. We've added X-Trace support to the Thrift protocol, making the actual client modifications a trivial one-line change.

I would keep this paper on the syllabus.

NetFPGA: A Tool for Network Research and Education

Summary
Falling somewhere between pure software solutions like Click and completely custom hardware, NetFPGA gives students and researchers a programmable hardware platform for designing network software. This allows users to actually interact with the hardware and deal with any associated issues, without actually having to build the hardware itself.

Users program the NetFPGA remotely by uploading programs which they can then send data to and receive data from. The devices can also be linked into the Stanford network so they can carry real data. Using this platform students have built IP routers, making it clearly a good learning tool. To enable more complex research, the new version has an on-board CPU (the previous version did not, and was controlled by sending special packets from a remote controller), which is being used, for example, to build a 'shunt' that allows low-cost intrusion detection by offloading to the detector only the packets that really need to be examined.

Comments
There doesn't seem to be a whole lot of downside to this project. I do wonder what performance is like compared to a custom hardware router and Click, but this doesn't seem like a huge issue on a research platform.

Tuesday, November 10, 2009

A Policy-aware Switching Layer for Data Centers

Summary
This paper aims to solve a problem in data-center design. Currently, if an admin wants to ensure that traffic flows through a particular set of boxes (say a firewall, then a load-balancer) there isn't a standard and simple way to do this. Either the boxes have to be physically interposed, which often isn't possible, or nasty tricks like messing with link weights need to be used. The latter doesn't enforce a flow policy either, since link failure could still allow packets to avoid flowing through a particular box.

To solve this problem the authors propose a policy-aware switching layer. The basic idea is that a central location specifies declaratively what route packets should take. This specification is compiled into a set of rules that specify the next hop for a particular type of packet, given its previous hop. Switches are modified to check this information, consult their rule table, and forward packets appropriately. Middleboxes (firewalls, load-balancers etc) are taken off the physical network path. This both prevents unnecessary traffic to those boxes, and makes the addition of new boxes fairly simple.
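
A minimal sketch of what I imagine the compiled rule table looks like; the rules, names, and traffic classes are invented for illustration.

```python
# Next hop is a function of (traffic class, previous hop), so the firewall ->
# load-balancer ordering is enforced even though neither box sits on the
# physical path.
RULES = {
    ("web", "core-switch"):     "firewall-1",
    ("web", "firewall-1"):      "load-balancer-1",
    ("web", "load-balancer-1"): "web-servers",
}

def next_hop(traffic_class, previous_hop):
    try:
        return RULES[(traffic_class, previous_hop)]
    except KeyError:
        # safer to drop than to let a frame bypass a middlebox
        raise ValueError("no policy for this frame")

print(next_hop("web", "firewall-1"))   # load-balancer-1
```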

To allow unmodified middleboxes to plug in, frames are encapsulated and decapsulated at the switches so the boxes only see standard Ethernet frames.


Comments
I liked the motivation for this paper. The current situation is clearly problematic and this paper proposes an elegant solution to the problem. The main issue I see is the increase in overhead and latency that this introduces. While the numbers from the paper might be acceptable for external applications requesting data from inside the data-center, apps within the data-center, especially those that are network intensive, would probably be quite hurt by the added overhead of this system.

Still, this was an interesting paper and one I would keep on the reading list.

Internet Indirection Infrastructure

Summary
This paper develops a nice use of a DHT, that of Internet-wide indirection routing. The core problem this addresses is that the point-to-point nature of IP makes it non-ideal when we have mobile hosts, and also for multicast or anycast. The i3 architecture allows these to be expressed quite naturally.

The basic idea of i3 is that senders send data to an identifier, rather than to a particular address. Receivers express interest in data by publishing a trigger for a particular identifier into the i3 layer. When data is sent, i3 can quickly look up all receivers interested in that identifier, along with their IP addresses, and route the data to them.

This routing is achieved by maintaining routing tables at nodes in a Chord ring. Identifiers are matched against triggers by prefix, and the paper requires that all identifiers sharing a prefix be stored at the same node (thus allowing fast local prefix matching). When data is sent to a particular id, it is forwarded to the responsible Chord node, target hosts are determined locally at that node using longest prefix matching, and the data is then forwarded on.
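
A toy model of the rendezvous, ignoring Chord routing entirely and using short bit strings instead of 256-bit identifiers; the matching rule below is my own simplification of the paper's prefix matching.

```python
from collections import defaultdict

triggers = defaultdict(list)            # trigger identifier -> [receiver addresses]

def insert_trigger(identifier, addr):
    triggers[identifier].append(addr)

def prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def send(identifier, data, min_match=4):
    if not triggers:
        return
    best = max(triggers, key=lambda t: prefix_len(t, identifier))
    if prefix_len(best, identifier) < min_match:
        return                          # no sufficiently close trigger: drop
    for addr in triggers[best]:
        print(f"forward {data!r} to {addr}")    # stand-in for IP forwarding

insert_trigger("10110010", "10.0.0.5")  # a mobile host re-inserts this after moving
insert_trigger("10110010", "10.0.0.9")  # a second trigger on the same id gives multicast
send("10110010", "hello")
```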

To support mobility, mobile receivers simply publish their new address under the same identifier and packets will automatically be routed to them. For multicast, many hosts can publish triggers for the same identifier and the responsible forwarder node will send a copy to each of them. Likewise for anycast.

For more flexibility i3 allows data to be sent to stacks of identifiers, allowing packets to go to a number of i3 nodes before their final destination. This allows, for example, load balancing multicast requests by having them go through a tree of nodes rather than overloading the single node responsible for that prefix.

Comments
I liked this paper. It is a clever use of a DHT, and the architecture is explained well. The motivations and uses for i3 are also well covered. However, I was disappointed with the evaluation. There was no actual comparison with any other system. Clearly packets have to travel further in i3 than in IP, so a comparison against standard IP routing to determine the overhead of i3 should really have been performed.

I also feel the multicast/anycast should behave a bit differently than simple packet routing. i3 could be used almost like DNS for simply unicasting packets, where end clients cache IPs and send directly, while multicast/anycast would always need to go through the i3 layer.

This was an interesting paper and I would keep it on the syllabus.

Thursday, November 5, 2009

DNS Performance and the Effectiveness of Caching

Summary
This paper looks at how well DNS performs, and at the effectiveness of caching in the system. The authors use three nice network traces, two from MIT and one from KAIST in Korea, in which they collected all outbound DNS queries and their responses, and also all TCP SYN, FIN and RST packets.

The main findings of the paper are:
  • Host names fall into a Zipfian distribution (with the popular hostnames being looked up most of the time), which means that short TTLs should still be effective, since the names worth keeping in cache will be looked up again soon. This also means that sharing a cache between clients doesn't help much, as the contents will be largely the same anyway.
  • Caching helps latency and works well since there is not much churn in the mappings of hostnames to IPs
  • Resolvers re-try too often and overload the network. A whopping 63% of the DNS query packets from the traces were from queries that got no answer, which indicates a much too aggressive re-try policy
Comments
It would have been nice to have a few more traces in this study. Also, why was UDP traffic ignored? The rise of file sharing since the writing of this paper has also almost certainly skewed the distribution of traffic away from web traffic and it would be interesting to revisit some of the results in this paper with that context.

The Development of the Domain Name System

Summary
This paper describes the design of DNS, which was needed as the old hosts.txt way of doing name resolution was becoming unwieldy. A decision was made to keep the new system simple (rather than a full ACID distributed DB) as the designers felt this would ease adoption.

In DNS each name is associated with a group of resource records with known types (today I imagine everything is IP). To find these records there are resolvers and name servers. Resolvers know how to query the system to resolve a name, name servers store the actual data.

DNS forms a tree, with the root nameservers at the root(s). This tree can be split into zones, where zones are usually simple subtrees in the overall structure. The administration of the tree starts at the top (so with root admins) and then at the 'root' of a zone, admin responsibilities change.

Comments
I enjoyed this paper. The authors give a good description of the system, and the rationale behind it. However, the focus on keeping things simple doesn't really seem to have achieved it, and DNS today is actually quite complex and error prone. Perhaps a more robust structure at design time would have slowed adoption, but given the inevitability of a distributed name service it might have been worth having a system that would be easier to administer.

I also found it interesting that in 1983 each root server only handled about 1 query per second. This is an absurdly low query rate, and it actually reflects well on the DNS design that it can still work today with much much higher query rates.

Tuesday, November 3, 2009

Looking Up Data in P2P Systems

Summary
This paper takes a very high level look at a number of different DHT systems. It first motivates the problem: we need distributed key lookup in the face of shifting nodes all administered by different users. We want to minimize both the number of hops it takes to find a key's true location and the amount of state each node needs to maintain. In general the approach is to have the nodes organize themselves according to some geometry and then have each node know how to route a particular key request 'closer', in the geometry, to the node that's responsible for it. There are a number of possible geometries and the paper looks at some of the most common:

Chord: forms a ring and each node is assigned a position on the ring. Nodes maintain information about other nodes 1/2, 1/4, 1/8 of the way around the ring, and so on. This way a key can be found in log(n) lookups. Nodes also maintain their immediate successors on the ring so in the worst case (all the more distant nodes are down) a request can just hop along one node at a time toward its location. (A small sketch of this finger-table lookup follows this list.)

Pastry/Tapestry/Kademlia: these use tree schemes in which each node knows of other nodes with particular prefixes (0, 1, 00, 01...). Pastry maintains a leaf set of close nodes that it can route to, and it can always route as long as more than half of those nodes don't fail.

CAN: forms a hypercube and nodes are responsible for a region of the cube. Nodes maintain information about their neighbors in the cube and route along the straight line between their location and the location of the key.
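
Here is the Chord-style finger-table lookup mentioned above, purely illustrative (no joins, failures, or successor lists): each node keeps fingers at power-of-two distances around a 2^m ring, and a lookup repeatedly jumps to the furthest finger that does not overshoot the key, giving O(log n) hops.

```python
M = 8                                    # identifier space of 2^M positions

def between(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def fingers(node, nodes):
    """node's finger table: the successor of node + 2^i for i in 0..M-1."""
    ring = sorted(nodes)
    def successor(k):
        return next((n for n in ring if n >= k % (2 ** M)), ring[0])
    return [successor(node + 2 ** i) for i in range(M)]

def lookup(key, start, nodes):
    node, hops = start, 0
    while not between(key, node, fingers(node, nodes)[0]):   # finger[0] = successor
        # jump to the furthest finger that still precedes the key
        node = max((f for f in fingers(node, nodes) if between(f, node, key)),
                   key=lambda f: (f - node) % (2 ** M),
                   default=fingers(node, nodes)[0])
        hops += 1
    return fingers(node, nodes)[0], hops

nodes = [5, 20, 60, 130, 200, 240]
print(lookup(150, start=5, nodes=nodes))   # key 150 lives at its successor, node 200
```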

Comments
This paper was very high level. If I didn't already know how Chord and CAN worked I don't think I would now, and I don't feel that my basic understanding of the tree-based systems was much improved either. It does present a good quick motivation for the problem, and perhaps as a first introduction to the topic it is a useful paper.

The main thing about multi-hop DHTs is that they seem like an obviously useful structure but they haven't actually found all that many uses. Aside from DNS, which isn't really the sort of thing this paper is describing, I'm not aware of any major multi-hop DHT deployments. One-hop DHTs, on the other hand, have proven quite useful in systems like Dynamo or SEATTLE. Perhaps the inherent uncertainty in the number of hops for routing in this style of DHT makes them unattractive for real systems.

While this wasn't a bad paper, I would think that something that took a closer look at a particular type of DHT and some applications for it might be a better fit for the class.

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

The Chord paper was fun for me to re-read as it's actually a system I've implemented. Having done a few paper-to-actual-system conversions, I will say that the Chord paper does an excellent job of laying out everything an implementor might need, from good pseudo-code to detailed discussion of how the various components of a Chord system should interact. The paper does make the system sound quite simple and easy to implement, which is not entirely true, but the clear description certainly helps.

At a more general level Chord is important as it's an approach to DHTs that is considerably simpler than the other models that were being proposed at the time. The one-dimensional ring is much easier to reason about than CAN's hypercube, for example. The observation that efficient log(n) routing can be performed with such a simple structure is the most important contribution of this paper. The fact that current production DHTs like Cassandra and Dynamo use ring structures (albeit with different routing algorithms), coupled with the analysis from the other paper we read this week, points to this being the 'right' choice in the DHT space.

This paper doesn't have a lot to criticize. They present an elegant algorithm, good performance results, and analyze the system in failure/churn situations. Where I think this paper misses the mark, as did all the early DHT papers, is in assessing their target market. Early DHT work saw itself as aimed at the P2P community where it could replace Gnutella or Napster. This was misguided for two reasons: firstly, BitTorrent was already pretty much taking over the P2P space, and secondly, exact key lookups do not map well onto the P2P space. People want keyword search and DHTs do not provide a simple way to make this happen. Systems like Cassandra or Dynamo show that DHTs are better suited as large data stores in cluster environments, where admins have total control of all nodes in the system and can therefore use more restrictive and efficient routing algorithms.

Tuesday, October 27, 2009

On Power-Law Relationships of the Internet Topology

Summary
This paper attempts to derive some power laws for the inter-domain topology of the Internet. These are relationships of the form y ∝ x^a, relating some property of the graph of node connectivity to another. These relationships can be used to check the validity of models of the Internet used in simulation, as well as to predict the growth of the network.

The authors use three data sets for their study, collected by the National Laboratory for Applied Network Research, about six months apart, the first collected in November 1997. The information is gleaned from a route server from the BGP routing tables of some distributed routers connected to that server. There is also a router level graph used, to provide data at a different level.

There are a few definitions used in the laws:
  • rank - the rank of a node is its place in the list of nodes sorted by decreasing outdegree
  • frequency - the frequency of an outdegree d is the number of nodes with outdegree d

The authors derive three main power-laws:

  1. The rank exponent - This law states that the outdegree of a node is proportional to the rank of the node to the power of a constant R. R is a single number that can characterize an entire graph. Indeed, in the sample data the R value is similar for all three inter-domain graphs. This suggests that a graph that has a quite different value of R is not a good representation of Internet topology.

  2. The outdegree exponent - This law states that the frequency of an outdegree d is proportional to the outdegree to the power of a constant O. Again the inter-domain graphs all exhibit very similar O values, indicating a common structure.

  3. The hop-plot exponent - This law states that the total number of pairs of nodes within h hops is proportional to the number of hops to the power of a constant H. The authors argue that since the interdomain graphs all have a similar value for H, but the router graph has a different one, H is a good way of characterizing families of graphs.
Finally the authors argue that using power-laws is a better way of characterizing graphs of the Internet than previous metrics (such as averages) because the underlying distributions are heavily skewed. For example, 85% of nodes have an outdegree less than the average in one of their graphs.
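
Since y ∝ x^a means log(y) is linear in log(x), exponents like the ones the paper reports can be estimated with a least-squares fit on a log-log plot. A quick sketch with made-up data (not the NLANR datasets):

```python
import math

def power_law_exponent(xs, ys):
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return slope          # the fitted exponent (e.g. the rank exponent R)

# rank-exponent style data: outdegree of the i-th ranked node (invented numbers)
ranks = [1, 2, 4, 8, 16, 32]
outdegrees = [1000, 480, 260, 120, 60, 31]
print(power_law_exponent(ranks, outdegrees))   # roughly -1, i.e. R is about -1
```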

Comments
I thought this was an interesting paper. It presents a concise and simple derivation of the various laws, and does a good job explaining why their results might be useful. I'm concerned that all the laws are derived from graphs from a single source, however. This could indicate some fundamental properties of the particular connections the source saw, but not properties of the Internet as a whole. One hopes the chances of this are minimized by the collection points being distributed geographically.

Some of their derivations also relied on too little data, in my opinion. For example the hop-plot fit has only four points in the interdomain graphs, which seems far too few to derive any real relationship.

Finally, given the time I'd like to look up the actual present-day values for the quantities the paper predicts, but my Google skills are failing me at the moment. All in all, an interesting paper, and one I'd keep on the syllabus.

Monday, October 26, 2009

On Inferring Autonomous System Relationships in the Internet

Summary
This paper presents an overview of some BGP concepts, some formal definitions of properties of graphs of Autonomous Systems (ASes), and an algorithm for heuristically inferring the relationships between them. These relationships are often not made public as they are business relationships, so the ability to infer them helps us gain a deeper understanding of Internet topology.

An AS graph is built from public BGP routing tables. The graph's edges are classified into one of three types:
  • provider-to-customer - the only directed edge, indicates a relationship in which the provider provides service to a paying customer
  • peer-to-peer - edges indicating peering agreements between ASs
  • sibling-to-sibling - edges between siblings
By splitting the graph into trees the authors are able to work out uphill and downhill paths. Uphill paths are made up of customer-to-provider or sibling-to-sibling edges; downhill paths are made up of provider-to-customer or sibling-to-sibling edges. By looking at maximal paths, and noting that providers should have a higher degree in the graph, we can infer what relationships exist between ASes. Information about which ASes provide transit to other ASes can be used to discover peering relationships.
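
A crude sketch of the degree heuristic, not the paper's full algorithm: within one BGP AS path, treat the highest-degree AS as the top provider, label the edges before it customer-to-provider (uphill) and the edges after it provider-to-customer (downhill). The example degrees are invented.

```python
def annotate_path(as_path, degree):
    top = max(range(len(as_path)), key=lambda i: degree[as_path[i]])
    edges = []
    for i in range(len(as_path) - 1):
        a, b = as_path[i], as_path[i + 1]
        label = "customer-to-provider" if i < top else "provider-to-customer"
        edges.append((a, b, label))
    return edges

degree = {"AS100": 3, "AS200": 40, "AS300": 900, "AS400": 12}
print(annotate_path(["AS100", "AS200", "AS300", "AS400"], degree))
# AS100->AS200 and AS200->AS300 go uphill; AS300->AS400 goes downhill
```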

The authors test their inferred relationships on Route Views data, and verify it by asking AT&T to confirm their results. They classify 90.5% of edges as provider-to-customer, less than 1.5% as sibling-to-sibling edges, and less than 8% as peer-to-peer edges. The AT&T verification shows that the majority of their edges are classified correctly. 100% of their customer relations are correct, which is already 90.5% of all the edges they classified. However, the provider info is wanting. They infer that AT&T has one provider while in reality it has none, meaning they get 0% on that metric. The sample size isn't really big enough to tell us anything about the efficacy of their provider inference, however.

Comments
While this paper is clearly quite an achievement, I did not find it all that interesting. We've already looked at BGP so that part of the paper didn't give me much, and the rest of the paper is pretty heavy theoretical going without much of a payoff.

I also found the verification of the result pretty unsatisfactory. While I realize that AT&T cannot release real data to show more detailed results, the very high level percentage-correct figures left me wanting more. The 0% correct result on provider inference is a pretty glaring example of the problem with this verification method.

This felt like a paper more useful to someone deeply involved in this particular area, and less well suited to a general networking class. I would remove it from the syllabus.

Tuesday, October 20, 2009

White Space Networking with Wi-Fi like Connectivity

Summary
This paper tackles the problem of harnessing the newly opened up UHF spectrum for data transfer. This presents a number of problems not present in the 2.4GHz band:
  • The transmission range is bigger, so there is more spatial variation with respect to a single AP
  • Existing users of the spectrum fragment the space, making the problem of picking a channel for the AP more difficult.
  • Because the FCC states that devices must give up a channel to incumbents (like wireless mics), and because these devices can come and go, systems must be able to quickly switch channels and do so automatically and without interfering with the incumbent device.
The paper presents two main ideas: WhiteFi, a WiFi-like system that runs in the UHF white spaces, and SIFT, a fast way of locating WhiteFi signals across the entire UHF band.

The idea behind SIFT is to do the analysis on the time-domain signal itself, rather than doing an expensive FFT first. By looking for blips in the signal that are temporally spaced at the right intervals to look like WhiteFi packets and ACKs, it can quickly detect where APs are likely to be.
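
A very rough sketch of that intuition (the real algorithm works on SDR samples and handles variable channel widths); every threshold and duration below is invented.

```python
# Look for pairs of energy bursts in the raw amplitude trace whose durations
# and spacing match "data packet, short gap, ACK".
def looks_like_whitefi(samples, thresh=0.5, pkt_len=100, ack_gap=10, ack_len=20, tol=5):
    bursts = []
    i = 0
    while i < len(samples):
        if samples[i] > thresh:                       # start of an energy burst
            start = i
            while i < len(samples) and samples[i] > thresh:
                i += 1
            bursts.append((start, i - start))         # (start index, duration)
        i += 1
    for (s1, d1), (s2, d2) in zip(bursts, bursts[1:]):
        gap = s2 - (s1 + d1)
        if (abs(d1 - pkt_len) < tol and abs(gap - ack_gap) < tol
                and abs(d2 - ack_len) < tol):
            return True                               # packet-then-ACK shaped pattern
    return False
```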

WhiteFi uses SIFT for AP discovery but also establishes a secondary backup channel. If an incumbent is detected on the main channel a client or AP will immediately switch channels and 'chirp' on the backup channel to inform peers of the change. If the chirp is lost SIFT can be used to re-acquire the AP.

Comments
This paper was certainly interesting, and the prospect of using UHF white spaces for networking is quite attractive. A few issues I had:
  • The authors discover that even a single packet can cause interference for wireless mics. If nodes are streaming data, there is no way that the incumbent-detection time plus the chirp-and-switch time is going to prevent a few packets from still being transmitted on the old channel. This seems to violate the FCC requirement of no interference.
  • The analysis only tested UDP flows. As we've seen before, TCP is much more sensitive to unusual network environments and it seems important to evaluate its performance in WhiteFi.
  • Given the long range in the UHF band it would have been nice to see numbers about how much more area can be covered with how many fewer APs compared to traditional WiFi as a motivator for this work.
This was still a cool paper, and gets into some of the signal processing problems we haven't talked about yet, so I would keep it on the reading list.

Monday, October 19, 2009

Interactive WiFi Connectivity For Moving Vehicles

Summary
This paper looks at a scheme for improving WiFi performance when the clients are in motion, for example cars or buses that will pass by a number of base stations. The authors develop ViFi, a protocol that achieves good performance in these situations.

The paper begins by explaining why the mobile environment is difficult for WiFi applications. Handoff, in which a client changes base stations, is the crux of the problem. Typically clients must perform a 'hard handoff', which means picking a new base station and leaving the old one. The authors evaluate a number of methods for switching but show that none perform as well as a scheme that can opportunistically use multiple base stations simultaneously.

From this observation the paper develops ViFi. The basic idea is that clients pick one base station to be an anchor in much the same way a base station is picked normally. 802.11 broadcast is used, which means no link-layer ACKs, but ViFi introduces its own ACKs. If a base station that is not the anchor overhears a packet but not the ACK for that packet, it will relay the packet. The relay is performed over the 'base station backplane', which is either a mesh network or wired, depending on the deployment.
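
A sketch of that relay rule from the point of view of an auxiliary (non-anchor) base station; the class and timings are mine, and my understanding is the real protocol also takes care to limit duplicate relays across base stations.

```python
import time

class AuxiliaryBS:
    def __init__(self, relay_delay=0.01):
        self.overheard = {}             # packet id -> time we overheard it
        self.acked = set()              # packet ids whose ACK we also overheard
        self.relay_delay = relay_delay  # grace period for the ACK to arrive

    def on_overhear_packet(self, pkt_id):
        self.overheard[pkt_id] = time.monotonic()

    def on_overhear_ack(self, pkt_id):
        self.acked.add(pkt_id)

    def packets_to_relay(self):
        now = time.monotonic()
        return [p for p, t in self.overheard.items()
                if p not in self.acked and now - t > self.relay_delay]
```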

The authors run a number of experiments in a real deployment at Microsoft, and also simulations based on data from a system in Amherst, MA. These show that ViFi can perform much better than standard 'hard-handoff' schemes.

Comments
I enjoyed this paper. The idea seems a fairly obvious one, but the authors present a rigorous analysis showing why using multiple BSs is a good idea, and how much better their system can perform.

Their workload did seem a bit arbitrary, however, and it would have been nice to see a more varied set of workloads. They also dismiss the problem of packet re-ordering by saying it would be easy to buffer and order packets at the anchor, but we've seen that this can wreak havoc with TCP's RTT estimates and it's not clear how well this would actually work.

Still, a nice approach and a good paper that I would keep on the reading list.

Wednesday, October 14, 2009

XORs in the Air: Practical Wireless Network Coding

Summary
I really liked this paper, mostly just because it's a very clever idea. The paper develops COPE, an algorithm that again takes advantage of the broadcast nature of wireless radios, but this time by packing more data into a single transmission. This is done by maintaining a packet buffer at each node, and by knowing (or guessing) what packets nearby nodes have in their buffers. This can be known with certainty through 'reception reports', or guessed based on expected transmissions given the routing metric. Once I know what my neighbors have I can look at the next packet I need to transmit. If I can find n packets such that my neighbors each have n - 1 of those packets (not necessarily the same n - 1 packets) then I can XOR those packets together to send multiple packets in the space of one 'native' packet. Any neighbor with n - 1 of the packets can XOR again to get the native packet it doesn't have back out. This increases the throughput of the network without changing anything else.
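
A sketch of that coding check, using my own simplified data structures: packets can be XORed together only if every intended next hop already has all of the packets except the one meant for it, so it can decode.

```python
from functools import reduce

def can_code_together(packets, neighbor_has):
    """packets: list of (pkt_id, payload, next_hop); neighbor_has: next_hop -> set of pkt_ids."""
    ids = {p[0] for p in packets}
    return all(ids - {pid} <= neighbor_has[nh] for pid, _, nh in packets)

def encode(packets):
    length = max(len(p[1]) for p in packets)
    padded = [p[1].ljust(length, b"\0") for p in packets]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), padded)

neighbor_has = {"B": {"p2"}, "C": {"p1"}}
pkts = [("p1", b"hello", "B"), ("p2", b"world", "C")]
if can_code_together(pkts, neighbor_has):
    coded = encode(pkts)   # B XORs with p2 to recover p1; C XORs with p1 to recover p2
```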

Each node maintains a queue of packets to send out. This could be re-ordered to provide more XOR-ing opportunities, but this isn't done since it would create an out-of-order problem for TCP. COPE maintains two virtual queues per neighbor, one of small packets and one of large packets. When it needs to send a packet it tries to find other similarly sized packets to XOR with it (since XOR-ing small packets with large ones doesn't provide as much bandwidth savings). If appropriate packets are found, they are XOR-ed in and sent.

Since the standard 802.11 MAC broadcast doesn't do ACKs, the authors also introduce pseudo-broadcast, which uses 802.11 unicast but adds an XOR header listing all intended recipients so that any overhearing node can figure out if it was also an intended recipient. The packet is retransmitted until the primary recipient ACKs it.

The authors give proofs of the bandwidth gain that can be achieved for certain network topologies. Most importantly they show that in the presence of opportunistic listening the gain is unbounded. This explains the 3x-4x increase in performance they see when running UDP tests on Roofnet. Interestingly they don't see much of a performance boost when running TCP tests. They believe this is because TCP backs off too much when collisions occur, and that collisions are common due to hidden terminals. They validate this by showing that TCP performs well with COPE when there are no hidden terminals.

Comments
I thought this was a well argued paper. It gives a good overview of a clever design, proves some interesting theoretical properties of the design, and then runs actual experiments on an implementation.

A few gripes:

I didn't like the graphs they chose to use; they are rather difficult to read and not particularly illuminating. Bar charts of throughput feel like they would have been much more useful.

The hidden terminal problem feels a lot like Incast to me. Would reducing RTOmin be an effective way to deal with this problem? RTTs are probably higher in the wireless environment, but there might be some gain to be had. It's all well and good to say we help TCP if there are no hidden terminals, but hidden terminals will always be a problem in the real world.

ExOR: Opportunistic Multi-Hop Routing for Wireless Networks

Summary
The ExOR paper is an approach to leveraging the broadcast nature of wireless radios to achieve better performance. For each hop, the sending node computes a priority-ordered list of the other nodes it thinks should forward the packet. The priority is based on a metric similar to ETX, that is, the expected number of transmissions from that node to the receiver, including retransmissions. When nodes receive a packet, they look at the list of nodes that are supposed to be the next transmitter of the packet. If they are in the list they schedule it for transmission. Each node waits the amount of time it expects higher priority nodes to take to transmit their packets (packets are batched for better performance), and then, if it hasn't heard the packet(s) already transmitted, sends its batch. The intuition here is that all nodes in the forwarder list should be 'closer' to the destination, so they should all be able to overhear each other if they heard the previous hop.
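
A heavily simplified sketch of that per-node decision (the real ExOR operates on whole batches and a shared batch map); the names and slot length are invented.

```python
# Wait out the transmission slots of every higher-priority forwarder, then send
# only the packets you have not already heard one of them forward.
def my_send_time(forwarder_list, me, slot):
    """forwarder_list is highest priority first; earlier entries get earlier slots."""
    return forwarder_list.index(me) * slot

def packets_to_forward(my_packets, overheard_forwarded):
    return [p for p in sorted(my_packets) if p not in overheard_forwarded]

flist = ["node-near-dst", "node-mid", "node-near-src"]
print(my_send_time(flist, "node-mid", slot=0.01))                    # wait one slot
print(packets_to_forward({1, 2, 3, 4}, overheard_forwarded={2, 4}))  # [1, 3]
```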

A map of successfully transmitted packets is kept at each node, which is updated in response to overhearing neighbors and receiving maps from them. This acts as a gossip mechanism, keeps the maps fairly synchronized between nodes, and can also prevent unnecessary retransmissions when the standard MAC ACKs aren't received. When the sender discovers that 90% or more of its packets have been transmitted to the destination it sends the remainder using traditional routing. This is because sending those few packets would require running a whole batch, which is expensive for so few packets.

This method allows ExOR to take advantage of transmissions that go unexpectedly far and/or that fall unexpectedly short. A sender might compute that a distant node close to the receiver is a great hop, but packets might not always make it that far. The sender puts that node at the top of the priority list, so if packets do make it that far they will be quickly sent on to the destination, but if they don't, nodes closer to the sender will transmit the packets too. Likewise, if a packet doesn't make it as far as expected, closer nodes (to the sender) will send the packets after a short delay waiting for the expected nodes to transmit.

By running experiments on Roofnet the authors show that ExOR provides an increase in throughput for most node pairs. For short-hop pairs ExOR's map acknowledgement prevents retransmission and boosts throughput by almost 35%. For more distant pairs the improvement is more like a factor of two to four due to the many path options in ExOR.

Comments
I liked the idea in this paper. Taking advantage of the inherent broadcast nature of wireless is a great way to improve performance. The priority lists are a nice low-overhead way of reaching consensus on which packets have been received where.

A few thoughts:
  • Given that an ETX-like metric is used to rank nodes in the priority list, and that ETX is based on probabilities, usually the highest priority node will receive the packet. This will often not be the node closest to the target (since that node will rank poorly, having a high expected number of re-transmissions). Thus ExOR seems like it will often simply route as ETX would, with added delays for the batching. It would be interesting to see if simply using min hop-count as a metric would actually perform better here, since the fall-back to lower priority nodes would naturally handle the major issue with min hop-count metrics.
  • ExOR's poor interaction with TCP is unfortunate as it prevents it from being plugged in seamlessly. The reduction in variance in transmit times due to ExOR seems like it would boost TCP performance (lower RTOs) if the window size problem could be handled.
I enjoyed this paper and would keep it on the reading list.

Thursday, October 8, 2009

A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols

Summary
This paper aims to make a fair comparison between four ad-hoc wireless routing protocols. The paper describes each protocol, their simulation environment, the workload they test with, and then the results they got for each protocol.

The protocols tested are:
  • DSDV, which does "next hop" routing by maintaining a routing table at each node with a sequence number for each route. Nodes advertise routes with increasing sequence numbers.
  • TORA, which uses a link reversal algorithm, wherein a Query request is sent through the network, then an Update comes back with the best route. A flow model is used where nodes have heights and unreachable nodes are modeled as having a greater height than the local height.
  • DSR, which floods the network with Route Request packets when it needs a route. Nodes cache routes and aggressively stop Route Request packets with Route Reply packets for routes they know about. DSR also snoops on traffic to hear or compute better routes.
  • AODV, which is a combination of DSDV and DSR. Routes are discovered as in DSR but stored all the way along the path to the destination. Sequence numbers are used as in DSDV.

The results of the paper look at the % of packets delivered, routing overhead (i.e. the # of route-related packets transmitted), and path optimality (i.e. how many more hops were taken than really needed).

At a high level, DSR performs much better than most of the other protocols, with DSDV performing very poorly, especially with a lot of mobility, which also basically causes TORA to go into congestion collapse. All protocols perform decently well with low mobility, although TORA and AODV send many more routing-related packets. DSDV does have the advantage that it sends a fairly constant amount of routing-related information.

Comments
I enjoyed this paper. It gives a fairly clear description of each of the protocols, as well as issues the authors encountered while actually trying to implement each one. The analysis goes to great lengths to try and make the comparison fair, and it seems they do a decent job of that (not using TCP for example as it will vary transmission times and rates based on the current state of the network). However, any real deployment would use some transport protocol and it would be nice to see how each of these performed on TCP, the most likely choice.

It was interesting that DSR, probably the simplest of the protocols, performs so much better than all the others. Perhaps the added complexity of 'clever' routing protocols in mesh networks simply isn't worth it.

Wednesday, October 7, 2009

A High-Throughput Path Metric for Multi-Hop Wireless Routing

Summary
This paper looks at improving routing in wireless networks that often use routes with more than one hop. Their metric, expected transmission count (ETX), tries to find the route that has the fewest expected transmissions on the path to the destination. In a network with no loss, this is clearly the same as a minimum hop-count algorithm, but in real networks with expected losses, this is not the case. As the paper points out, minimizing hop-count will tend to favor long hops, which in the wireless world are often weak links with poor throughput and high loss rates. This leads to poor path selection in practice.

The paper shows simulated results that indicate that, in networks where paths are typically 3 or more hops, ETX performs much better than a minimum hop-count algorithm.

The ETX metric considers expected losses in both the forward and reverse directions. If df is the probability that a packet is successfully delivered in the forward direction and dr in the reverse, then ETX = 1 / (df × dr) for a link, and a route's metric is the sum of its links' ETX values. The delivery ratios are computed using probe packets.
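
The arithmetic is simple enough to show directly; the probe-window bookkeeping is elided and the example numbers are invented.

```python
def delivery_ratio(probes_received, probes_expected):
    return probes_received / probes_expected

def link_etx(df, dr):
    return 1.0 / (df * dr)      # expected transmissions, counting the ACK direction

def route_etx(links):
    return sum(link_etx(df, dr) for df, dr in links)

# Why ETX beats min hop-count: two clean-ish short hops beat one lossy long hop.
print(link_etx(0.6, 0.6))                         # ~2.78 for the single long hop
print(route_etx([(0.95, 0.95), (0.95, 0.95)]))    # ~2.22 for the two short hops
```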

The authors implement ETX in the DSDV and DSR routing schemes. Their results show that it works well in both cases with some minor modifications.

Finally the authors point out that their metrics are not perfect as the probe packets are always the same size (and smaller than typical data packets) but links might have different characteristics when the packets are of different sizes. It seems like this could be solved by including metrics from actual packet transmissions into the algorithm, and also by varying the probe packet size.

Thoughts
I liked this paper. It gives a clear description of the problem with minimum hop-count routing, and nicely explains their new metric.

A few issues I had:
  • The graph style for the throughput measurements is terrible and makes extracting data from the graph difficult. This is mostly just a formatting gripe however.
  • There is a 'probe window' in which a certain number of probes are supposed to have been sent. The delivery ratios are calculated using this expected value and the actual # of received probes. This requires all nodes in the network to have the same configuration for window size and probe frequency and could make incremental upgrade difficult.
  • Probe packets contain the # of probes received from each of the node's neighbors. In a dense network this could make the probe packets quite big and hurt overall throughput.
  • DSR with failure feedback does almost as well as ETX. Does this indicate that really all we care about is failed links and we don't need to do all the extra work ETX is doing?

Still, I enjoyed this paper and think it's worth keeping on the reading list.

Tuesday, October 6, 2009

Architecture and Evaluation of an Unplanned 802.11b Mesh Network

Summary
This paper looks at a deployment of Roofnet, an unplanned mesh network in an urban environment. The basic architecture is that the nodes connect to each other and use a shortest-path routing algorithm to find paths to gateway nodes, that is, the small percentage of Roofnet nodes that are also connected to the Internet.

The nodes maintain a database of link metrics and do source routing, in which each TCP session computes the best route using Dijkstra's algorithm over the link DB. The metric used combines the link throughput with the probability that packets will get through. This sometimes favors putting the WiFi cards into high-loss but also high-throughput modes, which meant setting a custom policy on the wireless NIC. To discover the link metrics, nodes can flood the network and listen for replies, or snoop on other nodes' floods.
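
A sketch of that route computation: treat each link's cost as inversely proportional to (throughput x delivery probability) and run Dijkstra over the link database. The cost model is my paraphrase of the description above, not Roofnet's exact metric, and the link_db format is invented.

```python
import heapq

def best_route(link_db, src, dst):
    """link_db: node -> list of (neighbor, throughput_mbps, delivery_prob)."""
    dist, prev, seen = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        for v, tput, p in link_db.get(u, []):
            cost = d + 1.0 / (tput * p)        # slow or lossy links cost more
            if cost < dist.get(v, float("inf")):
                dist[v], prev[v] = cost, u
                heapq.heappush(heap, (cost, v))
    if dst != src and dst not in prev:
        return None                            # no path to the gateway
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    return [src] + path[::-1]
```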

The paper evaluates the performance and the robustness of Roofnet, noting that they get a mean throughput of 627 kbit/sec. They also look at simulated results of handling failures, noting that most nodes choose a wide variety of neighbors for their first hop, so the failure of a single neighbor shouldn't affect performance too much. They do find, however, that throughput will drop significantly, as there are often only one or two 'fast' neighbors.

Thoughts
I thought this was an interesting paper. They manage to get a working network set up with minimal work by the volunteers and it actually appears to work. They get decent throughput (although the median isn't as good as the mean and might have been a better metric).

One major concern I have is that the paper does not seem to take the performance of the gateway's connection to the WAN into account. It's all well and good if I can get to my gateway at 600kbps, but if that gateway is overloaded on its WAN link I won't see good performance overall. This does not seem to be captured by anything in their algorithm and should really be addressed.

Still, an interesting paper and one well worth reading.

Modeling Wireless Links for Transport Protocols

Summary:
This paper looks at current models of wireless links used in transport-protocol research, and at a number of reasons why they are inadequate. The authors show that modeling a duplex connection gives quite different results than one-way traffic, which is the model often used in simulations.

The paper also highlights a number of features of wireless links, how they affect the performance of transport protocols, what types of links they are present in, and how to model them. These features are:
  • Error losses and corruption. These losses can make TCP think there is congestion when there isn't, leading to poor performance. While FEC and link-layer retransmission ease this somewhat, there can still be delays and re-ordering due to limited retransmit buffers at the end-points of the wireless links. This can be modeled by dropping packets, but we need to consider whether our losses are correlated or not.
  • Delay Variation. Link delay can change suddenly in a wireless world due to mobility, or scheduling at the wireless end-points. If the TCP connection assumes this is due to congestion (i.e. it's using delay based congestion-control) then it will back off sending unnecessarily.
  • Packet Reordering - This can be caused by link-layer recovery mechanisms and can cause retransmission from the end-host if it causes enough of a delay in delivery.
  • On-Demand Resource Allocation - In cell networks a channel needs to be allocated if a client hasn't sent any data for a threshold amount of time. This can cause delays. The authors show how to model this by introducing a delay when a packet arrives at a queue that has been empty longer than the threshold time (see the sketch after this list).
  • Bandwidth Variation - Different channels can have different bandwidths and clients might be moved from channel to channel. This can cause spurious timeouts in TCP if the RTT suddenly goes up.
  • Asymmetry in Bandwidth and Latency: With an asymmetric link TCP can get congestion on the ACKs. This can easily be modeled by setting the right parameters.
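
To make the on-demand allocation point concrete, here is a toy sketch of the modeling trick (my own code; the threshold and delay values are made up): a packet arriving at a queue that has been empty longer than the threshold pays an extra allocation delay.

    class OnDemandLink:
        # Toy model: if the queue is empty and has been idle longer than
        # idle_threshold, the next packet pays an extra allocation_delay.
        def __init__(self, idle_threshold=0.5, allocation_delay=1.0):
            self.idle_threshold = idle_threshold
            self.allocation_delay = allocation_delay
            self.queue = []
            self.last_busy = 0.0   # last time the queue held a packet

        def enqueue(self, pkt, now):
            extra = 0.0
            if not self.queue and now - self.last_busy > self.idle_threshold:
                extra = self.allocation_delay   # channel must be re-allocated
            self.queue.append((pkt, now + extra))
            self.last_busy = now
            return extra   # delay added before this packet can be sent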

The paper also looks at queue management and how drop-tail queuing can lead to delay and problems due to spurious retransmission timeouts. They argue that drop-tail modeling is okay for cell and WLAN links, but that RED is perhaps a better model for satellite links and for future WLAN links. Finally the paper looks at ways transport protocols could handle the various problems modeled in the paper.

Thoughts:
I can't say I loved this paper. It felt a bit like a laundry list of wireless link issues, which might be very useful to designers but wasn't particularly illuminating from a theoretical perspective. Some of the experiments showing the real problems with past models were interesting.

I could imagine a very interesting follow-up to this paper in which solutions to the various issues they have identified here are proposed and evaluated. This is an important starting point toward that paper.

Thursday, October 1, 2009

A Comparison of Mechanisms for Improving TCP Performance over Wireless Links

Summary
This paper looks at a number of approaches for improving the performance of TCP over wireless links. TCP's assumption that a lost packet indicates congestion is not particularly valid in the wireless environment, where packet loss is more often due to interference of some sort. This leads to poor performance, as TCP needs to wait for duplicate ACKs to retransmit, has long timeouts (200ms), and will reduce its congestion window even though that's not necessary.

The paper looks at three broadly different approaches to improving the performance of TCP over wireless links.

  • Link-layer protocols: The tools here are forward error correction and/or automatic repeat requests. The paper argues that the advantage of this approach is that it layers nicely in the network stack, operating independently of higher-level protocols and thereby shielding them from the lossy nature of the wireless link. The challenge is to make sure the link-layer protocol plays nicely with TCP. One proposed scheme that works quite well has the base station snoop on the TCP traffic and buffer some number of packets; when the station sees a duplicate ACK it can suppress it and simply retransmit the packet itself (a rough sketch of this appears after the list). In the experiments this protocol is shown to perform well.
  • Split connection protocols: In this case the connection is 'split' on either side of the wireless link and a highly tuned (for wireless) protocol is used over the wireless hop. The concern here is that the end-to-end semantics of the link are not preserved. In particular the congestion window is independent for each side of the connection and this is shown to lead to poor performance in the experimentation section. The authors conclude that split connections are not necessary for good performance of TCP over wireless links.
  • End-to-end protocols: This approach modifies TCP itself to deal with wireless links. Techniques include adding selective ACKs and/or Explicit Loss Notification. This approach also performs quite well in practice.
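
Here is a rough sketch of the TCP-aware, snoop-style base-station logic from the first bullet — my own simplification, not the actual Snoop implementation: cache data segments heading toward the mobile host, answer duplicate ACKs with a local retransmission, and suppress those duplicates so the fixed sender never sees them.

    class SnoopAgent:
        # Toy TCP-aware base station: cache segments sent toward the wireless
        # host and answer duplicate ACKs with a local retransmission.
        def __init__(self, wireless_send):
            self.wireless_send = wireless_send   # function that transmits a segment
            self.cache = {}                      # seq -> segment
            self.last_ack = -1
            self.dup_acks = 0

        def on_data_from_wired(self, seq, segment):
            self.cache[seq] = segment            # remember it in case it is lost
            self.wireless_send(segment)

        def on_ack_from_wireless(self, ack):
            if ack > self.last_ack:              # new ACK: clean the cache, pass it on
                for seq in [s for s in self.cache if s < ack]:
                    del self.cache[seq]
                self.last_ack, self.dup_acks = ack, 0
                return ack                       # forward to the fixed sender
            self.dup_acks += 1                   # duplicate ACK: loss on wireless hop
            if ack in self.cache:
                self.wireless_send(self.cache[ack])   # local retransmit
            return None                          # suppress the duplicate ACK
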
The authors run a number of experiments with PC base stations and laptop nodes. They use an exponential drop model and test a number of different loss rates. They notice that link-level protocols that are not TCP-aware perform quite poorly, because ordering is not preserved by the link layer and so both the link layer and the end host often end up retransmitting packets. Adding TCP awareness and SMART recovery leads to much better performance in this case.

The authors also note that split-connection protocols are not necessary. They can be made to perform almost as well as some of the link-layer protocols, but the added complexity does not really buy you anything.

Comments
I enjoyed this paper. It gives a very nice summary of the various techniques for dealing with packet loss in a wireless network and why each protocol behaves the way it does. I also thought the snoop technique was particularly clever and was gratified to see it work well in practice.

As always in a paper whose results are based on experiments, more and more diverse experiments would have been nice. In particular it would have been nice to run some long-lived experiments in a real environment (the paper models packet losses) and see if the results looked similar to the modeled ones. It was nice that the network was real, however, and not simulated.

I also think there is more work that could be done in improving split-connection protocols. Rather than running TCP over the wireless hop, why not use some super-optimized wireless protocol? The problem of violating end-to-end semantics seems difficult to get around in this case, however.

Wednesday, September 30, 2009

MACAW: A Media Access Protocol for Wireless LAN's

Summary
This paper develops a new link-layer access protocol for mobile wireless devices and evaluates its performance in a number of different configurations. The devices in this case are pads and they communicate with base stations. Because the pads are mobile, and the range of each base station is quite small, this is a complex problem.

The authors first show why a carrier-sense approach fails. Specifically, if two pads are in range of a base station, but not of each other, they will both sense, find nothing, and send, but there will be interference at the base-station. This indicates that in wireless networks conflicts occur at the receiver, not the sender.

The MACA protocol attempts to deal with this problem by using a Request To Send (RTS) packet, which a receiver replies to with a Clear To Send (CTS) packet if there is no congestion at the receiver. By snooping on this traffic, nodes can coordinate and not step on each other, and contention is resolved at the receiver, as desired.

MACAW improves on MACA in a number of key ways:
  • They improve the backoff algorithm so that it is not so aggressive when there is a loss. In addition, to improve fairness, they share backoff information (a rough sketch of this appears after the list). In MACA this was a problem, as one host could end up dominating a link while the others all backed off.
  • They introduce a fairer contention model that tries to allocate bandwidth per stream of data rather than per sender (so, for example, a base station with two outgoing streams gets a share for each stream rather than a single sender's share).
  • ACKs are added for faster retransmission, since TCP's timeout is geared to the wide area and is therefore too long for recovering from losses on the wireless hop.
  • They add a Data Sending (DS) packet that gets sent after a CTS message. This prevents the case where a host hears the RTS but not the CTS. This is a bit counter-intuitive, as it would seem safe for a host to send in this case. However, with the bi-directional RTS-CTS-DATA exchange, contention can occur at both ends of the link, so the DS packet is necessary.
  • Finally, they address the situation in which two pads are communicating with different base stations, both pads want to receive data, and they are close enough to interfere with each other. In this case one pad will likely dominate the channel, as the other can only send a CTS if it receives an RTS right after the first pad finishes a data transmission, which is a small window. To fix this the authors add a Request-for-Request-to-Send (RRTS) packet, which can prompt a sender to re-send its RTS. This doesn't deal with the case where one pad is blocked from the channel entirely and never even hears the initial RTS packet, however.
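
As a rough sketch of the gentler, shared backoff from the first bullet (my own toy version, assuming a multiplicative-increase/linear-decrease rule and simple backoff copying; the real constants and packet-header details differ):

    import random

    class Backoff:
        # Toy backoff: grow the counter on collision, shrink it gently on
        # success, and copy whatever value we overhear from other stations so
        # that one unlucky station does not end up starved.
        MIN, MAX = 2, 64

        def __init__(self):
            self.cw = self.MIN

        def on_collision(self):
            self.cw = min(int(self.cw * 1.5), self.MAX)   # multiplicative increase

        def on_success(self):
            self.cw = max(self.cw - 1, self.MIN)          # linear decrease

        def on_overheard(self, other_cw):
            self.cw = other_cw                            # backoff copying for fairness

        def slots_to_wait(self):
            return random.randint(1, self.cw)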

Comments
This was an interesting paper, and they propose a number of interesting and clever solutions to the problems one encounters in a wireless environment. The paper does feel a bit like they thought up some situations and picked specific solutions for them, however, and since their results are all simulated versions of those situations, it's no surprise they do better. Their approach is slower than standard MACA on an uncontended link (due to the extra RTS/CTS/ACK traffic), so it's not clear MACAW is always better.

I would like to have seen more results from a general deployment, or an attempt to generalize their solutions and show why other methods wouldn't work, and/or why the situations they consider in the paper cover the majority of situations mobile clients will run into in the real-world.

An interesting paper nonetheless, and one I would keep on the reading list.

Tuesday, September 29, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

Summary
This paper tackles the same issue as the previous CMU one, but takes a rather different methodology and comes up with some different results.

Firstly, their workload uses a fixed fragment size rather than a fixed block size. In the CMU paper the fragment size decreases as the number of senders increases, to simulate a fixed block size. Both of these situations seem quite possible in practice.

They also look at changing parameters other than the RTO, like randomizing the initial RTO value. They found that none of these approaches really helped the incast problem. The hypothesis in the paper is that the switch buffers and the request pattern "re-synchronize" the machines, since a buffer overflow affects everyone at the same time.

This paper does not find that a 200µs minimum RTO improves performance as much as the CMU paper did. They argue that this is caused by delayed ACKs: since the network RTT is around 2ms, an RTO much shorter than that causes unnecessary retransmissions of data.

Another finding of this paper is that disabling delayed ACKs actually reduces performance as it makes TCP overflow its window, causing congestion in the network.

Finally the paper proposes a model to explain the results they see and to predict goodput values for particular configurations. This explains some of the results in the paper, although it isn't clear that it will generalize to other workloads.

Comments
This was an interesting paper in that it challenges some of the results from the CMU paper. In particular the CMU paper seemed to argue that reducing the RTO would be a silver bullet and, in large part, solve the incast problem. This paper suggests this might not be the case. This is a very interesting result and shows that this area needs more study.

The model struck me as a bit too simple, as I did not see how it would adjust to changing workloads, but it's a good start and, if made richer, could be a very useful tool in studying this area.

Monday, September 28, 2009

Safe and Effective Fine-grained TCP Retransmission for Datacenter Communication

Summary
This paper looks at the problem of TCP Incast. The paper states this occurs when three conditions are met in a network system:
  1. High-bandwidth, low-latency networks with switches that have small buffers
  2. Clients that have barrier-synchronized requests (that is, requests that need to wait for all participants to respond to make progress)
  3. Responses to those requests that don't return very much data.
What happens in these situations is that all the participants respond simultaneously and overflow the (small) switch buffers. This causes losses, and the senders wait at least the TCP minimum retransmission timeout (RTOmin) before retransmitting. Since the client is waiting for all responses, and the network is much faster than RTOmin, the link goes idle, often achieving only a few percent of its capacity.
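
A quick back-of-the-envelope sketch (my own illustrative numbers, not the paper's) shows why even a single RTOmin stall craters goodput on a fast link:

    link_bps   = 1e9        # assumed 1 Gbit/s datacenter link
    block_bits = 1e6 * 8    # assumed 1 MB barrier-synchronized block
    rto_min    = 0.200      # 200 ms default TCP minimum RTO

    transfer_time  = block_bits / link_bps      # ~0.008 s if nothing is lost
    with_one_stall = transfer_time + rto_min    # one buffer overflow -> one RTO wait
    goodput_fraction = transfer_time / with_one_stall
    print(round(goodput_fraction, 3))           # ~0.038, i.e. a few percent of capacity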

The paper runs a simulation of a system that fetches 1MB blocks as well as testing it on a real cluster. They show that the simulation fairly closely matches reality, modulo some real world jitter not present in the simulation.

The approach taken to solving the problem is to reduce RTOmin to a much smaller value. The current Linux TCP stack uses jiffy timers, which turns out to mean the lowest practical RTOmin is 5ms. It turns out that much lower values, on the order of 5µs, are preferable.

To make very short timeouts possible the authors modify the TCP stack to use high resolution hardware timers and show that the incast problem is greatly reduced by using such short timeouts. In fact, goodput remains fairly constant, even with 45 clients, when the RTOmin is allowed to go as small as possible.

Finally the authors look at the effects reducing RTOmin could have in the wide area and conclude that it should not affect performance significantly. They also note that there is not much overhead in maintaining such short timers, as they only fire when packets are actually lost.

Comments
I thought this was a good paper, thorough in its description and analysis of the problem. The actual technical work was not extensive, but that is sort of the point: we have a real problem here, and it's actually not that hard to solve. I liked that they ran simulations but also real experiments to validate them. The discussion of the wider impact of short timers was also good, as this is a real concern in deploying their solution.

It would have been interesting to see the experiments run on more varied workloads, however, as it's not clear the same patterns would be seen. Also, how serious is incast if you have barrier synchronization but only need to wait for (say) half of your nodes to respond, instead of all of them?

Tuesday, September 22, 2009

VL2: A Scalable and Flexible Data Center Network

Summary
VL2 is another approach to solving Ethernet's scaling problems. The authors set out three goals for the design of their new system:
  • Always achieve the max server-to-server capacity, limited only by the NICs on the sender and receiver
  • Don't allow one service to degrade the performance of another
  • Allow addresses to move to anywhere in the data center (i.e. a flat address space)

To achieve these goals VL2 plays a similar trick to PortLand, in which the address for a server encodes its location. Location-specific Addresses (LAs) are IP addresses, and the system can use an IP-based link-state routing protocol to route efficiently on them. Application-specific Addresses (AAs) are what end nodes have, and these are hidden by the switches except for the last hop.

The key point is that servers think they are all on the same IP subnet, so they can move anywhere they want and maintain the same AA, which will be turned into a different LA to keep routing correct and efficient.

In order to minimize hotspots and ensure fairness, each flow takes a random path through the network: Valiant load balancing bounces flows off randomly chosen intermediate switches, and ECMP spreads them across equal-cost paths. ARP queries (as in SEATTLE and PortLand) are intercepted and forwarded to a directory service to minimize flooding in the network.
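
Here is how I picture the AA/LA split plus random path selection fitting together — a hypothetical sketch with made-up addresses and a made-up directory, not VL2's actual implementation:

    import hashlib

    # Hypothetical directory: application address (AA) -> locator address (LA)
    # of the switch currently in front of that server.
    directory = {"10.0.0.5": "20.1.1.1", "10.0.0.6": "20.2.2.2"}
    intermediate_switches = ["30.0.0.1", "30.0.0.2", "30.0.0.3"]

    def encapsulate(dst_aa, flow_id):
        dst_la = directory[dst_aa]                   # look up where the AA lives right now
        # pick a random-but-sticky intermediate switch for this flow
        h = int(hashlib.md5(flow_id.encode()).hexdigest(), 16)
        bounce = intermediate_switches[h % len(intermediate_switches)]
        # outer headers route to the intermediate, then to the destination's LA,
        # which strips them and delivers to the unchanged AA
        return [bounce, dst_la, dst_aa]

    print(encapsulate("10.0.0.6", "10.0.0.5:4242->10.0.0.6:80"))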

Comments
This paper takes a common industry tack: 'What do we have, what do we want, and how can we change as little as possible to get there?'. While this leads to very practical designs, one must ask whether these are really the 'best' solutions.

The evaluation was quite good in this paper (wow, real hardware). They show that VL2 performs well under load, and provides fair allocation to services in the face of other services changing their usage patterns.

A major concern I have is that they try to keep their directory servers strongly consistent. While this could prevent odd (and hard to debug) situations in which servers are out of sync, it means that a network partition could make the entire service un-writable, effectively preventing new nodes from entering the network. While they hand-wave a bit about being able to sacrifice response time for reliability and availability (by timing out on one server and then writing to another), this seems to conflict with their statement that the replicated state machine (the actual service that maintains the mappings) 'reliably replicates the update to every directory server'.

Monday, September 21, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

Summary
PortLand is an attempt (much like SEATTLE) to make Ethernet more scalable. The motivations are much the same as the SEATTLE paper (discussed below), although the approach they take is quite different.

PortLand sacrifices some flexibility in exchange for better routing. They do not support arbitrary topologies, but rather assume the network will be organized as a "fat tree". That is, end-nodes connected to some number of levels of aggregation switches, and finally connected to 'core' switches before going out to the wide area. By making this assumption about topology they are able to maintain much less routing information at each switch, and avoid many of the broadcasts that are necessary to establish a topology in traditional Ethernet.

Switches go through a quite clever algorithm by which they can determine whether they are 'edge' switches (connected to end nodes), 'aggregation' switches (connecting to edge or aggregation switches below them and to some number of switches above them), or 'core' switches (only downlinks internally, plus the connection to the wide area). This greatly reduces the administration overhead, although it is unclear how well this algorithm deals with hardware misconfiguration (i.e. things plugged into the wrong port).

End-nodes are assigned a 'Pseudo MAC' (PMAC) which contains not only enough information to specify the target host, but also to locate it. Switches can use a prefix of the PMAC to route packets in the fat tree and can therefore maintain much less routing state.
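
A minimal sketch of the PMAC idea as I understand it (the field widths and layout below are made up for illustration; the real PortLand encoding differs):

    def make_pmac(pod, position, port, vmid):
        # Hypothetical 48-bit layout: pod.position.port.vmid, rendered as a MAC.
        return "%02x:%02x:%02x:%02x:%02x:%02x" % (
            pod >> 8, pod & 0xff, position, port, vmid >> 8, vmid & 0xff)

    def route_on_prefix(pmac, my_pod):
        # A core switch only needs the pod prefix to pick a downlink; switches
        # lower in the tree look one field deeper, and so on.
        pod = int(pmac[0:2] + pmac[3:5], 16)
        return "down toward pod %d" % pod if pod != my_pod else "route within pod"

    pmac = make_pmac(pod=3, position=7, port=1, vmid=42)
    print(pmac, "->", route_on_prefix(pmac, my_pod=0))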

There is also a 'fabric manager' which is responsible for handling ARP requests. Edge switches intercept ARP broadcasts, forward them to the fabric manager, which looks up the PMAC for the request, and returns it to the switch, which turns it back into an ARP response. The fabric manager is replicated and uses only soft-state.

Piggybacked on top of the auto-configuration protocol is a keep-alive system. If a location discovery message (which doubles as a keepalive) is not received from a switch within a certain amount of time, the switch is assumed to be down. The fabric manager aids in recovery by receiving notifications of failures and relaying them to the relevant switches, which can then update their routing tables.

Comments
I quite liked the model this paper presented. The fat-tree architecture is what many data centers already use, and it's a clever way to exploit the topology. I especially liked the auto-configuration of switch positions, as it removes a large amount of the configuration overhead for the network. It was unclear, however, how switches would know they were connected to the wide area correctly, and specifically what their default route out should be.

The evaluation section was quite disappointing in this paper. They did not compare themselves to any other systems, nor did I feel they gave a very convincing demonstration of their own scalability. They examine link failures, but don't look at failure of the fabric manager at all. The claim is that the fact that it is replicated will solve the problem, but it's also unclear how inconsistent fabric managers would impact the ability of PortLand to route packets.

I did like their thoughts on migrating VMs, and their ability to correctly migrate TCP connections. This is going to be a useful feature going forward, in my opinion.

Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks

Summary
This paper identifies a major problem with existing diagnostic tools: they identify faulty *hosts*, but operators almost always already know which host is faulty; what they really need is to isolate the application-specific problem on that host. The authors thus identify a conundrum: we want a diagnostic tool that is general (application agnostic) but that gives application-specific information.

As a resolution the paper takes an inference-based approach, looking at feature vectors of various generic and application-specific metrics and comparing their current values to historical data. A dependency graph is generated over the components in the system, with directed edges between components that can directly affect each other. During analysis the algorithm assigns high weights to edges with correlated state: that is, if there is an edge from node A to node B, and the historical states of A and B are similar to their current states, it is likely that something in A's state is causing B's state.
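
Here is a rough sketch of how I imagine the history-based edge weighting working (my own simplification with made-up names; the paper's actual procedure is more involved):

    import math

    def similarity(state_then, state_now):
        # 1 when the two state vectors are identical, smaller as they diverge
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(state_then, state_now)))
        return 1.0 / (1.0 + dist)

    def edge_weight(history, a_now, b_now):
        # history is a list of (a_state, b_state) snapshots. Find past moments
        # where A looked like it does now and check whether B also looked like
        # it does now; if so, A's current state plausibly explains B's.
        scores = []
        for a_then, b_then in history:
            if similarity(a_then, a_now) > 0.8:        # "A was in a similar state"
                scores.append(similarity(b_then, b_now))
        return sum(scores) / len(scores) if scores else 0.5   # 0.5 = no evidence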

Using this weighted graph the system can identify probable culprits for an anomaly. A study of a number of reports sent to Microsoft showed that they could identify the correct culprit about 80% of the time, and almost always had the culprit in their top-five list.


Comments
This paper reads a bit differently being an industry paper, and can feel a bit like a marketing pitch at times. Nevertheless the technique was quite interesting and it clearly worked since I remember NetMedic being quite popular.

I also found the overall approach interesting. I got the sense that they began by looking at problem reports, figured out what a good solution would look like, and then set about finding a method to produce that solution.

One concern I have with this approach is that it requires you to run NetMedic on all hosts in your network, and statistics-gathering software is prone to making systems unstable. Norton was/is notorious for this, and one worries that the cure becomes worse than the disease.

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

Summary
This paper describes a new network architecture designed to ease the scalability and manageability problems that Ethernet suffers from. The authors present a nice overview of how Ethernet works and its limitations, then describe their own architecture and show how it addresses Ethernet's failings.

The authors point out a few key reasons Ethernet runs into trouble in large networks:
  1. Bridge forwarding tables can grow very large because Ethernet uses a flat addressing scheme
  2. Host information needs to be flooded, which is very expensive on a large network
  3. Traffic must be routed on a single spanning tree (to avoid flood loops), but this means it can not always take the shortest path, and puts a high load on nodes near the root of the tree
  4. Using broadcast for DHCP/ARP is expensive
The authors then consider using a number of small Ethernet LANs connected together with IP, which deals with several of the scalability problems. This approach works on the efficiency front, but it is a manageability nightmare, since all the LANs need to be kept in sync. In addition, mobility is a problem, since hosts might move between subnets.

To address these challenges the authors propose two major innovations in SEATTLE. Firstly, link-state information is replicated at every switch using a link-state protocol, which means shortest-path routing can be used; since this information is fairly stable, the approach is reasonable. Secondly, the mappings of IP addresses to end-host MAC addresses are stored in a one-hop DHT so they can be looked up efficiently, even with many hosts. This also improves mobility, since a new mapping can be published into the DHT and only needs to be changed at the switch(es) responsible for storing that mapping.

There are some tricks the authors use on the DHT front. Consistent hashing is used for the keys, so there is minimal data churn when switches enter and leave the network. Virtual nodes are used to load-balance over machines of different capacities.
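
Here is a tiny consistent-hashing sketch (my own code, not SEATTLE's) of how a mapping could be resolved in one hop by hashing its key onto a ring of switches, with virtual nodes for load balancing:

    import hashlib
    from bisect import bisect_right, insort

    def h(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    class OneHopDHT:
        # Every switch knows every other switch (from the link-state protocol),
        # so a lookup is a single hop to the switch that owns the key's hash.
        def __init__(self, switches, virtual=4):
            self.ring = []                       # sorted list of (hash, switch)
            for sw in switches:
                for v in range(virtual):         # virtual nodes for load balance
                    insort(self.ring, (h("%s#%d" % (sw, v)), sw))

        def owner(self, key):
            i = bisect_right(self.ring, (h(key), chr(0x10FFFF)))
            return self.ring[i % len(self.ring)][1]

    dht = OneHopDHT(["switch-A", "switch-B", "switch-C"])
    print(dht.owner("ip:192.168.1.7"))   # the switch storing that mapping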

Finally the authors show how existing end hosts would not need to be changed at all to plug into a SEATTLE network. This is quite important for adoption in my mind.

Comments
I'm always happy to see a paper about an application of a DHT. They are elegant answers in search of problems. This was quite a nice use of the technology. The paper also does a very nice job of both describing the current state of the world and highlighting its issues.

A couple of issues I have with the paper:
- The architecture seems quite susceptible to malicious nodes. A node returning incorrect mappings could wreak havoc with a subset of end hosts. Also, any churn in the network could make routing very unstable. Perhaps this is less of a concern given that switches are more managed pieces of hardware than end hosts.
- The problem of the node that stores a mapping I want being far away from me is a tricky one. On the one hand, SEATTLE caches aggressively, so this might be less of an issue than I think, but their solution of multi-layer DHTs seemed a bit ad hoc and also a management problem.

Still, an interesting solution to a real problem, and a good read. As this is a very recent paper it will be interesting to see if there is any adoption of SEATTLE, or something similar.

Monday, September 14, 2009

Congestion Control for High Bandwidth-Delay Product Networks

Summary
The title of this paper is a bit misleading. What this paper really does is propose a new protocol for congestion avoidance in the network, one that would replace TCP. The authors argue that TCP performs poorly in networks with a high bandwidth-delay product (hence the title), and argue that these types of networks are becoming more common in today's Internet.

The primary change that the eXplicit Control Protocol (XCP) makes relative to TCP is that it adds a congestion header providing explicit notification of congestion to endpoints, rather than relying on dropped packets to tacitly signal congestion as TCP does. This modification allows routers to be more clever about their policies and, interestingly, allows routing with almost no dropped packets (in their simulations, which have only well-behaved end hosts).

The protocol (at the endpoints) works roughly as follows:
Sender: put my current congestion window size and round-trip-time estimate in the header of each packet. The ACKs for these packets come back carrying a requested delta to my congestion window, computed along the path; adjust the window accordingly.
Receiver: simply copy the delta that has been computed along the way from the data packet into the ACK header.

Routers along the packet's path compute the delta. Each router can overwrite the header, so the value left in the packet should be the one set by the bottleneck router. Feedback for utilization and for fairness is computed separately, allowing different policies to be used for each: the authors use multiplicative-increase/multiplicative-decrease for utilization and additive-increase/multiplicative-decrease for fairness.
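
A very rough sketch of the per-packet mechanics as I read them (hypothetical field names and a toy feedback rule; the real efficiency and fairness controllers compute per-interval aggregate feedback and estimate the number of flows from the cwnd/RTT values carried in headers rather than keeping per-flow state):

    class Packet:
        def __init__(self, cwnd, rtt):
            self.cwnd, self.rtt = cwnd, rtt    # filled in by the sender
            self.delta = float('inf')          # requested cwnd change, set by routers

    def router_process(pkt, spare_capacity, flow_estimate):
        # Toy stand-in for the controllers: offer this flow an even slice of the
        # spare capacity, and only ever shrink the delta, so the bottleneck
        # router's value is what reaches the receiver.
        offered = spare_capacity / max(flow_estimate, 1)
        pkt.delta = min(pkt.delta, offered)

    def receiver_build_ack(pkt):
        return pkt.delta                       # copy the feedback into the ACK

    def sender_on_ack(cwnd, delta):
        return max(1.0, cwnd + delta)          # apply the router-computed change

    # one round trip through a two-router path
    p = Packet(cwnd=10.0, rtt=0.1)
    router_process(p, spare_capacity=4.0, flow_estimate=2)    # plenty of headroom
    router_process(p, spare_capacity=-2.0, flow_estimate=2)   # congested bottleneck
    print(sender_on_ack(10.0, receiver_build_ack(p)))         # 9.0: told to slow down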

There are a number of simulations that show that XCP outperforms TCP in a variety of environments.

XCP does not prevent misbehaving senders any better than TCP does; however, it is much easier to identify them with XCP. An edge router can send an ACK asking for a congestion window decrease; if the sender does not respond by shrinking to the new window, it is misbehaving and can be punished appropriately.

Thoughts

This was an interesting paper. Having to use dropped packets as the signal for congestion really limits the performance of TCP, especially in situations where the packet wasn't really dropped but was only assumed dropped, as can be the case in high bandwidth-delay product networks. The same is true in things like wireless networks, where drops are due not to congestion but to a crummy link.

XCP seems to solve many of TCP's weaknesses quite well. There are a few concerns I have, however.

Firstly, how do you deal with a misbehaving router along the path of a packet? It seems such a router could overwrite the congestion header and upset the whole algorithm. TCP has problems with misbehaving routers as well, but XCP seems like it could suffer from more insidious problems here where, say, only one source has its congestion headers bumped up, leading to unfairness.

Secondly, per-flow state is still needed to identify misbehaving senders. This may be unavoidable, but it is still unfortunate. Moving this computation to edge routers could alleviate the problem somewhat, as will increasing raw processing power in routers.

Finally, adoption seems difficult. Although the authors suggest a way to have XCP and TCP co-exist fairly, it still requires software upgrades at a huge number of points, and a single hop on the path not supporting XCP means the whole connection needs to fall back to TCP. This path seems difficult to go down, but perhaps as bandwidth-delay products keep rising the switch will become essential.


Sunday, September 13, 2009

Random Early Detection Gateways for Congestion Avoidance

Summary
This paper describes a method for controlling congestion in a network. The technique runs on inner nodes or gateways and is designed to be easy to implement, require no per-connection state, and be somewhat fair, in that it drops packets in a manner roughly proportional to a connection's share of the total bandwidth at the gateway.

The basic technique is to maintain an average queue length (an exponentially weighted moving average) and to operate in one of three modes based on two thresholds. If the average is below a minimum threshold, all packets are forwarded. If it is above a maximum threshold, all packets are marked for congestion and/or dropped. In between the two thresholds the gateway randomly drops packets, with a probability that grows as the average queue length approaches the maximum threshold.
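
A minimal sketch of the algorithm as described above (my own code with made-up parameter values; the real RED also spaces out drops using a count since the last drop, which I omit):

    import random

    class REDQueue:
        def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.002):
            self.min_th, self.max_th, self.max_p = min_th, max_th, max_p
            self.weight = weight      # EWMA weight for the average queue length
            self.avg = 0.0
            self.queue = []

        def enqueue(self, pkt):
            self.avg = (1 - self.weight) * self.avg + self.weight * len(self.queue)
            if self.avg < self.min_th:
                self.queue.append(pkt)            # below min: always accept
            elif self.avg >= self.max_th:
                return "drop"                     # above max: drop/mark everything
            else:                                 # in between: probabilistic drop
                p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
                if random.random() < p:
                    return "drop"
                self.queue.append(pkt)
            return "queued"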

This works because the queue is aggressively kept below the maximum, but since the average is computed over time (rather than using the instantaneous queue length) the gateway can absorb short bursts of traffic. In addition, connections sending lots of packets are more likely to have their packets marked/dropped, so the gateway is in some sense 'fair'.

The paper presents a number of simulations showing that RED does indeed control queue size, and that usage of the network is still quite high, even in the face of congestion. They discuss a number of parameters that can be set in the algorithm and their effects on the performance of the gateway. Some optimal values are suggested.

Discussion
I enjoyed this paper. The authors make a good case for the practicality of their approach (showing it is cheap to implement) and show that the algorithm works well. I do think the number of parameters, and the sensitivity of the algorithm to those parameters, could be a barrier to adoption; admins might worry that if they tweak something slightly wrong, performance could become very poor.

The approach also doesn't explicitly deal with badly misbehaving clients. While those clients will be more likely to have their packets dropped, they don't get punished any extra for misbehaving.

Still, the elegance and simplicity of the approach make it quite appealing and I found this a worthwhile paper.

Wednesday, September 9, 2009

Core-Stateless Fair Queueing

Summary
This paper describes a way of approximating fair queuing (FQ) that does not require routers in the middle of the network to classify packets into flows. It is called Core-Stateless Fair Queuing (CSFQ) because routers in the core of the network need to maintain no per-flow state. This reduces the processing requirements on core routers while still providing fairer service than first-come-first-served and coming close to the performance of the much more expensive FQ.

Packets must still be classified, but this is only done at the edges of the network, where we can (maybe) assume flows will be slower and fewer. The edge routers also compute each flow's arrival rate using exponential averaging and annotate packets with it; routers additionally maintain an estimate of the link's fair share rate.

At core (or inner) routers this information drives a constant-time algorithm that probabilistically drops packets, using the annotated arrival rate and the fair share rate. Intuitively, the more a flow's arrival rate exceeds its fair share rate, the more likely it is to have a packet dropped. The authors bound how far from the optimal allocation their strategy can stray.
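
A sketch of the core-router drop rule as I understand it (my own code; the edge-side rate estimation and the iterative fair-share estimation are glossed over):

    import random

    class LabeledPacket:
        def __init__(self, label):
            self.label = label        # flow's arrival rate, stamped at the edge

    def core_forward(pkt, fair_share_rate):
        # Drop with probability max(0, 1 - fair/label): a flow sending at twice
        # its fair share loses roughly half its packets; one at or below its
        # share loses none. No per-flow state is needed at the core router.
        drop_prob = max(0.0, 1.0 - fair_share_rate / pkt.label)
        return None if random.random() < drop_prob else pkt

    print(core_forward(LabeledPacket(label=2.0), fair_share_rate=1.0))  # dropped ~half the time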

The authors run a number of simulations with various queueing algorithms and show that CSFQ is much better than most other algorithms, and that it comes very close to deficit round robin, an efficient (but more expensive than CSFQ) FQ algorithm.

The paper also contains a fairly sizable discussion of what to do about misbehaving senders. They argue that TCP-unfriendly flows should be managed, by anything from simply restricting them all the way up to dropping all their packets. The argument for dropping flows that are unresponsive to packet drops is that there exist applications that don't care how many packets they lose (so-called fire-hose applications) and will simply flood the network no matter what scheme is in use, unless that scheme actively punishes them for their behaviour.

Thoughts
This was a very nicely argued paper. The authors clearly state their assumptions (even admitting that FQ might not be necessary at all, while asserting that if it is, their work is vital) and work forward from there. I didn't check the details of the math in the paper, but I assume it is correct, and the results they derive seem reasonable.

It is a pity this scheme requires partitioning the routers in the network into edge and core. I suspect this would be difficult in practice. Furthermore, any core router would have to disallow any edge router that was not annotating packets, meaning a gradual upgrade to this scheme would be difficult if not impossible.

I also liked the discussion on misbehaving senders. This seems a real problem and I think active dissuasion is a good model for routers lest we end up with a million TCP stacks out there purporting to provide faster network access.

This paper was a bit heavy going at times, but a valuable read and one I would keep on the reading list.

Analysis and Simulation of a Fair Queueing Algorithm

Summary
This paper tackles the problem of building gateways that provide fair service in the face of congestion and possibly malicious end-points. Unlike the previous papers we read, this tackles the problem in the middle of the network, not at the end-points. That is not to say that end-point flow control shouldn't also be used; the two can be used in conjunction.

The authors argue that first-come-first-served (FCFS) is not an adequate algorithm for routers since a misbehaving source that can send packets fast enough can use an arbitrarily high amount of the available bandwidth. They propose fair queuing (FQ) as an alternative. Zhang's work on the Virtual Clock gateway queueing algorithm is cited as a similar approach.

Fair Queuing is basically the idea that we track the amount of data sent by each "flow" (in this paper the authors choose source-destination pairs as the unit of a flow) and split the available bandwidth evenly between flows. This requires a per-flow queue, so that if a particular flow sends another packet before its previous one has been transmitted it only lengthens its own queue, and only it suffers drops during congestion.

The "most fair" FQ algorithm is a bit-by-bit round-robin (BR) approach in which one bit is sent from each flow each round such that each flow gets exactly the same amount of bandwidth. Since this is clearly infeasible the authors look at doing packet level queuing. The original propsal from Nagle did not take packet size into account, allowing users to game the system by sending large packets. The authors instead basically order packets by their finishing times (always picking the packet with the soonest finishing time to be sent next) which closely approximates BR.

A simple optimization is done for 'telnet'-style connections. If a connection is inactive (as an interactive connection often will be during wait times) it is given a priority boost, providing better performance for interactive sessions.

The authors evaluate FQ vs. FCFS with a variety of flow-control algorithms and show that FQ provides better service and avoids congestion.

Thoughts
FQ deals nicely with the problem of misbehaving senders: they simply overflow their own queues without affecting anyone else. It also clearly provides fairer allocations even when all senders are behaving, since various flow-control algorithms may be operating simultaneously and they all might send at different rates.

I had two main concerns while reading this paper. First is the added complexity in routers, which would have to classify packets as belonging to a particular flow; this is non-trivial and would require considerably more powerful routers. Second is the (somewhat) arbitrary choice of source-destination pairs as the unit of flow. This can be gamed by opening connections to lots of receivers, or could unintentionally punish powerful machines that just happen to host many users connecting to the same service. Still, I don't have a better solution.