Tuesday, September 22, 2009

VL2: A Scalable and Flexible Data Center Network

Summary
VL2 is another approach to solving Ethernet's scaling problems. The authors set out three goals for the design of their new system:
  • Always achieve the max server-to-server capacity, limited only by the NICs on the sender and receiver
  • Don't allow one service to degrade the performance of another
  • Allow addresses to move to anywhere in the data center (i.e. a flat address space)

To achieve these goals, VL2 plays a trick similar to PortLand's: a server's address encodes its location. Location-specific Addresses (LAs) are IP addresses, so the system can run an IP-based link-state routing protocol to route on them efficiently. Application-specific Addresses (AAs) are what end hosts use; they are hidden from the switches, which see only LAs, until the last hop.

The key point is that servers believe they are all on the same IP subnet, so they can move anywhere in the data center and keep the same AA; the directory simply maps that AA to a different LA to keep routing correct and efficient.
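The indirection can be sketched as a toy model. (The dictionary-based directory and all names below are my own invention for illustration, not the paper's implementation.)

```python
# Toy model of VL2's AA -> LA indirection (names invented).
# The directory maps an application address (AA) to the locator
# address (LA) of the ToR switch the server currently sits behind.

directory = {}  # AA -> LA of the server's current ToR


def register(aa, tor_la):
    """A server (re)registers its AA when it boots or migrates."""
    directory[aa] = tor_la


def send(src_aa, dst_aa, payload):
    """Encapsulate an AA-addressed packet inside an LA-addressed one."""
    tor_la = directory[dst_aa]       # directory lookup replaces ARP flooding
    return {"outer_dst": tor_la,     # routed by the LA link-state fabric
            "inner_dst": dst_aa,     # ToR decapsulates on the last hop
            "payload": payload}


register("10.0.0.5", "la-tor-1")
assert send("10.0.0.9", "10.0.0.5", "hi")["outer_dst"] == "la-tor-1"

# After a migration, the AA stays fixed but the LA changes:
register("10.0.0.5", "la-tor-7")
assert send("10.0.0.9", "10.0.0.5", "hi")["outer_dst"] == "la-tor-7"
```

The point of the sketch is only that the AA is stable while the LA is a routing detail the directory can rewrite at any time.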

In order to minimize hotspots and ensure fairness, each flow takes a random path through the network (Valiant Load Balancing), with ECMP used to spread flows across the many equal-cost paths rather than to pick a single "best" one. As in SEATTLE and PortLand, ARP queries are intercepted and forwarded to a directory service to minimize flooding in the network.
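A rough sketch of how the two mechanisms combine, with invented switch names and a software hash standing in for the forwarding hardware:

```python
# Sketch (mine, not the paper's code) of VLB + ECMP flow spreading:
# each flow bounces through a randomly chosen intermediate switch, and
# ECMP hashes the flow's 5-tuple to pick among equal-cost next hops.
import hashlib
import random

INTERMEDIATE_SWITCHES = ["int-1", "int-2", "int-3", "int-4"]


def vlb_intermediate(flow_id):
    # Valiant Load Balancing: a random intermediate per flow, kept
    # stable for the flow's lifetime by seeding on the flow id.
    rng = random.Random(flow_id)
    return rng.choice(INTERMEDIATE_SWITCHES)


def ecmp_next_hop(flow_tuple, equal_cost_hops):
    # ECMP: hash the 5-tuple so every packet of one flow takes the
    # same path (no reordering) while different flows spread out.
    h = int(hashlib.md5(repr(flow_tuple).encode()).hexdigest(), 16)
    return equal_cost_hops[h % len(equal_cost_hops)]
```

Note the design choice this illustrates: randomness operates per flow, not per packet, so TCP never sees reordering even though the aggregate load is spread evenly.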

Comments
This paper takes a common industry tack: 'What do we have, what do we want, and how can we change as little as possible to get there?'. While this leads to very practical designs, one must ask whether these are really the 'best' solutions.

The evaluation was quite good in this paper (wow, real hardware). They show that VL2 performs well under load, and provides fair allocation to services in the face of other services changing their usage patterns.

A major concern I have is that they try to keep their directory servers strongly consistent. While this could prevent odd (and hard-to-debug) situations in which servers are out of sync, it means that a network partition could make the entire service un-writable, effectively preventing new nodes from entering the network. They hand-wave a bit about being able to sacrifice response time for reliability and availability (by timing out on one server and then writing to another), but this seems to conflict with their statement that the replicated state machine (the actual service that maintains the mappings) 'reliably replicates the update to every directory server'.
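The timeout-and-retry behavior they gesture at might look something like this toy model (the classes and names are mine, not theirs; a boolean stands in for a real network timeout):

```python
# Toy model of "sacrifice response time for availability": try directory
# servers in order, timing out on one and moving to the next replica.

class DirectoryServer:
    def __init__(self, name, reachable=True):
        self.name, self.reachable = name, reachable
        self.mappings = {}  # AA -> LA

    def write(self, aa, la):
        if not self.reachable:      # simulates a partition / timeout
            raise TimeoutError(self.name)
        self.mappings[aa] = la
        return self.name


def write_with_failover(servers, aa, la):
    for s in servers:
        try:
            return s.write(aa, la)  # first reachable replica accepts it
        except TimeoutError:
            continue                # timed out; try the next replica
    raise RuntimeError("all directory servers unreachable")
```

Note that this only relocates the problem: the replica that accepted the write must still propagate it to the others, which is exactly the step the "reliably replicates to every directory server" claim glosses over during a partition.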
