Tuesday, November 24, 2009

BotGraph: Large Scale Spamming Botnet Detection

Summary
A method for sending spam that has become quite prevalent of late is having a bot running on an end host sign up for a web-based email account (Hotmail, Gmail, etc.) and then using that account to send spam from the end host. The traditional CAPTCHA protection for new account signups is being subverted either by routing the CAPTCHA to some other 'sketchy' site (e.g. asking users to solve a CAPTCHA to download pirated software) or by simply paying people a trivial amount to solve them.

This paper looks at a way of identifying these bot accounts using the signup and access patterns gleaned from the Hotmail logs. They identify the following signals as suspicious (the third is used only in evaluation):

  1. A sudden increase in the number of signups from a particular IP. The authors use an exponentially-weighted moving average (EWMA) to detect these sudden increases and limit the sign-up rate, making it harder for spammers to obtain new accounts (a rough sketch of this kind of detector appears after this list).
  2. The authors assume that spammers have more accounts than machines available. Thus, accounts that are accessed from the same IPs stand a higher chance of being bot-created. To reduce false positives (e.g. from machines changing IPs due to DHCP), they require the shared IPs to be from different ASes to count.
  3. Finally, they look at the number of messages sent by each account. While this isn't used in their detection, they use it as part of the evaluation. The observation is that human accounts rarely send more than 3 emails a day (although most people I know send far more than that), so accounts that send more are candidates for bot accounts.
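
To make the EWMA idea from point 1 concrete, here is a minimal sketch of what such a detector could look like. This is my own illustration, not the paper's code; the smoothing factor and the 3x threshold are arbitrary choices.

```python
class SignupAnomalyDetector:
    """Flag a sudden jump in signups from one IP using an EWMA baseline.

    Illustrative sketch only; the smoothing factor and threshold
    multiplier are made-up values, not the paper's parameters.
    """

    def __init__(self, alpha=0.125, threshold=3.0):
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # flag if count exceeds threshold * baseline
        self.baseline = {}          # ip -> smoothed signups per interval

    def observe(self, ip, signups_this_interval):
        avg = self.baseline.get(ip, signups_this_interval)
        suspicious = signups_this_interval > self.threshold * max(avg, 1.0)
        # Update the moving average: new = alpha * sample + (1 - alpha) * old
        self.baseline[ip] = self.alpha * signups_this_interval + (1 - self.alpha) * avg
        return suspicious


detector = SignupAnomalyDetector()
print(detector.observe("10.0.0.1", 2))    # warms up the baseline -> False
print(detector.observe("10.0.0.1", 50))   # sudden spike -> True
```
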
The paper focuses on finding accounts that are accessed from IPs on different ASes. This is a huge computation, so they resort to distributed processing a la MapReduce. The idea is that accounts accessed with bot patterns will form large connected components in the user-user graph (one vertex per account). Edge weights between accounts are assigned based on the number of ASes from which both accounts were accessed. They start by looking at connected components built from edges of weight at least 2, then keep increasing the threshold by 1 until no component has more than 100 members (sketched below). Large connected components are considered likely to be groups of bot accounts.
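
The threshold-raising loop itself is simple enough to sketch. The toy version below assumes the user-user edge weights (shared-AS counts) have already been computed; the 100-member cutoff comes from the summary above, but the real algorithm runs distributed and recursively re-examines the large components it peels off.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components (sets of nodes) of an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components


def candidate_groups(edge_weights, max_size=100):
    """Raise the edge-weight threshold until no component exceeds max_size.

    edge_weights: dict mapping (user_a, user_b) -> number of shared login ASes.
    Returns the final threshold and the components found at that threshold.
    """
    threshold = 2
    while True:
        kept = [pair for pair, w in edge_weights.items() if w >= threshold]
        comps = connected_components(kept)
        if all(len(c) <= max_size for c in comps):
            return threshold, comps
        threshold += 1
```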

The authors compare two methods for computing the above. The first partitions on client IP and looks for the accounts logged into from that IP. Unfortunately this requires sending lots of weight-one edges over the network, which is wasteful since those edges are later discarded anyway. The second partitions on account name and looks for the IPs that logged into each account. This allows weight-one edges to be pruned before they are shuffled, which makes it more efficient (a single-machine illustration is sketched below).
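
As a rough, single-machine illustration of what both methods ultimately compute, the snippet below derives the shared-AS edge weights and drops weight-one edges. The names are mine, and it says nothing about how the work is actually partitioned across machines, which is the point of the comparison above.

```python
from collections import defaultdict
from itertools import combinations


def shared_as_edges(login_records):
    """Compute user-user edge weights from (account, asn) login events.

    Returns {(acct_a, acct_b): number of shared ASes}, keeping only edges of
    weight >= 2; dropping weight-one edges early is the saving that makes the
    account-partitioned method cheaper in the distributed setting.
    """
    # Group step: the set of ASes each account logged in from.
    ases_per_account = defaultdict(set)
    for account, asn in login_records:
        ases_per_account[account].add(asn)

    # Invert to AS -> accounts, then count shared ASes per account pair.
    accounts_per_as = defaultdict(set)
    for account, ases in ases_per_account.items():
        for asn in ases:
            accounts_per_as[asn].add(account)

    weights = defaultdict(int)
    for accounts in accounts_per_as.values():
        for a, b in combinations(sorted(accounts), 2):
            weights[(a, b)] += 1

    # Prune weight-one edges before they would ever cross the network.
    return {pair: w for pair, w in weights.items() if w >= 2}
```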

Comments
I thought this was an interesting approach, but I'm not sure how difficult it would be to subvert if spammers knew it was running. For example, it would be easy to throttle the number of account signups and increase them slowly, defeating the EWMA detection. Secondly, it seems fairly trivial to tie each account to the AS it was created from and only access it from there, thereby defeating the other line of defence. Obviously spammers don't do this at the moment since the system isn't deployed, but were it to be, I'm sure they would adapt quickly. Perhaps a system based on IP prefixes (to avoid biasing against DHCP) would be more effective. There is also the problem of many computers sitting behind a single IP, which this analysis will miss.

Still, I thought this was an interesting approach, and a nice use of a distributed framework for graph processing (which is basically what MR was originally designed for). Perhaps more subtle patterns and metrics could be used to build a system that is harder to circumvent. I would keep this paper on the syllabus.

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

Summary
This paper tackles the problem of separating user-generated activity (emails, web requests, and clicks) from the same activities performed by a bot infesting the same computer. The authors leverage the Trusted Platform Module (TPM) to start verified and trusted code that performs the necessary operations on end hosts for their system to work.

The basic idea of the system is that actual human users can't generate all that many events over time. For example, it is very unlikely that a human will want to send more than one email per second (although emails with huge CC lists seem like a problem for the system). With this observation, the Not-A-Bot system has software request an attestation (a token signed with a private key protected by the TPM) when it wants to perform an action such as sending an email. The attester checks whether an action from an input device (keyboard or mouse) has happened within a threshold number of seconds (1 in the paper), and if so, grants the attestation, which is included in a header of the request. A verifier on the receiving end can use this attestation to be more confident that the message or request came from a human and not a bot. A rough sketch of the attester logic follows.
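
In spirit, the attester boils down to something like the sketch below. This is a simplification of my own: it uses an HMAC with a stand-in key in place of the TPM-protected signing key, and a fake input-event hook; the real attester runs isolated under the hypervisor.

```python
import hashlib
import hmac
import time

ATTEST_WINDOW_SECONDS = 1.0          # the one-second threshold described above
SIGNING_KEY = b"tpm-protected-key"   # stand-in; the real key never leaves the attester

_last_input = 0.0                    # toy hook: updated whenever an input event arrives


def record_input_event():
    """Called (in this toy model) on every keyboard or mouse event."""
    global _last_input
    _last_input = time.time()


def request_attestation(message: bytes):
    """Return a signed attestation if human input occurred recently, else None."""
    if time.time() - _last_input > ATTEST_WINDOW_SECONDS:
        return None                  # no recent human activity: refuse to attest
    # Bind the attestation to the message content so it cannot be reused elsewhere.
    digest = hashlib.sha256(message).digest()
    return hmac.new(SIGNING_KEY, digest, hashlib.sha256).digest()


record_input_event()                               # simulate a key press...
print(request_attestation(b"Hi Bob") is not None)  # ...so this prints True
```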

For added security, the content of the message is bound into the attestation, so an attestation cannot be reused by a bot to send multiple different messages. Further, even if bots could simulate input actions, they could still only send at the reduced threshold rate. The verifier side is equally simple, as sketched below.
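
The corresponding verifier check, under the same stand-in key assumption (the real system would verify against the attester's public key and its TPM certificate), is just:

```python
import hashlib
import hmac

SIGNING_KEY = b"tpm-protected-key"   # same stand-in key as in the attester sketch


def verify_attestation(message: bytes, attestation: bytes) -> bool:
    """Accept only if the attestation was computed over this exact message."""
    expected = hmac.new(SIGNING_KEY, hashlib.sha256(message).digest(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, attestation)
```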

The authors implement NAB using the Xen hypervisor to prevent the untrusted OS from accessing the attester's address space. They strip down Xen to reduce the size of the codebase that needs to be trusted. The paper also covers (in somewhat laborious detail) the use of the TPM to verify the integrity of the attester code and to obtain the private key.

To evaluate their system, the authors collected activity data from a number of users at Intel and overlaid honeypot-collected spam data. They show that NAB can suppress 92% of spam email, 89% of bot-generated DDoS requests, and 87% of fraudulent click-throughs. This seems to assume verifiers running everywhere.

Comments
I thought this paper took an interesting approach to solving the bot problem, but in the end I remain somewhat unconvinced of their methodology.

Firstly, even if bots could only send at (say) 1 email per second, there would still be a huge amount of bot-generated email flying around, given the sheer number of computers infected by bots.

Secondly, incremental deployment of this system seems difficult. If only a few end hosts are running NAB, then all traffic needs to be treated as 'normal' anyway, since biasing against unsigned requests would have very high false positive rates. And if few servers are running verifiers, there is little gain to be had from running NAB on an end host and paying the extra cost.

Finally, the whole architecture seems a bit hard to swallow. It seems a bit much to ask users to run a hypervisor under their OS for this purpose, and without the hypervisor the system is open to attack. The attester's integrity is verified at launch, before the host OS boots, but if the host OS can get at the attester's address space, the attester could easily be replaced with something that always grants attestations, making the system useless.

This wasn't a bad paper, but given that the system seems a bit weak, and the lack of a serious networking component to the paper, I would probably remove this paper.

Thursday, November 19, 2009

Cutting the Electric Bill for Internet-Scale Systems

This was an interesting paper, and a bit of a shift from the types of papers we have read so far in the class. There are two main motivators for the paper:
  1. Internet-scale systems are already widely replicated. This means major providers (the Googles and Amazons and so on) already have data centers in varied geographic locations and can route traffic for a particular application to any of a number of those data centers. In short, a request can be serviced (almost) equally well by any of a number of data centers.
  2. Power does not cost the same amount in disparate geographic locations. The paper presents a lot of data on this front and determines that there is a large hour-by-hour pricing differential between energy markets.
Given the above motivators, it makes sense for an Internet-scale system to route requests to data centers where power is currently cheapest (see the sketch below). This of course assumes that keeping data centers at low load actually saves power. The paper calls this energy elasticity, and the authors report that it is about 60% at the moment (that is, systems draw about 60% of their loaded power when they are idle). This is not a great number, but it is an active area of research and one can probably safely assume it will only get better. Systems that can actually turn off machines under low load could help achieve perfect elasticity (0% idle power).
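
The core routing decision can be sketched very simply: among the data centers that still meet a latency bound, pick the one with the cheapest current power. The names and numbers below are made up, and the paper's actual evaluation accounts for bandwidth costs and the 95/5 constraint discussed next.

```python
def pick_data_center(prices_per_mwh, client_latency_ms, max_latency_ms=100):
    """Choose the cheapest-power data center that still satisfies the latency bound.

    prices_per_mwh: dict name -> current electricity price ($/MWh)
    client_latency_ms: dict name -> measured latency from this client (ms)
    Illustrative only; real request routing has many more constraints.
    """
    eligible = [dc for dc in prices_per_mwh if client_latency_ms[dc] <= max_latency_ms]
    if not eligible:
        # Fall back to the lowest-latency site if nothing meets the bound.
        return min(client_latency_ms, key=client_latency_ms.get)
    return min(eligible, key=lambda dc: prices_per_mwh[dc])


prices = {"virginia": 41.0, "texas": 28.5, "california": 55.0}   # $/MWh, made up
latencies = {"virginia": 35, "texas": 60, "california": 90}      # ms, made up
print(pick_data_center(prices, latencies))   # -> "texas"
```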

The authors use numbers from Akamai to run simulations and show that significant savings are possible. One roadblock is the 95/5 billing model, where traffic is divided into 5-minute intervals and the 95th-percentile interval is used for billing. This limits how much work the system can actually shift away. By relaxing this constraint the authors do quite a bit better.
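
For reference, here is a toy version of the 95/5 computation: sample traffic in 5-minute intervals over the month and bill for the 95th-percentile sample, so load pushed into the top 5% of intervals is effectively free. The numbers below are invented.

```python
def ninety_fifth_percentile(five_minute_samples_mbps):
    """Return the 95th-percentile sample, the value 95/5 billing charges for."""
    ordered = sorted(five_minute_samples_mbps)
    index = int(0.95 * len(ordered))   # index of the 95th-percentile sample (conventions vary)
    return ordered[min(index, len(ordered) - 1)]


# A month of 5-minute samples: mostly ~200 Mbps with spikes to 900 in under 5% of intervals.
samples = [200] * 8300 + [900] * 340
print(ninety_fifth_percentile(samples))   # spikes confined to the top 5% don't raise the bill: 200
```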

I thought this was an interesting paper and one I would keep on the syllabus.

Scalable Application Layer Multicast

The (at least currently correct) assumption in this paper is that we are not going to have in-network multicast anytime soon. Given this, the authors develop an application-level multicast scheme that tries to get as close as possible to 'ideal' in-network performance.

The approach this paper takes is to build an overlay network out of participating applications and route multicast traffic using this overlay. The protocol they develop for this purpose is called NICE.

In NICE, nodes are arranged in layers. Each layer is grouped into clusters of nodes that are 'near' each other (distance defined by network distance and latency), and each cluster is assigned a 'leader', the graph-theoretic center of the nodes in the cluster. This leader also participates in the next layer up, alongside the leaders of the other clusters, and a leader of that higher-layer cluster is picked in the same manner. In this recursive fashion a tree is built which can be used to forward packets. To multicast a packet, the leader of a cluster simply forwards it to everyone else in the cluster. Any of those nodes that lead clusters below them take the same action in those clusters, until everyone has the packet (see the sketch below).
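
The forwarding rule is easy to express; the toy sketch below models just the hierarchy and the "deliver within my cluster, then recurse downward" behavior. Cluster formation, leader election, and maintenance, which are the hard parts of NICE, are not shown, and the class layout is my own.

```python
class Cluster:
    """One cluster in the NICE hierarchy (toy model).

    leader: node id of the graph-theoretic center of the cluster.
    members: node ids in this cluster (including the leader).
    children: for each member that also leads a lower-layer cluster,
              the Cluster it leads.
    """

    def __init__(self, leader, members, children=None):
        self.leader = leader
        self.members = members
        self.children = children or {}


def multicast(cluster, packet, delivered):
    """Leader forwards to its cluster; leaders of lower clusters recurse."""
    for member in cluster.members:
        delivered.add(member)                    # leader -> member delivery
        child = cluster.children.get(member)
        if child is not None:
            multicast(child, packet, delivered)  # member leads a cluster one layer down


# Two-layer example: A leads the top cluster {A, B}; B leads {B, C, D} below it.
bottom = Cluster(leader="B", members=["B", "C", "D"])
top = Cluster(leader="A", members=["A", "B"], children={"B": bottom})
got = set()
multicast(top, b"hello", got)
print(sorted(got))   # ['A', 'B', 'C', 'D']
```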

The authors evaluate the stress (how many extra copies of a packet cross a link compared to an in-network scheme) and stretch (how much longer a packet's path is compared to in-network multicast), and also compare themselves to Narada, another application-level multicast scheme. They come out ahead of Narada, but do impose significant overhead versus in-network schemes. The stress of roughly 1.6 in steady state would be a non-trivial performance hit. Even worse, path lengths are much longer in NICE than in IP multicast (about 23 hops vs. 15), which means many more chances to hit bottleneck links in the wide area.

I was also concerned about the join protocol. When a new node joins, it finds a cluster close to it to join. If it is the center of that cluster, it becomes the new leader. In the face of churn, many new leaders could generate a large amount of control traffic and an unstable tree. There was not really any evaluation of the robustness of NICE in the face of churn.

Wednesday, November 18, 2009

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

The reliable multicast framework paper is interesting in its development of the model. The authors begin with an application (wb, a multi-user whiteboard) and from there attempt to factor out the traits they deem common to many applications that need multicast. Their overall approach is to keep as much functionality as they can in the client (as dictated by the end-to-end argument), doing only best-effort delivery and requiring the clients to implement reliability and packet ordering.

The key contribution of this paper seems to be their mechanism for retransmitting lost packets. These losses are detected by the clients, and the idea is to minimize the amount of extra data that needs to be sent. A client detects a lost packet when it receives a packet with a higher sequence number than expected. When this happens, a timer is set based on the distance of the client from the source (further away = longer timer). When this timer expires, a request for retransmission is sent. If the repair is received while waiting, the timer is canceled. In this way it is expected that the first host after a lossy link will request retransmission and will get a prompt reply from the host just upstream of the loss. A small amount of randomness is added to the timer so that when two hosts equidistant from the source both miss a packet, only one request is (hopefully) sent: the randomness lets one host request retransmission first, and since requests and repairs are multicast, the other host will see them and cancel its own timer (sketched below).
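
The timer logic can be sketched as follows. This is my own simplified rendering of the scheme described above, with arbitrary constants rather than the paper's parameter choices.

```python
import random


def request_timer_delay(distance_to_source, c1=2.0, c2=2.0):
    """Pick a request-timer delay proportional to distance, plus randomness.

    Returns a delay drawn uniformly from [c1*d, (c1+c2)*d]: closer receivers
    tend to fire first, and the random spread keeps equidistant receivers
    from all requesting at once. c1 and c2 are arbitrary here.
    """
    d = distance_to_source
    return random.uniform(c1 * d, (c1 + c2) * d)


class Receiver:
    def __init__(self, distance_to_source):
        self.distance = distance_to_source
        self.pending = {}   # seq -> timer expiry time

    def notice_loss(self, seq, now):
        # Gap in sequence numbers detected: schedule a retransmission request.
        self.pending[seq] = now + request_timer_delay(self.distance)

    def hear_request_or_repair(self, seq):
        # Someone else asked first (or the repair arrived): cancel our timer.
        self.pending.pop(seq, None)
```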

This means hosts need to cache old data so they can respond to retransmission requests. The amount of state needed to make this all work is not completely clear, nor is the issue addressed in the paper. Still, it's an interesting approach, and the authors run a number of simulations showing that if all state is kept, their protocol performs well.

Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework

Summary
The X-Trace paper presents a system for tracing causal paths through multi-host network architectures to help with debugging distributed applications. The authors identify the problem that modern systems are often made up of many heterogeneous hosts and that a request must traverse a number of these hosts in its lifetime. Finding out why a given request is not performing as expected requires investigating all the hosts along that path, a difficult and time-consuming task.

X-Trace proposes to deal with this problem by adding a small amount of metadata in-band with each request. Each participating host must be modified to propagate and extend this metadata: for example, an HTTP proxy adds its own metadata, then pushes the request to the appropriate host, which adds its own data, and so on. The recursive nature of this propagation assumes no particular architecture a priori. Once the request is finished, the collected metadata can be examined and used to build the resulting task tree. This tree can be used to discover how much time is spent at each component and/or to identify failures along the path at a particular component.
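
A stripped-down model of the metadata propagation might look like the following. The field names and the report format are my own stand-ins, not the actual X-Trace wire format or API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Metadata:
    task_id: str       # identifies the whole request (one task tree)
    parent_op: str     # the operation that caused this one
    op_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])


REPORTS = []   # stand-in for the report collection that builds the task tree


def log_event(component, md, note):
    REPORTS.append((md.task_id, md.parent_op, md.op_id, component, note))


def handle_at(component, incoming, note):
    """Each component logs an event and propagates metadata to the next hop."""
    md = Metadata(task_id=incoming.task_id, parent_op=incoming.op_id)
    log_event(component, md, note)
    return md   # carried in-band with the outgoing request


# A request flowing proxy -> web server -> database:
root = Metadata(task_id=uuid.uuid4().hex[:8], parent_op="client")
md = handle_at("http-proxy", root, "forwarded request")
md = handle_at("web-server", md, "ran PHP script")
md = handle_at("database", md, "executed query")
# REPORTS can now be stitched into the task tree via the (parent_op -> op_id) links.
```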

The paper then evaluates X-Trace in three example situations:
  1. A web request with recursive DNS queries. Using X-Trace the authors were able to follow the recursive DNS query and isolate faults along the way. It also let them see which cache a result came from, and thus why a recently updated address might still appear out of date.
  2. A web hosting provider. Using X-Trace allowed for determining that a request was failing, say, in a PHP script, not in a DB query.
  3. An overlay network. Probably the most complex scenario: X-Trace allows distinguishing between a number of possible faults in Chord, including the receiving node failing or a middle-box failure (either a particular process crashing or the whole host going down).
Comments
I like this paper a lot. It gives a clear outline of the problem and proposes a useful and practical solution. I've used X-Trace in a couple of projects and it most certainly helps the debugging of a distributed application. In one case we were confused about why a request was taking so long. X-Trace allowed us to pinpoint the location in some client code that was doing a slow type conversion. There were at least 4 other systems involved in the request and each one would have to have been examined without the X-Trace data.

There is some concern that requiring modification to clients will hurt adoption. While this might be true, it's hard to see how something like X-Trace could work without such modification. I also think inter-node communication libraries like Thrift might help the situation. We've added X-Trace support to the Thrift protocol, making the actual client modifications a trivial one-line change.

I would keep this paper on the syllabus.

NetFPGA: A Tool for Network Research and Education

Summary
Falling somewhere between pure software solutions like Click and completely custom hardware, NetFPGA gives students and researchers a programmable hardware platform for building network systems. This lets users actually interact with the hardware and deal with any associated issues, without having to build the hardware itself.

Users program the NetFPGA remotely by uploading designs, which they can then send data to and receive data from. The devices can also be linked into the Stanford network so they can carry real traffic. Using this platform, students have built IP routers, making it clearly a good learning tool. To enable more complex research, the new version has an on-board CPU (the previous version did not, and was controlled by sending special packets from a remote controller), which is being used, for example, to build a 'shunt' that enables low-cost intrusion detection by offloading to the detector only those packets that really need to be examined.

Comments
There doesn't seem to be a whole lot of downside to this project. I do wonder how the performance compares to a custom hardware router and to Click, but this doesn't seem like a huge issue for a research platform.