A method for sending spam that has become quite prevalent of late is to have a bot running on an end host sign up for a web-based email account (Hotmail, Gmail, etc.) and then use that account to send spam from the end host. The traditional CAPTCHA protection for new account signups is being subverted either by routing the CAPTCHA to some other 'sketchy' site (e.g. asking users to solve a CAPTCHA to download pirated software) or by simply paying people a trivial amount to solve them.
This paper looks at a way of identifying these bot accounts using the signup and access patterns gleaned from the Hotmail logs. The authors identify two main activities they consider suspicious:
- A sudden increase in the number of signups from a particular IP. The authors use an exponentially weighted moving average (EWMA) to detect these sudden increases and limit the sign-up rate, making it harder for spammers to obtain new accounts.
- The authors assume that there will be more accounts than machines available. Thus, if multiple accounts are accessed from the same IP, they stand a higher chance of being bot-created accounts. To reduce false positives (e.g. from machines changing IPs due to DHCP), they actually require the shared IPs to be from different ASes to count.
- Finally, they look at the number of messages sent by each account. While this isn't used in their detection, they use it as part of the evaluation. The observation is that human accounts don't often send more than 3 emails a day (although most people I know send far more than that), and so accounts that send more are candidates for bot accounts.
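To make the first signal concrete, here is a minimal sketch of EWMA-based spike detection on per-IP signup counts. It is not the authors' implementation: the function name `ewma_spikes` and the parameters `alpha`, `ratio`, and `min_count` (including their default values) are assumptions made up for illustration.

```python
def ewma_spikes(counts, alpha=0.3, ratio=3.0, min_count=10):
    """Return indices of periods whose signup count far exceeds the EWMA of past periods.

    counts    -- signups from one IP per period (e.g. per day), oldest first
    alpha     -- smoothing factor (assumed value, not taken from the paper)
    ratio     -- how many times the running average counts as a spike (assumed)
    min_count -- ignore tiny absolute counts to avoid noise (assumed)
    """
    spikes = []
    ewma = None
    for i, count in enumerate(counts):
        # Compare the new observation with history only, before updating the average.
        if ewma is not None and count >= min_count and count > ratio * ewma:
            spikes.append(i)
        # Standard EWMA update: new = alpha * observation + (1 - alpha) * old.
        ewma = count if ewma is None else alpha * count + (1 - alpha) * ewma
    return spikes


if __name__ == "__main__":
    daily_signups = [2, 3, 2, 3, 50, 4]
    print(ewma_spikes(daily_signups))
```

With these assumed parameters, the burst of 50 signups is flagged because it is well above three times the running average of the earlier days.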
The authors compare two methods for computing the above. The first partitions on client IP and looks for accounts logged into from that IP. This unfortunately requires sending lots of weight-one edges over the network, which is inefficient as they won't be considered. The second partitions on account names and looks for which IPs logged into that account. This allows some pruning of weight-one accounts, which makes it more efficient.
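The difference between the two partitioning schemes can be sketched on a single machine as follows. This is a toy illustration under stated assumptions, not the paper's distributed implementation: the login record layout `(account, ip, asn)`, the function names, and the `min_weight=2` threshold are all hypothetical, and an edge's weight is taken to be the number of distinct ASes in which two accounts shared an IP.

```python
from collections import defaultdict
from itertools import combinations


def edges_partition_by_ip(logins, min_weight=2):
    """Group logins by IP, emit a weight-one edge per account pair, then aggregate."""
    accounts_on_ip = defaultdict(set)
    for account, ip, asn in logins:
        accounts_on_ip[(ip, asn)].add(account)
    shared_ases = defaultdict(set)
    for (_ip, asn), accounts in accounts_on_ip.items():
        # Every pair on this IP produces an edge, even if it never reaches min_weight.
        for a, b in combinations(sorted(accounts), 2):
            shared_ases[(a, b)].add(asn)
    return {pair: len(ases) for pair, ases in shared_ases.items()
            if len(ases) >= min_weight}


def edges_partition_by_account(logins, min_weight=2):
    """Group logins by account, prune accounts that cannot reach min_weight, then pair up."""
    ips_of = defaultdict(set)
    for account, ip, asn in logins:
        ips_of[account].add((ip, asn))
    # An account seen in fewer than min_weight ASes can never have a heavy enough edge.
    kept = {acct: ips for acct, ips in ips_of.items()
            if len({asn for _ip, asn in ips}) >= min_weight}
    edges = {}
    # Quadratic all-pairs comparison: fine for a toy example, not for real logs.
    for a, b in combinations(sorted(kept), 2):
        weight = len({asn for _ip, asn in kept[a] & kept[b]})
        if weight >= min_weight:
            edges[(a, b)] = weight
    return edges


if __name__ == "__main__":
    logins = [
        ("bot1", "1.1.1.1", 100), ("bot2", "1.1.1.1", 100),
        ("bot1", "2.2.2.2", 200), ("bot2", "2.2.2.2", 200),
        ("alice", "1.1.1.1", 100), ("alice", "1.1.1.2", 100),
    ]
    print(edges_partition_by_ip(logins))
    print(edges_partition_by_account(logins))
```

Both functions return the same edges on this input. The IP-partitioned version first generates edges involving `alice` that are later thrown away, while the account-partitioned version drops her before any pairing happens, which is the kind of saving the second method is after.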
Comments
I thought this was an interesting approach, but I'm not sure how difficult it would be to subvert if spammers knew it was running. For example, it would be easy to throttle the number of account signups and slowly increase them to defeat the EWMA detection. Secondly, it seems fairly trivial to tie accounts to the AS that created them, thereby defeating the other line of defence. Obviously spammers don't do this at the moment as this system isn't deployed, but were it to be, I'm sure they would adapt quickly. Perhaps a system based on IP prefixes (to not bias against DHCP) would be more effective. There is also the problem of many computers being behind a single IP, which this analysis will miss.
Still, I thought this was an interesting approach, and a nice use of a distributed framework for graph processing (which is basically what MR was originally designed for). Perhaps more subtle patterns and metrics could be used to build a system that is more difficult to circumvent. I would keep this paper on the syllabus.