JerryKindall.com: Once Upon a Time on the Web




© 2001-2010 Jerry Kindall


Wednesday 11/15/06

Back in 2002, Jeremy Bowers wrote an article asserting that statistical filters for spam were our last line of defense, that they were doomed to eventually fail, and that once they did we would all be buried under an avalanche of unwanted mail. I responded with this post and he responded to me and others with this post.

Four years later, statistical filtering remains a valuable weapon in the war on spam. At my former day job, I turned off the automatic server-side filtering (based on SpamAssassin) and used Thunderbird's statistical filter because it just worked better.

Statistical filtering has turned out not to be the "last stand" for spam filtering, either. Graylisting is one relatively new approach that has become much more prevalent since 2002. (I see dates of 2003-ish on graylisting articles.)

As Bowers correctly pointed out, it's an arms race. But it's not by any means one that spam filters are doomed to lose. Not anymore. What he missed is that statistical filters mean that, for the first time, the arms race is easier for anti-spammers than for spammers: once you've done the hard work of implementing a statistical filter, it becomes trivial to increase the number of message characteristics you test.

For example, determining whether a message is very short or whether the subject line contains all caps is a one-liner in most scripting languages. Before statistical filtering, the hard part wasn't writing the rule, but deciding how much weight to give it. With a statistical filter, though, you just add a token for each characteristic you test to the message before it's passed to the statistical classifier (tokens like "MESSAGE_VERY_SHORT" or "SUBJECT_LINE_ALL_CAPS") and let it decide how important each characteristic is. If you're a competent programmer, you can implement dozens, perhaps hundreds, of different message checks in a weekend, based on every characteristic of a message you can think of. You don't even have to worry about whether they are useful, because a modern computer can handle them all without breaking a sweat. In fact, it's likely most of them won't be useful today -- but they may be useful later, as spammers' tactics evolve. In any case, each new rule gives spammers incrementally less wiggle room in crafting their crapulent enticements.
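To make that concrete, here is a rough Python sketch of the idea, not code from any real filter. The token names come from the examples above; everything else (the function names, the 10-word cutoff for "very short," and the toy `TinyBayes` classifier with add-one smoothing that ignores class priors) is an assumption made purely for illustration.

```python
import math
import re
from collections import Counter


def characteristic_tokens(subject: str, body: str) -> list[str]:
    """Return one synthetic token per message characteristic that applies."""
    tokens = []
    # Assumption: "very short" means fewer than 10 words in the body.
    if len(body.split()) < 10:
        tokens.append("MESSAGE_VERY_SHORT")
    letters = [c for c in subject if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        tokens.append("SUBJECT_LINE_ALL_CAPS")
    return tokens


def tokenize(subject: str, body: str) -> list[str]:
    """Ordinary words are lowercased, so they never collide with the
    uppercase synthetic tokens appended at the end."""
    words = re.findall(r"[a-z0-9']+", (subject + " " + body).lower())
    return words + characteristic_tokens(subject, body)


class TinyBayes:
    """Toy naive Bayes classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, label: str, tokens: list[str]) -> None:
        # Count each token at most once per message.
        self.counts[label].update(set(tokens))
        self.messages[label] += 1

    def token_weight(self, token: str) -> float:
        """Log-odds that a token indicates spam; positive means spammy."""
        p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
        p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
        return math.log(p_spam / p_ham)

    def is_spam(self, tokens: list[str]) -> bool:
        # Sum the per-token evidence; class priors are ignored in this sketch.
        return sum(self.token_weight(t) for t in set(tokens)) > 0
```

Nobody assigns a weight to either check by hand. After training on `tokenize(subject, body)` for each known spam and non-spam message, `token_weight("SUBJECT_LINE_ALL_CAPS")` simply reflects how often that token showed up in each pile, and adding another check means adding one more `if` to `characteristic_tokens`.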

These days I get 1 or 2 spam messages a day on my main mail account using a combination of spamtraps, site-specific addresses, blacklisting, prompt delays, a non-existent secondary MX, and some server-side filters that reject bogus bounces and obvious spam. My main public-facing addresses, which all funnel into that account, have pretty strict rules on message size and MIME type. Most blocked messages are not rejected outright; every potentially legitimate rejected sender gets a message telling them how to get on my whitelist and get their message through. (Much to my surprise, after a couple years of this, I still have no spammers on my whitelist.) The few spams that do get through, mostly from other mail accounts I don't use much, are nailed easily by SpamSieve and a handful of Entourage rules.

I really don't have to take quite such an active role -- I just find it convenient to run my own mail server, and I like tinkering. My mom uses a Gmail account, and its spam filtering is quite successful as well -- nearly as good as mine, with none of the work. So while the war against spam rages on, I think the good guys are largely winning.

Re: The spam war

There are 2 messages in this thread, displayed in the order they were posted.

Jeremy Bowers 11/16/2006 8:08:55 PM Pacific

"Manual trackback".

I use Thunderbird client filtering and nothing else. It's not quite perfect, but it's close enough that it's not worth fussing with.

Jerry Kindall 11/18/2006 2:34:53 AM Pacific
Yeah, Thunderbird is really quite good; I was impressed when I used it at work.
