Adventures in spam & Spambayes

This is really a bit of a me-too article, but I thought it worth summarising a modest Python success story. My hosting provider offers IMAP access and allows me to set up my own cron and procmail configuration. I use Thunderbird on several (Windows) machines and very occasionally Squirrelmail or even mutt if that’s all the access I’ve got. I’ve advertised my mail at timgolden address pretty widely and I’m not at all surprised to be receiving a few hundred spams every day.

I suppose everyone has their own way of coping with spam. I’ve been using Spambayes for quite a while via a procmail filter, but the bsddb database kept corrupting during training (a known but unsolved issue, it seems). In the end I just left the hammie.db in its last known state, without retraining, and carried on as best I could, clearing out my Inbox every few days. Then all of a sudden I seemed to get onto someone’s list and the situation became unmanageable. So… back to Spambayes to see if I could find a solution.

Well, the result was three changes: a fresh install of Spambayes (from svn, fwiw) using a pickle database, since it seems to be less prone to corruption and the volumes I’m dealing with aren’t high; a slight reshuffling of my folders; and the use of Menno Smits’ recently rehoused imapclient lib. The whole process is as follows:

  • A cron job scans my mail folders every few hours and gathers from-addresses from known-to-be-good folders into a white list.
  • Another pair of cron jobs runs Spambayes’ sb_mboxtrain trainer on the to-ham and to-spam folders and then uses imapclient to remove the contents of those folders.
  • When mail comes in, it is whitelisted if it comes from a known-good address; if not, it is passed to Spambayes.
  • Spambayes will tag it as ham, spam or unsure.
  • A further procmail rule will drop it in the Inbox if it’s considered ham, into the Spam folder if it’s considered spam, or into Suspect.
  • I scan the Suspect folder periodically (manually) and classify messages: spam is moved to the to-spam folder; ham is copied to the to-ham folder and then moved to the Inbox or to some other folder.
  • Likewise, I move mail from the Inbox into one of the known-good folders so it will be whitelisted next time.
  • For the time being, I’m also scanning the Spam folder and fishing out the very occasional falsely-accused good email.
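The whitelist and routing steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not my actual setup (the real routing is done by procmail rules). The function names gather_whitelist and route are made up for the example, and I’m assuming the verdict arrives in an X-Spambayes-Classification header of the form "spam; 0.99", which is what Spambayes adds by default as far as I know.

```python
from email import message_from_string
from email.utils import getaddresses


def _senders(msg):
    """Return the lower-cased From addresses of a parsed message."""
    pairs = getaddresses(msg.get_all("From", []))
    return {addr.lower() for _name, addr in pairs if addr}


def gather_whitelist(raw_messages):
    """Build a whitelist from raw messages taken from known-good folders.

    Hypothetical helper: takes an iterable of message strings.
    """
    whitelist = set()
    for raw in raw_messages:
        whitelist |= _senders(message_from_string(raw))
    return whitelist


def route(raw, whitelist):
    """Pick a destination folder: Inbox, Spam or Suspect."""
    msg = message_from_string(raw)
    if _senders(msg) & whitelist:
        return "Inbox"  # known-good sender: skip the classifier entirely
    # Assumed header format: "<verdict>; <score>". Missing header => unsure.
    header = msg.get("X-Spambayes-Classification", "unsure")
    verdict = header.split(";")[0].strip().lower()
    return {"ham": "Inbox", "spam": "Spam"}.get(verdict, "Suspect")
```

Anything that is neither clearly ham nor clearly spam, including a message with no classification header at all, falls through to Suspect, which is the safe default since that folder gets checked by hand.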

The result is remarkable: Spambayes very quickly identifies ham/spam pretty much 100% correctly; I haven’t had any database corruptions so far (about a week now); and pretty soon I’ll ignore the Spam folder and drop anything Spambayes calls spam into /dev/null. It’s a little risky, but life is short and my experience is that Spambayes very rarely gets it wrong.

The use of the imapclient lib was new this time round (the rest of the process was only very slightly tweaked from its previous incarnation). Because the cron jobs now empty the training folders themselves, there’s less for me to check: I just copy/move the email to to-ham/to-spam and forget about it.
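For the curious, the folder-emptying step looks roughly like this. It’s a minimal sketch rather than my actual script: empty_folder is a name invented for the example, and the host and credentials in the usage comment are placeholders. The calls it relies on (select_folder, search, delete_messages, expunge) are the ones imapclient’s IMAPClient provides.

```python
def empty_folder(client, folder):
    """Delete every message in `folder` and return how many were removed.

    `client` is expected to be a logged-in imapclient.IMAPClient
    (or anything offering the same four methods).
    """
    client.select_folder(folder)
    uids = client.search("ALL")
    if uids:
        client.delete_messages(uids)  # sets the \Deleted flag
        client.expunge()              # actually removes flagged messages
    return len(uids)


# Usage sketch (placeholder host and credentials, run after training):
#
#   from imapclient import IMAPClient
#   with IMAPClient("imap.example.com", ssl=True) as client:
#       client.login("user", "password")
#       for folder in ("to-ham", "to-spam"):
#           empty_folder(client, folder)
```

The important bit is the ordering: the trainer has to have seen the messages before this runs, otherwise the classifications are simply thrown away.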

One small thing which came out of this was that I discovered I could have folders within folders on IMAP. I was sure I’d tried it previously and failed with some obscure error. This time, though, Thunderbird just told me that I could have either a folder-only folder or a mail-only folder, and created it quite happily. I rely heavily on the Nostalgy add-on for Thunderbird. It means I can have a full-width two-pane display without the folder tree and still move things easily from folder to folder.

In short, a couple of Python libs (Spambayes & imapclient) coupled with the ubiquitous procmail, and I’ve got a very functional spam filter in place.

Notes:

  • I did look at SPF, but somehow wasn’t sure if the DNS incantation I was using was correct and never took it further.
  • I’m not sure if greylisting is an option with this hosting service, although people report good results from it in general.