Adventures in spam & Spambayes

This is really a bit of a me-too article, but I thought it worth summarising a modest Python success story. My hosting provider offers IMAP access and allows me to set up my own cron and procmail configuration. I use Thunderbird on several (Windows) machines and very occasionally Squirrelmail or even mutt if that’s all the access I’ve got. I’ve advertised my mail at timgolden address pretty widely and I’m not at all surprised to be receiving a few hundred spams every day.

I suppose everyone has their way of coping with spam and I’ve been using Spambayes for quite a while via a procmail filter, but the bsddb database kept corrupting during training (a known but unsolved issue, it seems) and in the end I just left the hammie.db in the last known state, without retraining, and carried on as best I could, clearing out my Inbox every few days. Then all of a sudden I seemed to get onto someone’s list and the situation became unmanageable. So… back to Spambayes to see if I couldn’t find a solution.

Well, the result was a fresh install of Spambayes (from svn, fwiw), specifying a pickle database since it seems to be less prone to corruption and the volumes I’m dealing with aren’t high, a slight reshuffling of my folders, and the use of Menno Smits’ recently rehoused imapclient lib. The whole process is as follows:

  • A cron job scans my mail folders every few hours and gathers from-addresses from known-to-be-good folders into a white list.
  • Another pair of cron jobs runs Spambayes’ sb_mboxtrain trainer on the to-ham and to-spam folders and then uses imapclient to remove the contents of those folders.
  • When mail comes in, it is whitelisted if it comes from a known-good address; if not, it is passed to Spambayes.
  • Spambayes will tag it as ham, spam or unsure.
  • A further procmail rule will drop it in the Inbox if it’s considered ham, into the Spam folder if it’s considered spam, or into Suspect otherwise.
  • I scan the Suspect folder periodically (manually) and classify messages by moving them to the to-spam folder, or by copying them to the to-ham folder and then moving them to the Inbox or to some other folder.
  • Likewise, I move mail from the Inbox into one of the known-good folders so it will be whitelisted next time.
  • For the time being, I’m also scanning the Spam folder and fishing out the very occasional falsely-accused good email.
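The whitelist-gathering step above can be sketched in a few lines of Python. This is a hypothetical helper, not my actual cron script: the business of fetching the messages over IMAP is elided, and the function just takes raw message text and pulls out the From addresses.

```python
import email
from email.utils import parseaddr

def gather_whitelist(raw_messages):
    # Collect the sender address from each known-to-be-good message.
    # raw_messages would really come from the IMAP folders; here they
    # are just strings, for illustration.
    whitelist = set()
    for raw in raw_messages:
        msg = email.message_from_string(raw)
        _, addr = parseaddr(msg.get("From", ""))
        if addr:
            whitelist.add(addr.lower())
    return whitelist
```

The real job then only needs to write the resulting set out to the whitelist file which procmail consults.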

The result is remarkable: Spambayes very quickly identifies ham/spam pretty much 100% correctly; I haven’t had any database corruptions so far (about a week now); and I’ll pretty soon ignore the Spam folder and drop anything spambayes calls spam into /dev/null. It’s a little risky, but life is short and my experience is that Spambayes very rarely gets it wrong.

The use of the imapclient lib was new this time round (the rest of the process was only very slightly tweaked from its previous incarnation), and it means less for me to do by hand: just copy/move the email to to-ham/to-spam and forget about it.

One small thing which came out of this was that I discovered I could have folders within folders on IMAP. I was sure I’d tried it previously and failed with some obscure error. This time, though, Thunderbird just told me that I could have either a folder-only folder or a mail-only folder, and created it quite happily. I rely heavily on the Nostalgy add-in to Thunderbird: it means I can have a full-width two-pane display without the folder tree and still move things easily from folder to folder.

In short: a couple of Python libs (Spambayes & imapclient) coupled with the ubiquitous procmail, and I’ve got a very functional spam filter in place.

Notes:

  • I did look at SPF, but somehow wasn’t sure if the DNS incantation I was using was correct and never took it further.
  • Not sure if greylisting is an option with this hosting service, although people report good results from it in general.

2nd London Python Dojo & TDD

The 2nd London Python Dojo took place last night, space & food again courtesy of Fry-IT. The format was pretty much the same, with the difference that the task was more of a program-y one and less of an API-y one. Which had the result that the audience was far more engaged (read: lots of opinionated backseat drivers) than at the previous session. It was still fun, and the proposal that we essentially carry on with the same problem domain (a noughts & crosses game) next time was fairly well received.

What interested me a little more was the differences of approach among the developers present, both those up-front and those in the cheap seats. As I touched on last week, a Test-Driven Development technique was assumed (at least by the organisers). Now, as far as I can tell, while this is a perfectly valid approach to development, it isn’t of the essence of Dojo — ie you don’t need to do TDD for a Dojo to work. The point of a Dojo is rather to code and learn in front of others. Neither does it need to involve pair programming per se.

Now my point is not that I disagree with these techniques, altho’ I’m happily not using them myself in my everyday life, but rather that a certain amount of the “suggestions” from the body of the audience was centred on their use. One or two of the coders were clearly not accustomed to working that way, or even aware, perhaps, that you could, and my own feeling is that this should be perfectly permissible. I’m not saying that anyone was booed off stage for launching in without a test, but there were several strong voices of encouragement in the crowd pointing out that a failing test had not been written (or any test, for that matter), as though True Development were impossible without one!

FWIW, my view on Test-Driven Development is rather like my view on Object-Oriented Development: that it’s an arrow one should certainly have in one’s quiver but that it isn’t always applicable. I realise that the comparison is not the most apt, but go with it for now. I appreciate that the people who were coding were not necessarily in their element and that I may not have been seeing TDD at its best, but there were not a few moments when I felt that a test was being written simply because it should be, according to the Mantra, without any thought to the program, design, goals, structures etc. At one point it was suggested that a particular function should return a string rather than print it to the screen as it would be easier to form a test. Now my view is that if a function needs to print a string then it needs to print a string. The *test* shouldn’t be driving the needs of your program: the requirements should be doing that. (In that case, it could well have been a pragmatic choice since the alternative would presumably have been to construct a mocked sys.stdout but still…).

As I say, I’m sure I wasn’t seeing TDD at its best and brightest. I would genuinely welcome a Masterclass Dojo (or whatever they’re called) where someone walks through a test-driven development to show how it might be done. As it was, I felt that the need to invent a test for something before you did anything about it left you seeing only the trees and failing to get a grasp of the wider wood. My 2.5d-worth.

Getting WMI to work with Python 3.x

Well, that was easier than I thought…

Someone emailed me recently to say that he was new to Python but wanted to use WMI to query a bunch of machines in a University Computer Lab. He’d downloaded the module from my website but when he tried to import it, he got a traceback… I was rather surprised: I don’t claim my code’s perfect, but the current release has been out in the field for a while and I’d have been surprised if something so serious hadn’t been picked up before. Anyway, as you probably guessed, he was using Python 3.1, having gone to python.org and downloaded the latest version. I simply advised him to go back to 2.6, where I knew there was no problem.

But of course, that got me thinking about porting to 3.x. I’ve more-or-less followed the progress of 2to3 and various people’s attempts at porting code to 3.x, and in particular Ned Batchelder’s article a few days ago got me thinking. And coding. And the result is wmi 1.4.2 (and counting), which not only runs on all Python versions from 2.4 to 3.1 but actually has a test suite to prove it. Plus a little web engine for browsing a machine’s WMI structure.

I’m in the process of porting the documentation over to Sphinx, but there’s a pre-release version (sans extra docs) on my website or you can follow the latest and greatest on Subversion.

Reverse sorting on arbitrary key segments

I have a database-y requirement to sort a list by a series of keys which are user-defined. Each of the keys could be a forward or a reverse sort.

In essence I have a list of dictionaries:

rows = [
  dict (name="John", age=30, id=123),
  dict (name="John", age=40, id=456),
  dict (name="Fred", age=20, id=567),
]

and a set of keys from the user:

keys = [
  ("name", "asc"),
  ("age", "desc")
]

and I want to call sorted with a key function to end up with:

[
  dict (name="Fred", age=20, id=567),
  dict (name="John", age=40, id=456),
  dict (name="John", age=30, id=123),
]

Obviously, I can’t just use reverse=True since that would reverse all the keys or none. The main sorting page on the wiki rather oddly suggests that you sort against the first column and then against the second.

Which would seem to leave you with (ie age order descending):

[
  dict (name="John", age=40, id=456),
  dict (name="John", age=30, id=123),
  dict (name="Fred", age=20, id=567),
]

[Update: As several people have pointed out, the page actually suggests sorting against the second column reversed and then against the first, relying on the fact that Python’s sort is stable. But meanwhile, granted my misunderstood premise…]
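Granted that correction, the wiki’s multi-pass approach does handle my case: sort on the least significant key first and the most significant last, flipping reverse per key, and stability does the rest. A sketch using the example data above:

```python
rows = [
    dict(name="John", age=30, id=123),
    dict(name="John", age=40, id=456),
    dict(name="Fred", age=20, id=567),
]
keys = [("name", "asc"), ("age", "desc")]

# Apply the keys last-first; earlier passes are preserved for equal
# values because Python's sort is stable.
for col, direction in reversed(keys):
    rows.sort(key=lambda row: row[col], reverse=(direction == "desc"))
```

After this, rows is ordered Fred/20, John/40, John/30, as required.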

The sorting-lists-of-dictionaries page comes closer, but assumes that all the keys are numeric.

What I’ve actually done is to create a Reversed class which simply reverses the sense of calling __lt__:

class Reversed (object):
  "Wrap a value so that comparisons are reversed (sort only ever calls __lt__)"
  def __init__ (self, value):
    self.value = value
  def __lt__ (self, other):
    return self.value > other.value

and then to implement a key function like this:

def sorter (row):
  return tuple (
    (row[col] if dir=="asc" else Reversed (row[col])) 
      for (col, dir) in keys
  )

and sorted:

sorted (rows, key=sorter)

Now, all this certainly works, but is it a sane way to achieve the end? I was convinced that there should be something fancy I could do with a key function alone, avoiding an auxiliary class, but since the column values could be any datatype, including user-defined ones, I can’t just use the usual -number or -timedelta tricks.
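One alternative which avoids the auxiliary class is a comparator function wrapped with functools.cmp_to_key. That’s only available from Python 2.7/3.2, so it’s no help on the older versions, but for completeness a sketch looks like this:

```python
from functools import cmp_to_key

rows = [
    dict(name="John", age=30, id=123),
    dict(name="John", age=40, id=456),
    dict(name="Fred", age=20, id=567),
]
keys = [("name", "asc"), ("age", "desc")]

def compare(row_a, row_b):
    # Compare column by column, flipping the result for "desc" keys.
    for col, direction in keys:
        if row_a[col] == row_b[col]:
            continue
        result = -1 if row_a[col] < row_b[col] else 1
        return result if direction == "asc" else -result
    return 0

rows = sorted(rows, key=cmp_to_key(compare))
```

It, too, only ever compares column values against values of the same column, so arbitrary user-defined types are fine as long as they order among themselves.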

If it looks like this is a useful technique, I’ll add it to the sorting wiki page. But I thought it best to subject it to public scrutiny first. Many eyeballs… etc. etc.

London Python Code Dojo last night

Went to the advertised London Python Code Dojo last night. Not quite sure what to expect (altho’ Nicholas Tollervey, the front man, had done a readable write-up beforehand). It was great. About 30 people, some faces familiar to me from previous London Python meetups, others not. Beer & Pizza kindly supplied by the hosts, Fry-IT.

Altho’ a few people had attended other code dojos before, most of us were first-timers so there was an amount of feeling-the-way going on. Nicholas hooked his Mac up to a projector (with *both* kinds of editor available :) ) and people came up in pairs to code — pilot and co-pilot — for 10 minutes before handing over to a new co-pilot with the old co-pilot taking over the controls. The target app was a GraphViz graph of Twitter contacts, so the first 5 minutes was spent simply trying to set up a Twitter account with a name and email address which had not already been used!

Altho’ there were small issues — people using an unfamiliar environment, keyboard, editor etc — the 10 minute turnaround on each pair created a dynamism which kept the thing active. There is, apparently, an alternative approach where one guy stands at the front and talks through what he’s doing, but that doesn’t seem to me to have the same appeal.

There were several suggestions at the end as to what might be improved. The scaffolding code which Nicholas had already put in place to generate graphs given an edge-list was ideal since it made it feasible to actually create a solution within the 2 hours we were working. But some people thought more time was spent learning the Twitter API than was really useful. For my part I didn’t have a problem with that: it’s all part of the learning experience. The size of the group meant that people at the back of the room were less engaged. There were suggestions of two parallel groups competing, but I think it was decided to hold off till later on that.

What was interesting from my perspective was the way that different people approached the — admittedly loosely-specified — problem. There was an unspoken assumption that test-driven development was de rigueur, a discipline I don’t entirely share but am happy to go along with. What surprised me the most was that no-one fired up the interpreter to see what the Twitter API was doing. There were tests being written without (I think) knowing what the API was going to return. I’d just have started the interpreter, logged on, and retrieved a list of friends — or whatever — to see what I was getting back. But everyone’s different.

I don’t know if this is the idea, but one thing you do get is a kind of audience participation effect. Altho’ you have the pilot & co-pilot up front, verbalising their thought processes, you have a room full of back-seat drivers all giving advice at different times. Vastly entertaining.

Just a couple of suggestions from the point of view of a big group: maybe have the pilot / co-pilot hooked up to head-mics pushed through speakers; and have a slave laptop at the far end, projecting a VNC Viewer of the master onto a nearer screen/wall so people can see/hear what’s going on.

[I was the final co-pilot, for those who don’t know me :) ]

Passing params to db-api queries

Falling mostly into the aide-memoire category, but in case it’s helpful to anyone else…

You have a more-or-less complex SQL query which you’re executing via, eg, pyodbc (or some other dbapi-compliant module) and you need to pass in a set of positional parameters. So you have a where clause which looks something like this (although with better names, obviously):

WHERE
(
  t1.x = ? AND
  (t2.y = ? OR (t3.z = ? AND t2.y < ?))
)
OR
(
  t1.x > ? AND
  (t2.y BETWEEN ? AND ?)
)

So your Python code has to pass in seven parameters, in the right order, several of which are probably the same value. And then you realise that the WHERE clause is slightly wrong. So you adjust it, but now you have eight parameters, and two of the previous ones have changed, and there’s a new one. And then…

There’s no way to use named params with pyodbc, so you end up with a list/tuple of positional parameters which you have to eyeball-match up with the corresponding question marks in the query:

import pyodbc

...

cursor.execute (
  SQL,
  [from_date, threshold, threshold, to_date, interval, threshold]
)

Unless… you use a derived table in the query and use that to generate pseudo-named parameters. This is possible in MSSQL; I don’t know if it would work with other databases, although I can’t see why not. So your code becomes something like (NB no attempt at consistency here; it’s an example):

SELECT
  *
FROM
  t1
JOIN t2 ON t2.t1_id = t1.id
JOIN
(
  SELECT
    from_date = ?,
    to_date = ?,
    max_value = ?,
    interval = ?,
    threshold = ?
) AS params ON
(
  t1.x = params.from_date AND
  (t2.y = params.threshold OR
    (t3.z = params.interval AND t2.y < params.to_date)
  )
)
OR
(
  t1.x > params.threshold AND
  (t2.y BETWEEN params.from_date  AND params.to_date)
)

All you need to do then is to line up the order of params in your cursor.execute with the order of columns in the params derived table.
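Lining those two up is still a (small) chance to slip, so it’s worth generating both from one list. This is a hypothetical helper, not part of the report code described here: a single ordered list of (name, value) pairs drives both the derived-table fragment and the parameter list, so they can’t drift out of step.

```python
# One ordered list of (name, value) pairs is the single source of truth.
params = [
    ("from_date", "2009-01-01"),
    ("to_date", "2009-12-31"),
    ("threshold", 100),
]

# Build the derived-table SELECT and the matching value list together.
derived_table = "SELECT " + ", ".join("%s = ?" % name for name, _ in params)
values = [value for _, value in params]

# cursor.execute(SQL_TEMPLATE % derived_table, values)  # hypothetical use
```

Adding, removing or reordering a parameter is then a one-line change.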

Alternatives? Well, you could use an ORM of some sort — goodness knows there are enough of them about — but maybe, like me, you find that learning another syntax for something which you can do perfectly well in its native SQL is onerous. Another approach is to set up local variables in your executed statement and use these in much the same way, eg:

DECLARE
  @v_from_date DATETIME,
  @v_to_date DATETIME,
  @v_threshold INT

SELECT
  @v_from_date = ?,
  @v_to_date = ?,
  @v_threshold = ?

SELECT
  *
FROM
  ..
WHERE
  (t1.x < @v_from_date ...)

This works (and is, in fact, how we generate lightweight SQL-to-Excel reports). But there’s a bit more boilerplate involved.

A round of applause to the cherrypy maintainers

I’ve been using cherrypy for a production server at work, hosting a web interface to our helpdesk system. It’s been stable and usable even over several upgrades (I started off somewhere at cherrypy 2.x). I recently raised an issue concerning a multipart form (such as you use with a file upload control) with non-ascii text dropped into it. And a fix was applied within a couple of weeks.

I’d like to think that the trouble I took to narrow the problem down to a repeatable case helped things along. (And, goodness knows, I’ve encouraged enough people to do that on the Python lists). But in any case I very much appreciate the response from the cherrypy developers. This is one of those annoying technical things which users just can’t understand: “But why does it crash when I put a pound sign into the text?” (or when you cut-and-paste from a Word doc and you get those smart-quotes).

But not only are they responsive to bug reports; they also keep their docs up-to-date, including the very useful sections which indicate what’s changed from previous versions.

smtplib and failed recipients

Just a quick aide-memoire and a note to anyone else who’s caught out… when using Python’s smtplib module to send email. If you’re like me, you may have missed the following documented behaviour:

This method will return normally if the mail is accepted for at least one recipient. Otherwise it will throw an exception. That is, if this method does not throw an exception, then someone should get your mail. If this method does not throw an exception, it returns a dictionary, with one entry for each recipient that was refused. Each entry contains a tuple of the SMTP error code and the accompanying error message sent by the server.

In other words, sendmail will return successfully even if some of the recipients couldn’t receive the email. You can work out which recipients failed from the dictionary returned. I’d assumed that if *any* recipient didn’t get the email then an exception would be raised. As I say, the actual behaviour is clearly documented, but just in case…
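To illustrate the shape of that return value (the refused dict below is made up, but matches the documented form: refused recipient mapped to an (SMTP error code, error message) tuple):

```python
recipients = ["good@example.com", "bad@example.com"]

# What sendmail might hand back: an empty dict means everyone was
# accepted; entries only appear for refused recipients.
refused = {"bad@example.com": (550, "No such user here")}

delivered = [r for r in recipients if r not in refused]
```

So the thing to check after a “successful” sendmail call is whether the returned dict is empty, not merely whether an exception was raised.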

Python & Excel

Chris Withers has just set up a website at http://www.python-excel.org/ to act (if I understand correctly) as a sort of springboard for people who are looking for help in using Excel with Python. With that domain name, it might also become a company specialising in Python training, I suppose. Chris points out that if you simply drop “python excel” into Google or Bing… or anywhere[*], you get a whole scatter of possibly useful pages but no clear focus of information.

The Python Excel Google Group has been in existence for a while now and people are pointed there from the Python mailing lists with a certain regularity but it doesn’t figure particularly highly in search engine results either. Hopefully the presence of python-excel.org will at least provide a focus for search engine enquiries.

[*] “Balham, Gateway to the South” (about 3:09)