Bayesian Filtering of Spam

This blog now gets about 3 hits per day, and by hits I mean comment spam. This is no fun at all. I intended to turn off the back-channels: comments, trackback, and eventually pingback too. But that is no fun either. So I started to look around for spam-blocking solutions. Besides blacklists, mentioned earlier here, BayesianClassification caught my eye, so I added a new wiki page as a starting point. Although most solutions are intended to fight email spam, I guess they could be combined with blacklists to fight comment spam as well.
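For readers new to the idea, here is a minimal Python sketch of the scoring described in Paul Graham's "A Plan for Spam": each token gets a spam probability from how often it appears in good and bad messages, and the most telling tokens are combined into one score. The function names and the handling of rare tokens are my own simplification for illustration, not code from any existing filter.

```python
from math import prod


def token_spam_probability(good_count, bad_count, n_good, n_bad):
    """Spam probability of one token, following "A Plan for Spam".

    good_count / bad_count: occurrences of the token in good / bad messages.
    n_good / n_bad: number of good / bad messages seen (both must be > 0).
    """
    g = 2 * good_count  # good occurrences are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        # Simplification: rarely seen tokens are treated like unseen ones (0.4).
        return 0.4
    good_ratio = min(1.0, g / n_good)
    bad_ratio = min(1.0, b / n_bad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))  # clamp so no single token is decisive


def combined_spam_probability(probs, limit=15):
    """Combine the `limit` probabilities that lie farthest from the neutral 0.5."""
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)
```

A comment would then be rejected when the combined score passes some threshold (Graham uses 0.9 for email); the right value for comment spam is something I would have to find out by experiment.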

There is currently no PHP implementation, so I started to convert Craig Morrison's C implementation [2] of Paul Graham's "A Plan for Spam" [1]. Unfortunately I got stuck, because the PECL sqlite extension does not yet include the sqlite_compile and sqlite_step functions, which Craig's version uses to do some fun stuff with SQLite:

sqlite_compile() is used as a precursor to sqlite_step(). It takes an
SQL statement and "compiles" it into a VM (virtual machine) that sqlite
uses for each successive call to sqlite_step().

What it does is allow me to query a database without using a callback
function. Each call to sqlite_step() returns the next row in the result
set from the initial sqlite_compile() call.

I needed to do that, because I have to do lookups in the same database
and I could not do those lookups recursively inside the callback which
would be needed when using sqlite_exec(). The use of the virtual
machines allows me to maintain a separate state for each lookup that I
need to do.

Craig answered me by email when I asked him whether there is a workaround that gets along without these functions. Thanks to Craig.
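To illustrate the pattern Craig describes, here is a small sketch using Python's built-in sqlite3 module, where each cursor keeps its own statement state, much as the separate virtual machines do. This is neither Craig's C code nor anything the PECL extension offers; the table and column names (tokens, good_tokens, bad_count, good_count) and the function name are hypothetical and made up for the example.

```python
import sqlite3


def collect_token_counts(conn):
    """Walk one result set row by row and do a second lookup for each row."""
    outer = conn.cursor()  # steps through the main result set
    inner = conn.cursor()  # separate state for the per-row lookup
    results = {}
    outer.execute("SELECT token, bad_count FROM tokens ORDER BY token")
    for token, bad_count in outer:
        # This query runs while the outer result set is still open.
        inner.execute(
            "SELECT good_count FROM good_tokens WHERE token = ?", (token,)
        )
        row = inner.fetchone()
        good_count = row[0] if row else 0
        results[token] = (good_count, bad_count)
    return results


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tokens (token TEXT PRIMARY KEY, bad_count INTEGER);
    CREATE TABLE good_tokens (token TEXT PRIMARY KEY, good_count INTEGER);
    INSERT INTO tokens VALUES ('casino', 12), ('hello', 1);
    INSERT INTO good_tokens VALUES ('hello', 30);
    """
)
print(collect_token_counts(conn))
```

The point is only that the inner query does not disturb the outer iteration, which is what a callback-based sqlite_exec() could not give Craig.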

[1] http://www.paulgraham.com/spam.html
[2] http://sourceforge.net/projects/bayesiancfilter
