Comments working again
I finally did it and wrote the spam filter that I had promised a while back. It was less work than I thought, actually. Anyway, you can now write comments again.
The filter is a so-called Naive Bayes filter. It calculates the probability that a comment is spam, based on how often the words in the comment were observed in spam comments and in normal comments. The implementation generally follows [the English Wikipedia article on the subject](http://en.wikipedia.org/wiki/Bayesian_spam_filtering), without any additional heuristics for rare words and the like.
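For the curious, here is a minimal Python sketch of the calculation as the Wikipedia article describes it. It is not the code running on this site. The function names are made up for illustration, and it assumes that spam and normal comments are equally likely beforehand.

```python
from math import prod

def word_spam_probability(spam_count, ham_count, total_spam, total_ham):
    """Probability that a comment containing this word is spam.

    spam_count / ham_count: how many spam / normal comments contained the word.
    total_spam / total_ham: how many spam / normal comments were seen overall.
    Assumes equal prior probability for spam and not-spam.
    """
    p_word_given_spam = spam_count / total_spam
    p_word_given_ham = ham_count / total_ham
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)

def combine(probabilities):
    """Combine per-word probabilities into one probability for the comment."""
    spam = prod(probabilities)
    ham = prod(1 - p for p in probabilities)
    return spam / (spam + ham)
```

A comment whose words all sit at 0.5 stays at 0.5, while two words at 0.9 each push the combined result to about 0.99. For long comments, multiplying many small numbers can underflow, so a real implementation might sum logarithms instead.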
If anyone cares, I can post the code for you all to read. It isn’t that much. The most significant single point that I noticed was that the spam filter might go crazy if it finds a word that was never seen as spam or never seen as not-spam. This is the so-called zero-frequency problem: a count of zero turns into a probability of exactly 0 or 1, which then overrides every other word in the comment. To solve that, whenever I add a new word, I first set both its sightings as spam and its sightings as not-spam to one (and then add one more for whatever I saw it as). This makes the results slightly less accurate, but they remain good enough to work.
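A rough sketch of that counting trick in Python could look like this. The class and method names are hypothetical, and `spam_ratio` is a simplified per-word estimate that ignores the total number of comments seen, so treat it as an illustration rather than the actual code.

```python
class WordCounts:
    """Per-word sighting counts, started at one to avoid zero frequencies."""

    def __init__(self):
        self.spam = {}
        self.ham = {}

    def observe(self, word, is_spam):
        # A new word starts with one sighting on each side ...
        if word not in self.spam:
            self.spam[word] = 1
            self.ham[word] = 1
        # ... and then gets one more for what it was actually seen as.
        if is_spam:
            self.spam[word] += 1
        else:
            self.ham[word] += 1

    def spam_ratio(self, word):
        # Simplified estimate: share of sightings that were spam.
        # Unknown words are treated as one sighting on each side (0.5).
        s = self.spam.get(word, 1)
        h = self.ham.get(word, 1)
        return s / (s + h)
```

With this, a word seen three times in spam and never in normal comments ends up with counts of 4 and 1, so its ratio is 0.8 rather than a flat 1.0.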
Currently the filter has three levels. If the probability that a comment is spam is higher than 95%, then the comment isn’t even written to the database, but rejected immediately. A comment with a spam probability of more than 70% (but not more than 95%) is saved, but remains hidden until I’ve decided whether it is spam or not. Anything below that shows up right away. Every time I make such a decision, the spam filter gets trained a little bit to become more accurate. Of course, I may have to change these thresholds in the future.
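In code, the three levels boil down to two comparisons. This is only a sketch: the constant and function names are invented here, and the "publish" outcome for everything at or below 70% is my reading of what the third level means.

```python
REJECT_THRESHOLD = 0.95  # above this: rejected, never stored
HOLD_THRESHOLD = 0.70    # above this: stored, hidden until reviewed

def decide(spam_probability):
    """Map a spam probability to one of the three levels."""
    if spam_probability > REJECT_THRESHOLD:
        return "reject"
    if spam_probability > HOLD_THRESHOLD:
        return "hold"
    # Assumed third level: the comment is shown immediately.
    return "publish"
```

Keeping the thresholds as named constants makes them easy to adjust later.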
Written on July 3rd, 2010 at 11:36 pm