Paul Boven wrote:
Hi everyone,

Here are some observations on using Bayes and autolearning I would like to share, and have your input on.

Autolearning is turning out to be more trouble than it's worth. Although it helps the system get to know the ham we send and receive, and learn some of the spam on its own, it also tends to 'reward' the 'best' spammers out there. Spam that hits none of the rules (e.g. the current deluge of stock spam) drives the scores for all kinds of misspelled words towards the 'hammy' side of the curve, which makes it possible for more of that kind of junk to slip through even if it hits SURBLs or other rules.

The second weakness in the current Bayes setup concerns the 're-training' of the filter. The assumption in Bayes is that if a mail gets submitted for training, it will first be 'forgotten' and then correctly learned as spam (or ham). But in order to 'forget', SpamAssassin must be able to recognise that the submitted message is the same as a previously autolearned one. Currently this is done by checking the Message-ID or a checksum of the headers. There are two potential pitfalls here. Firstly, the retrained message is never exactly the same as the original: it has made another hop to the mailstore, or has been mangled by Exchange or some user agent. Secondly, if the original Message-ID was not used by the autolearner, the SA-generated Message-ID will not be the same as the original. As soon as that happens, retraining becomes far less powerful: when the original faulty autolearning doesn't get 'forgotten', the retraining will mostly cancel it out, but never gets a chance to correct the Bayes scores for those tokens.
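The failure mode is easy to demonstrate. Here is a minimal sketch (hypothetical message contents and function names, not SpamAssassin's actual code) of why identifying a message by a checksum of its headers breaks as soon as the mail makes one more hop:

```python
import hashlib

def header_id(raw_message: str) -> str:
    """Identify a message by hashing its headers (the fragile approach)."""
    headers = raw_message.split("\n\n", 1)[0]
    return hashlib.sha1(headers.encode()).hexdigest()

original = (
    "Received: from mx1.example.com\n"
    "Message-ID: <abc@spammer.invalid>\n"
    "Subject: cheap st0cks\n"
    "\n"
    "Buy now!\n"
)

# The same message after one more hop: an extra Received header is prepended.
redelivered = "Received: from store.example.com\n" + original

# The header checksum no longer matches, so the original autolearned
# entry is never 'forgotten' before the corrective retraining happens.
print(header_id(original) == header_id(redelivered))  # False
```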

The end-users at my site are fairly good at submitting their spams to the filter (and fairly vocal if the filter misses too much). But there are also accounts that are not read by humans, such as accounts that gate onto mailing lists. All of these get spam too, and the spam gets autolearned, sometimes in the wrong direction. With retraining only partially effective as shown above, what happens in the end is that some spams, by sheer volume and sameness, manage to bias the filter in the wrong direction. Surely I'm not the only one who experiences this, because 'My Bayes has gone bad' is a frequent subject in this forum.

Some suggestions on improving the performance of the Bayes system:

1.) Messages that have been manually submitted should have a higher 'weight' in the Bayes statistics than autolearned messages.

2.) There should be a framework within SpamAssassin that makes it easy for end-users to submit their spam for training. Currently, all kinds of scripts are available outside the main SpamAssassin distribution (I've written my own, too) that attempt to get the message out of the mail client or server, as close as possible to the original, to feed back to Bayes -- which is close to impossible with some of the mail servers out there. SpamAssassin currently only includes half the Bayes interface: you can have auto-learning, but for manual learning or retraining you're on your own to some extent.

3.) Message classification should not be based on something as fragile as a mail header or a checksum thereof, but on the actual content. The goal of this classifier should be to identify a message as having been learned before, despite whatever has happened to it after it went through SpamAssassin.

4.) The Bayes subsystem should store this classification, and all the tokens it learned. This way we can be sure that we correctly unlearn an autolearned message. The entries in this database could be timestamped so they can be removed after some months, to prevent unlimited growth.
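To make suggestions 3 and 4 concrete, here is a toy sketch in Python (all names made up, and the normalisation deliberately simplistic): fingerprint the normalised body instead of the headers, and record exactly which tokens each fingerprint taught the database, so relearning first forgets the old contribution.

```python
import hashlib
import re
import time
from collections import Counter

def fingerprint(raw_message: str) -> str:
    """Identify a message by its normalised body, not its headers."""
    body = raw_message.split("\n\n", 1)[-1]
    normalised = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha1(normalised.encode()).hexdigest()

class BayesStore:
    """Toy token store that remembers what each message taught it,
    so a wrong autolearn can be unlearned exactly (suggestion 4)."""
    def __init__(self):
        self.spam_tokens = Counter()
        self.learned = {}  # fingerprint -> (tokens, timestamp) for expiry

    def learn_spam(self, raw_message: str):
        fp = fingerprint(raw_message)
        if fp in self.learned:      # seen before: forget first, then relearn
            self.forget(fp)
        tokens = re.findall(r"\w+", raw_message.split("\n\n", 1)[-1].lower())
        self.spam_tokens.update(tokens)
        self.learned[fp] = (tokens, time.time())

    def forget(self, fp: str):
        tokens, _ = self.learned.pop(fp)
        self.spam_tokens.subtract(tokens)

msg     = "Message-ID: <1@x>\n\nHot st0ck tip inside"
mangled = "Received: hop\nMessage-ID: <2@y>\n\nHot  st0ck tip inside \n"

store = BayesStore()
store.learn_spam(msg)
# Despite the extra hop, changed Message-ID and whitespace mangling, the
# fingerprints match, so retraining replaces instead of double-counting.
print(fingerprint(msg) == fingerprint(mangled))  # True
```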

Bayes is a very powerful system, especially for recognising site-specific ham. But at the moment, approx. 30% of the spam that slips through my filter has 'autolearn=ham' set, and another 60% of the spam slipping through has a negative Bayes score to help it along. For the moment, I've disabled the autolearning in my Bayes system.
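For anyone wanting to do the same: autolearning can be switched off in local.cf while keeping Bayes scoring itself active, or the autolearn thresholds can be tightened instead of disabling it outright (the threshold values below are just examples, pick your own):

```
# local.cf -- keep Bayes scoring, stop the autolearner
bayes_auto_learn 0

# ...or keep autolearning but make it much more conservative, e.g.:
# bayes_auto_learn_threshold_nonspam -0.5
# bayes_auto_learn_threshold_spam 15.0
```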

Regards, Paul Boven.




Several of the reasons you've mentioned are why I don't do autolearn. Manual training and user feedback are, imho, the best way to get the Bayes db up to spam-fighting levels. It may be more troublesome for ISPs with a varied mix of mail, but here we have a pretty standard set of mails -- by that I mean that mail to many of our users sounds about the same. I can grab a few dozen mails out of our archives, send them to my server's 'ham' box and let the cron job train on those.


As far as a standard interface goes, there is no standard mail server/OS/environment, so this is generally something the admin or a 3rd party would need to draft up. I have scripts that were created from 3rd-party parts, munged, and grafted into my own. We have a standard mail client here, and a standard way for users to submit mails to the global spambox. My scripts remove any markup from that transmission (these are not forwarded, but redirected) and drop the mail files into a spam folder, where I look at the mails to make sure they are of spam quality; the last step is to move them to where the Linux server picks up the mails and does the training. As you can see there are a lot of steps, but this ensures a user doesn't accidentally train on the wrong mail.
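The multi-step pipeline described above can be sketched roughly like this (directory names, the markup marker and the cleanup step are all hypothetical; my real scripts are site-specific):

```python
import pathlib
import shutil
import tempfile

# Hypothetical directory layout mirroring the stages described above.
root = pathlib.Path(tempfile.mkdtemp())
incoming, review, train = (root / d for d in ("incoming", "review", "train"))
for d in (incoming, review, train):
    d.mkdir()

def strip_markup(raw: str) -> str:
    """Placeholder for removing client-added forwarding markup."""
    return raw.replace(">>> Forwarded message <<<\n", "")

# A user redirects a suspect mail into the incoming drop box.
(incoming / "0001.eml").write_text(
    ">>> Forwarded message <<<\nSubject: spam\n\nbody")

# Step 1: clean each submission and park it for human review.
for f in incoming.iterdir():
    (review / f.name).write_text(strip_markup(f.read_text()))
    f.unlink()

# Step 2: after the admin confirms it really is spam, promote it to the
# folder the training cron job (e.g. sa-learn --spam) picks up.
for f in review.iterdir():
    shutil.move(str(f), str(train / f.name))

print(sorted(p.name for p in train.iterdir()))  # ['0001.eml']
```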

From my own testing, SA does create a hash of the mail for the msg-id. We were thinking that if a spammer created a message with the exact same Message-ID every time, that could bypass any training on future near-identical spam messages, because the learner would ignore the mail. There are flaws in every system, but the ones in the present one, imo, are not bad enough to make it unusable. Bayes in SA is very good: it does a good job, and is easy to set up and train -- but by the same token, it's easy to hose your db with incorrect training (as you've seen). That's why I've got so many steps in the training process. Then again, I use RBLs etc. at the start of the SMTP conversation, which blocks the vast majority of spam, so I don't have to use bandwidth on, or store, any spams beyond those that make it past the initial SMTP check.

--
Thanks,
James
