On Fri, 2005-05-06 at 14:28 +0200, Paul Boven wrote:
> Hi everyone,
>
> Here are some observations on using Bayes and autolearning I would like
> to share, and have your input on.
>
> Autolearning is turning out to be more trouble than it's worth.
> Although it helps the system to get to know the ham we send and get, and
> learn some of the spams on its own, it also tends to 'reward' the 'best'
> spammers out there. Spams that hit none of the rules (e.g. the current
> deluge of stock-spams) drive the score for all kinds of misspelled words
> towards the 'hammy' side of the curve, which makes it possible for more
> of that kind of junk to slip through even if it hits SURBLs or other rules.
>
> The second weakness in the current Bayes setup concerns the
> 're-training' of the filter. The assumption in Bayes is that if a mail
> gets submitted for training, it will first be 'forgotten' and then
> correctly learned as spam (or ham). But in order to 'forget',
> SpamAssassin must be able to recognise that the submitted message is the
> same as a previously autolearned one. Currently this is done by checking
> the Msg-ID or some checksum of the headers. There are two potential
> pitfalls here: Firstly, the retraining message is never exactly the same
> as the original message. It has made another hop to the mailstore, or has
> been mangled by Exchange or some user agent. Secondly, especially if the
> original Msg-ID was not used by the autolearner, the SA-generated Msg-ID
> would not be the same as the original. As soon as that happens,
> retraining becomes far less powerful: when the original faulty
> autolearning doesn't get 'forgotten', the retraining will mostly cancel
> it out, but never get a chance to correct the Bayes scores for those tokens.
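
To make the failure mode above concrete, here is a minimal Python sketch of a forget-then-relearn store. It is a toy model, not SpamAssassin's actual Bayes code: the class `TinyBayesStore`, its methods, and the naive per-token ratio are all made up for illustration. The only point is that retraining corrects the counts when the message ID matches, and merely cancels them out when it does not.

```python
from collections import Counter


class TinyBayesStore:
    """Toy token store (hypothetical; not SpamAssassin's implementation)."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of ham messages containing it
        self.seen = {}          # message ID -> (label, tokens) as last learned

    def _counts(self, label):
        return self.spam if label == "spam" else self.ham

    def learn(self, msg_id, tokens, label):
        tokens = set(tokens)
        previous = self.seen.get(msg_id)
        if previous is not None:
            prev_label, prev_tokens = previous
            if prev_label == label:
                return  # already learned with this label, nothing to do
            # 'Forget' the earlier (wrong) learning before relearning.
            self._counts(prev_label).subtract(prev_tokens)
        self._counts(label).update(tokens)
        self.seen[msg_id] = (label, tokens)

    def spamminess(self, token):
        # Naive ratio for illustration only; 0.5 means "no evidence".
        s, h = self.spam[token], self.ham[token]
        return s / (s + h) if (s + h) else 0.5


# Case 1: the retraining copy carries the same ID, so the bad ham
# learning is forgotten and the token ends up fully spammy.
same_id = TinyBayesStore()
same_id.learn("id-1", ["st0ck"], "ham")    # faulty autolearn
same_id.learn("id-1", ["st0ck"], "spam")   # user retrains

# Case 2: the ID changed on the way, so both learnings stay on record
# and the token is stuck at neutral.
changed_id = TinyBayesStore()
changed_id.learn("id-1", ["st0ck"], "ham")
changed_id.learn("id-2", ["st0ck"], "spam")
```

In the first case the token scores 1.0, in the second only 0.5, which is the 'cancel out but never correct' effect Paul describes.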

DSPAM gets around this by assigning each message a DSPAM-ID, which is kept in one of three places: in the body of the mail, attached to the mail, or in the headers. It then keeps a record of every DSPAM-ID and looks for it in the mail when it's sent back for training.

I have problems with this method because it clobbers any database on a sufficiently high-volume site (as do Bayes and AWL in general). There must be some other way to do it, but doing multiple writes to a database for every mail passing through a system is a real resource glutton (and so I have to have them disabled).

Users have problems with the above method because they don't like extra stuff in their messages (if the DSPAM-ID is at the bottom of every mail, or attached), and if you put it in the headers a user cannot forward it (because you don't get the headers in all cases).

Cheers,

Mike

--
| Mike Grice                Broadband Solutions for
| Systems Engineer          Home & Business @
| PlusNet plc.              www.plus.net
+ ----- PlusNet - The smarter way to broadband ------
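
P.S. A rough Python sketch of the ID-tagging idea, in case it helps the discussion. This is not DSPAM's code or its real tag format: the header name `X-Example-Signature`, the `!EXAMPLE:...!` body marker, and the in-memory `records` dict are all invented for illustration. It shows the two trade-offs mentioned above: one database write per delivered message, and a header-only tag that is lost when only the body is forwarded.

```python
import re
import uuid
from email import message_from_string

HEADER = "X-Example-Signature"                    # hypothetical header name
BODY_TAG = re.compile(r"!EXAMPLE:([0-9a-f]{32})!")  # hypothetical body marker


def tag_message(raw, records, tokens, where="header"):
    """Assign an ID, record the tokens under it, and embed the ID in the mail."""
    sig = uuid.uuid4().hex
    records[sig] = set(tokens)          # one write per message passing through
    msg = message_from_string(raw)      # assumes a simple, non-multipart mail
    if where == "header":
        msg[HEADER] = sig
    else:
        msg.set_payload(msg.get_payload() + "\n!EXAMPLE:" + sig + "!\n")
    return sig, msg.as_string()


def find_signature(raw):
    """Recover the ID from a mail sent back for training, or None if lost."""
    msg = message_from_string(raw)
    if msg[HEADER]:
        return msg[HEADER].strip()
    match = BODY_TAG.search(raw)
    return match.group(1) if match else None


records = {}
raw = "From: a@example.com\nSubject: hi\n\nBuy stock now\n"

# Tag kept in the headers: invisible to users, but gone if only the body
# survives forwarding.
sig_h, tagged_h = tag_message(raw, records, ["buy", "stock"], where="header")
forwarded_h = tagged_h.split("\n\n", 1)[1]

# Tag kept in the body: survives forwarding, but users see it in every mail.
sig_b, tagged_b = tag_message(raw, records, ["buy", "stock"], where="body")
forwarded_b = tagged_b.split("\n\n", 1)[1]
```

Here `find_signature(forwarded_h)` comes back empty while `find_signature(forwarded_b)` still finds the ID, and `records` has grown by one entry per message either way.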