Re: The trouble with Bayes

Mike Grice 6 May 2005 12:40:32 -0000

On Fri, 2005-05-06 at 14:28 +0200, Paul Boven wrote:
> Hi everyone,
> 
> Here are some observations on using Bayes and autolearning I would like 
> to share, and have your input on.
> 
> Autolearning is turning out to be more trouble than it's worth. 
> Although it helps the system to get to know the ham we send and get, and 
> learn some of the spams on its own, it also tends to 'reward' the 'best' 
> spammers out there. Spams that hit none of the rules (e.g. the current 
> deluge of stock-spams) drive the score for all kinds of misspelled words 
> towards the 'hammy' side of the curve, which makes it possible for more 
> of that kind of junk to slip through even if it hits SURBLs or other rules.
> 
> The second weakness in the current Bayes setup concerns the 
> 're-training' of the filter. The assumption in Bayes is that if a mail 
> gets submitted for training, it will first be 'forgotten' and then 
> correctly learned as spam (or ham). But in order to 'forget', 
> SpamAssassin must be able to recognise that the submitted message is the 
> same as a previously autolearned one. Currently this is done by checking 
> the MsgID or some checksum of the headers. There are two potential 
> pitfalls here: Firstly, the retraining message is never exactly the same 
> as the original message. It's made another hop to the mailstore, or has 
> been mangled by Exchange or some user agent. Secondly, especially if the 
> original Msg-ID was not used by the autolearner, the SA-Generated Msg-ID 
> would not be the same as the original. As soon as that happens, 
> retraining becomes far less powerful: when the original faulty 
> autolearning doesn't get 'forgotten', the retraining will mostly cancel 
> it out, but never get a chance to correct the Bayes scores for those tokens.
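
To make the quoted point concrete, here is a toy sketch of the
difference between "forget, then relearn" and "relearn without
forgetting".  This is not SpamAssassin's code: the ToyBayes class, the
per-token counters and the simple ratio used as a score are all
assumptions made for illustration.

```python
from collections import Counter


class ToyBayes:
    """Toy token store (hypothetical, not SpamAssassin's implementation)."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()
        self.seen = {}  # msg_id -> (label, tokens) as last learned

    def learn(self, msg_id, tokens, label):
        # If the ID is recognised, undo the earlier learning first.
        previous = self.seen.get(msg_id)
        if previous is not None:
            old_label, old_tokens = previous
            getattr(self, old_label).subtract(old_tokens)
        getattr(self, label).update(tokens)
        self.seen[msg_id] = (label, list(tokens))

    def spamminess(self, token):
        # Plain ratio of spam sightings; 0.5 means "no evidence either way".
        s, h = self.spam[token], self.ham[token]
        return s / (s + h) if (s + h) else 0.5


# Case 1: the retrained copy carries the same ID, so the bad
# autolearn is forgotten before the message is learned as spam.
matched = ToyBayes()
matched.learn("id-1", ["st0ck"], "ham")    # faulty autolearn
matched.learn("id-1", ["st0ck"], "spam")   # user retrains

# Case 2: the ID changed on the way back, so nothing is forgotten
# and the two learnings merely cancel out.
mismatched = ToyBayes()
mismatched.learn("id-1", ["st0ck"], "ham")
mismatched.learn("id-1-mangled", ["st0ck"], "spam")
```

In the first case the token ends up fully spammy; in the second it only
gets back to neutral, which is the "cancel it out" effect described
above.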


DSPAM gets around this by assigning each message a DSPAM-ID, which is
kept in one of three places: the body of the mail, an attachment to the
mail, or the headers.  It then keeps a record of every DSPAM-ID and looks
for it in the mail when it's sent back for training.
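
A rough sketch of that idea, using the header variant.  This is not
DSPAM's actual format or code: the X-Example-Signature header name, the
in-memory signatures dict and both function names are made up for the
example.

```python
import re
import uuid

TAG = "X-Example-Signature"  # hypothetical header name
_PATTERN = re.compile(re.escape(TAG) + r": ([0-9a-f]{32})")

# Signature ID -> (label, tokens) recorded at delivery time.
# A real system would keep this in a database, not in memory.
signatures = {}


def tag_message(raw, tokens, label):
    """Record what was learned and prepend a header carrying the ID."""
    sig = uuid.uuid4().hex
    signatures[sig] = (label, list(tokens))
    return f"{TAG}: {sig}\n{raw}"


def find_signature(raw):
    """Return the stored (label, tokens) for a message sent back, or None."""
    match = _PATTERN.search(raw)
    if match is None:
        return None
    return signatures.get(match.group(1))


original = "Subject: hot tip\n\nbuy st0ck now"
delivered = tag_message(original, ["st0ck"], "ham")

# An extra hop adds headers, but the ID still identifies the message.
after_hop = "Received: from mailstore.example\n" + delivered
# A forward that drops the headers loses the ID entirely.
forwarded_without_headers = "buy st0ck now"
```

Looking up after_hop finds the original record even though the message
is no longer byte-for-byte identical, while
forwarded_without_headers finds nothing.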

I have problems with this method because it clobbers any database on a
sufficiently high-volume site (as do Bayes and AWL in general).  There
must be some other way to do it, but doing multiple writes to a database
for every mail passing through a system is a real resource glutton (and
so I have to have them disabled).

Users have problems with the above method because they don't like extra
stuff in their messages (if the DSPAM-ID is at the bottom of every mail,
or attached), and if you put it in the headers a user cannot always
forward the mail for training (because the headers don't survive a
forward in all cases).

Cheers
Mike

-- 
| Mike Grice                  Broadband Solutions for
| Systems Engineer                  Home & Business @
| PlusNet plc.                           www.plus.net
+ ----- PlusNet - The smarter way to broadband ------
