Hello Thomas,

On Thursday, 17 February 2005 19:17, Thomas Bolioli wrote:
> Interesting but what happens in the case where someone, like me, is
> getting 250+ spam a day and only about ten or so legitimate emails? This
> is not counting this account that my mailing lists go to which I have
> far better bayes performance on (1:100 spam/ham ratio instead of 10:1 or
> lower with my other accounts). With autotraining turned on, that means
> far more spam will get trained.

Yes.

> Even if I turned off auto training, and
> trained only the ham that came through, it would simply allow changes in
> spam to begin to defeat the bayes filter over time, is that not so?

Yes. You must train both ham and spam frequently to keep up with small changes in the mails. Bayes needs to know which tokens occur in ham and which occur in spam.

For the filter it doesn't matter what you call ham or spam. It just collects information about two classes, 'ham' and 'spam', and decides from the statistical data to which class a new message probably belongs. It also does not matter whether you have a high spam-to-ham or a high ham-to-spam ratio. Bayesian filtering works on tokens seen before. If you don't train spam you will spoil your filter, because it doesn't learn new tokens.

If you train 1:1, the filter assumes that 50% of your mail is ham and 50% is spam. In reality it may be that 96% is spam.

What happens when your spam ratio is 100 to 1? This is an extreme example:

Ham  = 100
Spam = 10000

[EMAIL PROTECTED]@: in 50 ham and 100 spam
=> 100 / (50+100) = 66.7%

Every second ham message contains the token, and one in 100 spam messages does. That means if you get a message with the token, it is spam in 2 of 3 cases! That is what Bayesian filtering says: I got a message, it has certain tokens, and according to the history a message with these tokens was spam 2 out of 3 times.

What happens when you train only 100 spam messages to get the ratio to 1:1?

Ham  (100% trained) = 100
Spam (1% trained)   = 100

[EMAIL PROTECTED]@: in 50 ham and 1 (= 1%) spam (we were lucky and got one message with the token)
=> 1 / (50+1) = 2.0%

The Bayesian filter will now say that the message is spam with a probability of 2.0% and ham with 98.0%. The filter is useless: it declares everything to be ham. If the ratio is skewed the other way, it declares everything to be spam.

> Doesn't that mean that the expiration system that SA employs solves that
> problem?

No. Expiration only reduces the size of the database.
It drops unused tokens, that is, tokens that have not appeared in any message for a long time. So if you don't train spam, eventually you will have no spammy tokens left, which spoils the filter.

Regards
Thomas Arend
[..]
-- 
icq:133073900
http://www.t-arend.de
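
For anyone who wants to check the numbers in the two training scenarios, here is a minimal Python sketch. It only reproduces the simplified per-token estimate used in the example (spam messages with the token divided by all trained messages with the token); it is not SpamAssassin's actual scoring, and the function name `token_spam_probability` is made up for illustration.

```python
def token_spam_probability(ham_with_token: int, spam_with_token: int) -> float:
    """Share of the trained messages containing the token that were spam."""
    total = ham_with_token + spam_with_token
    if total == 0:
        raise ValueError("token was never seen in training")
    return spam_with_token / total


# Scenario 1: everything is trained (100 ham, 10000 spam).
# The token shows up in 50 ham and in 100 spam messages.
full = token_spam_probability(ham_with_token=50, spam_with_token=100)

# Scenario 2: all 100 ham are trained, but only 100 of the 10000 spam (1%).
# Only 1 of the 100 spam messages carrying the token ends up in the training set.
balanced = token_spam_probability(ham_with_token=50, spam_with_token=1)

print(f"trained on everything: {full:.1%} spam")      # 66.7% spam
print(f"trained 1:1:           {balanced:.1%} spam")  # 2.0% spam
```

The same token goes from "probably spam" to "almost certainly ham" purely because most of the spam was left out of training, which is the point of the example.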