Hello Thomas,

On Thursday, 17 February 2005 19:17, Thomas Bolioli wrote:
> Interesting but what happens in the case where someone, like me, is
> getting 250+ spam a day and only about ten or so legitimate emails? This
> is not counting this account that my mailing lists go to which I have
> far better bayes performance on (1:100 spam/ham ratio instead of 10:1 or
> lower with my other accounts). With autotraining turned on, that means
> far more spam will get trained.

Yes. 

> Even if I turned off auto training, and 
> trained only the ham that came through, it would simply allow changes in
> spam to begin to defeat the bayes filter over time, is that not so?

Yes.

You must train both ham and spam frequently to keep up with small changes in 
the mails. Bayes needs to know which tokens occur in ham and which in spam. 
The filter does not care what you call ham or spam. It just collects 
information about two classes, 'ham' and 'spam', and decides from the 
statistical data which class a new message probably belongs to. It also does 
not matter to the filter whether you have a high spam-to-ham or a high 
ham-to-spam ratio.

Bayesian filtering works on tokens seen before. If you don't train spam, you 
will spoil your filter, because it doesn't learn new tokens.
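
To make "tokens seen before" concrete, here is a minimal sketch of the 
bookkeeping such a filter needs. The class name TokenStats and the 
whitespace tokenizer are my own assumptions for illustration. This is not 
how SpamAssassin tokenizes or stores its data.

```python
from collections import Counter


class TokenStats:
    """Hypothetical per-class token counts, one hit per message."""

    def __init__(self):
        self.messages = Counter()  # class label -> number of trained messages
        self.token_hits = {"ham": Counter(), "spam": Counter()}

    def train(self, text, label):
        self.messages[label] += 1
        # Count each token at most once per message.
        for token in set(text.lower().split()):
            self.token_hits[label][token] += 1


stats = TokenStats()
stats.train("cheap pills online", "spam")
stats.train("meeting notes attached", "ham")
stats.train("cheap flights for the meeting", "ham")

# A token that was never trained as spam has no spam history at all.
print(stats.token_hits["spam"]["cheap"])    # 1
print(stats.token_hits["spam"]["meeting"])  # 0
```

A token with a count of 0 in one class tells the filter nothing about that 
class, which is why both classes have to be trained.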

If you train 1:1, the filter assumes that 50% of your mail is ham and 50% 
is spam. In reality, 96% of it may be spam.

What happens when your spam-to-ham ratio is 100 to 1?
This is an extreme example:

Ham = 100
Spam = 10000
[EMAIL PROTECTED]@: in 50 ham and 100 spam

=> 100 / (50+100) = 66.7%

Every second ham message contains the token, and so does one in 100 spam 
messages. That means that if you get a message with the token, it is spam in 
2 out of 3 cases! That is what Bayesian filtering says: I got a message, it 
has certain tokens, and judging by the history, such messages were spam 
2 times out of 3.
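
The arithmetic of this example can be checked with a few lines of Python. 
The function name token_spam_probability is made up for this mail, and the 
formula is only the simplified count-based estimate used in the example, 
not the calculation SpamAssassin actually performs.

```python
def token_spam_probability(ham_with_token, spam_with_token):
    """Share of trained messages containing the token that were spam."""
    total = ham_with_token + spam_with_token
    if total == 0:
        raise ValueError("token was never seen in training")
    return spam_with_token / total


# Counts from the example: token seen in 50 of 100 ham
# and in 100 of 10000 spam messages.
p = token_spam_probability(ham_with_token=50, spam_with_token=100)
print(f"{p:.1%}")  # prints 66.7%
```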

What happens when you train only 100 spam messages to get the ratio to 1:1?

Ham (100% trained) = 100
Spam (1% trained) = 100
[EMAIL PROTECTED]@: in 50 ham and 1 (= 1%) spam (we were lucky and got one 
message with the token)

=> 1 / (50+1) = 1.96%

The Bayesian filter will now say that the message is spam with a probability 
of about 2% and ham with about 98%. The filter is useless: it declares 
everything to be ham. If the ratio is skewed the other way, it declares 
everything to be spam.
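
The effect of training only a fraction of the spam can be sketched in the 
same way. The helper names below are my own, the estimate is the same 
simplified one as in the example, and I assume the trained sample contains 
the token in exactly its expected share of messages.

```python
def token_spam_probability(ham_with_token, spam_with_token):
    """Share of trained messages containing the token that were spam."""
    return spam_with_token / (ham_with_token + spam_with_token)


def expected_hits(messages_with_token, trained_fraction):
    """Expected number of trained messages that contain the token."""
    return messages_with_token * trained_fraction


ham_hits = expected_hits(50, 1.0)         # all 100 ham trained
spam_hits_full = expected_hits(100, 1.0)  # all 10000 spam trained
spam_hits_sub = expected_hits(100, 0.01)  # only 100 of 10000 spam trained

full = token_spam_probability(ham_hits, spam_hits_full)
sub = token_spam_probability(ham_hits, spam_hits_sub)

print(f"all spam trained:  {full:.1%}")  # 66.7%
print(f"1% of spam trained: {sub:.1%}")  # 2.0%
```

The token itself has not changed, only the training sample has, and the 
estimate drops from about 67% to about 2%.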

> Doesn't that mean that the expiration system that SA employs solves that
> problem?

No. Expiration only reduces the size of the database. It drops unused tokens 
that haven't appeared in a message for a long time. So if you don't train 
spam, you will eventually have no spammy tokens left, which spoils the filter.
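
As a rough illustration of what expiration does, here is a toy sketch. The 
function expire, the token names and the time units are all hypothetical. 
SpamAssassin's real expiry logic is more involved and is not reproduced 
here.

```python
def expire(last_seen, now, max_age):
    """Keep only tokens that were seen within max_age time units of now."""
    return {tok: t for tok, t in last_seen.items() if now - t <= max_age}


# Hypothetical database: token -> time it was last seen in a message.
last_seen = {"pills": 10, "meeting": 95, "invoice": 99}

kept = expire(last_seen, now=100, max_age=30)
print(sorted(kept))  # ['invoice', 'meeting']
```

The spam token "pills" is dropped because it was not seen recently. 
Expiration removes information, it never adds any.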

Regards

Thomas Arend

[..]
-- 
icq:133073900
http://www.t-arend.de
