Arlo Gilbert <[EMAIL PROTECTED]> wrote:

> it would appear from the data I'm seeing that Bayes is learning
> the To and Received headers on mails... obviously this seems a
> bit redundant and will only add to the size of the Bayes db,
> without contributing anything to (maybe even harming?) the
> learning of the Bayes engine.
The Bayesian learning uses the last two "Received" lines and 
the "To" line, but I don't see why you think those are 
redundant.  For example, if no spam has ever been sent to a 
particular address before, then its presence in the "To" line 
is a pretty good indicator that the message is not spam, 
whereas the presence of the "info@" address for your domain (or 
some address that's been retired because it gets too much spam) 
may be a fair indicator that the message is spam.  Remember 
that the "To" line can contain multiple addresses.

Also, the "To" line can contain comments, names, and completely 
bogus addresses that spammers put in, and tokens found there 
can be significant indicators of spam.
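A hypothetical tokenizer sketch shows why: display names, comments,
and bogus addresses all yield their own tokens alongside the real
address.  SpamAssassin's real tokenizer is more elaborate; this is
only an illustration:

    import re

    # Hypothetical sketch: split a To: header into word- and
    # address-like tokens.  Not SpamAssassin's actual tokenizer.
    def tokenize_to_header(value):
        return [t.lower() for t in re.findall(r"[A-Za-z0-9.!%_@-]+", value)
                if len(t) > 2]

    tokenize_to_header('"Valued Customer" <bogus123@example.invalid>, info@example.com')
    # -> ['valued', 'customer', 'bogus123@example.invalid', 'info@example.com']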

And the size of the Bayes DB is limited anyway, with the least 
significant tokens being purged periodically, so these extra 
tokens won't bloat the database indefinitely.
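Very roughly, the expiry idea looks like this (SpamAssassin's real
mechanism works from token access times and a configurable maximum
DB size; the sketch below, including the field layout, is just a toy):

    # Toy illustration of token expiry: once the store exceeds a cap,
    # drop the tokens seen least recently.  Field layout is hypothetical.
    def expire_tokens(db, max_tokens):
        # db maps token -> (spam_count, ham_count, last_seen)
        if len(db) <= max_tokens:
            return db
        newest_first = sorted(db.items(), key=lambda kv: kv[1][2], reverse=True)
        return dict(newest_first[:max_tokens])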

The people who developed the Bayes tokenization for SA analyzed 
how effective various strategies are, and I'm inclined to trust 
their results unless you have some better analysis that refutes 
them.

-- 
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC
