-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Keith C. Ivey writes:
> Arlo Gilbert <[EMAIL PROTECTED]> wrote:
> 
> > It would appear from the data I'm seeing that Bayes is learning
> > the To and Received headers on mails... Obviously this seems a
> > bit redundant and will only add to the size of the Bayes DB,
> > without contributing anything to (maybe even harming?) the
> > learning of the Bayes engine.
> 
> The Bayesian learning uses the last two "Received" lines and 
> the "To" line, but I don't see why you think those are 
> redundant.  For example, if no spam has ever been sent to a 
> particular address before, then its presence in the "To" line 
> is a pretty good indicator that the message is not spam, 
> whereas the presence of the "info@" address for your domain (or 
> some address that's been retired because it gets too much spam) 
> may be a fair indicator that the message is spam.  Remember 
> that the "To" line can contain multiple addresses.
> 
> Also, the "To" line can contain comments, names, and completely 
> bogus addresses that spammers put in, and tokens found there 
> can be significant indicators of spam.

Yep.  One important point is that Received and To headers often
contain forged data inserted by spam tools, and that forged data
often yields very useful tokens.  For example, a Windows-style,
non-RFC-822-compliant date format with "AM"/"PM" tokens in a
Received header is a good spam sign.
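
To make that concrete, here is a minimal sketch of header
tokenizing.  It is not SpamAssassin's actual tokenizer: the
function name, the "To:"/"Received:" token prefixes, the token
regex, and the choice of which two Received lines to keep are all
assumptions made for illustration.

```python
import re
from email import message_from_string

# Hypothetical token pattern; the real tokenizer's rules differ.
TOKEN_RE = re.compile(r"[A-Za-z0-9@._-]+")


def header_tokens(raw_message, received_limit=2):
    """Return prefixed tokens from the To and Received headers.

    Assumption: we keep the first `received_limit` Received lines
    as they appear in the message; which lines a real filter keeps
    may differ.
    """
    msg = message_from_string(raw_message)
    tokens = []
    for value in msg.get_all("To", []):
        # Names, comments and bogus addresses all become tokens.
        tokens += ["To:" + t.lower() for t in TOKEN_RE.findall(value)]
    for value in msg.get_all("Received", [])[:received_limit]:
        # Case is kept, so a forged "PM" stays distinct from "pm".
        tokens += ["Received:" + t for t in TOKEN_RE.findall(value)]
    return tokens


if __name__ == "__main__":
    sample = (
        "Received: from a by b; Mon, 6 Oct 2003 10:15:00 PM\n"
        "To: Info <info@example.com>\n"
        "Subject: hi\n"
        "\n"
        "body\n"
    )
    print(header_tokens(sample))
```

Prefixing each token with its header name keeps "PM" in a
Received line separate from "PM" in the body, so the classifier
can learn that the former is suspicious.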

> And the size of the Bayes DB is limited anyway, with the least 
> significant tokens being purged periodically, so the added 
> tokens aren't increasing the size.
> 
> The people who developed the Bayes tokenizing for SA have done 
> analysis on how effective various strategies are, and I'm 
> inclined to trust their analysis unless you have some better 
> analysis that refutes it.

Yeah -- every tweak to Bayes gets a 10-fold cross-validation
testing run, to see if it helps or not.  Sometimes a tweak helps,
sometimes it doesn't -- which can be counter-intuitive until
you examine the results closely.
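
For anyone unfamiliar with the technique, a 10-fold run can be
sketched roughly as below.  This is a generic illustration, not
the SpamAssassin test harness; `k_fold_splits`, `cross_validate`,
`train_fn` and `score_fn` are hypothetical names.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item is tested exactly once."""
    # Deal the items into k folds round-robin.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, train_fn, score_fn, k=10):
    """Train on k-1 folds, score on the held-out fold, average the scores."""
    scores = []
    for train, test in k_fold_splits(items, k):
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    corpus = list(range(20))  # stand-in for a list of labelled messages
    for train, test in k_fold_splits(corpus, k=10):
        print(len(train), len(test))
```

To judge a tweak, one would run `cross_validate` twice over the
same corpus, once with and once without the change, and compare
the averaged scores.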

The reports on these runs can be found in the SpamAssassin-devel
list archives -- from months ago, unfortunately ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh CVS

iD8DBQE/kxL3QTcbUG5Y7woRAiEiAJ0SgEobMrK4Or5YlME9u0z6J0sZRgCg4b4u
qv8g00D+C3oYPRwk7UKjaKE=
=ft6Q
-----END PGP SIGNATURE-----



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
