On Fri, 20 Mar 2015 17:09:29 -0400
"Kevin A. McGrail" <kmcgr...@pccc.com> wrote:

> And I've heard arguments for and against removing the poisoning 
> information.  YMMV.

I think it seldom pays to be too clever with Bayes.  If (and this is a
big if) you have a large enough sample of mail, in our experience it's
better just to shovel it all into Bayes than to be selective about
what you present to Bayes.  The Bayes algorithms are usually pretty
good at picking out the signal from the noise.

Bayes expiry is a tricky thing.  To do expiry in a way that can be justified
mathematically, you really should expire messages, not individual tokens.
Otherwise, you're skewing the probabilities: token counts shrink while the
message totals they are measured against stay the same.  Doing it properly
is unwieldy because you have to remember all the messages (or at least, all
the tokens in the messages) going back over your expiry window.
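
To illustrate message-level expiry, here is a minimal Python sketch.  It is
not SpamAssassin code and not the implementation described in this post; the
class name WindowedCounts and its methods are hypothetical.  It assumes
messages are learned in timestamp order and that a token is counted at most
once per message.

```python
from collections import Counter, deque


class WindowedCounts:
    """Token counts for one class (spam or ham), expired per message.

    Hypothetical illustration: names and structure are assumptions.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.messages = deque()        # (timestamp, tokens), oldest first
        self.token_counts = Counter()  # token -> number of messages containing it
        self.n_messages = 0            # denominator for token probabilities

    def learn(self, timestamp, tokens):
        toks = frozenset(tokens)       # count each token once per message
        self.messages.append((timestamp, toks))
        self.token_counts.update(toks)
        self.n_messages += 1

    def expire(self, now):
        # Remove whole messages, so token counts and the message total
        # fall together and the ratios stay consistent.
        while self.messages and self.messages[0][0] < now - self.window:
            _, toks = self.messages.popleft()
            self.token_counts.subtract(toks)
            self.n_messages -= 1
        # Unary plus drops tokens whose count has reached zero.
        self.token_counts = +self.token_counts


w = WindowedCounts(window_seconds=100)
w.learn(0, ["viagra", "free"])
w.learn(50, ["free", "meeting"])
w.expire(now=120)                      # the message from t=0 falls out
```

The deque of per-message token sets is exactly the bookkeeping cost
mentioned above: every message in the window has to be remembered.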

Twice a day, we build a brand new Bayes database from scratch containing
only the messages we've seen in the last 14 days.  The database contains
tokens from about 5.1 million spams and 4.5 million hams, totalling about
18 million tokens.
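
The rebuild-from-scratch idea can be sketched in a few lines of Python.
This is an assumption-laden illustration, not the actual build process: the
function rebuild, the (timestamp, label, tokens) corpus layout, and the dict
used as a "database" are all hypothetical.

```python
from collections import Counter

DAY = 86400  # seconds


def rebuild(corpus, now, window_days=14):
    """Build fresh counts from messages no older than window_days.

    corpus: iterable of (timestamp, label, tokens), label is "spam" or "ham".
    Hypothetical layout, chosen for illustration only.
    """
    cutoff = now - window_days * DAY
    db = {"spam": Counter(), "ham": Counter(), "nspam": 0, "nham": 0}
    for timestamp, label, tokens in corpus:
        if timestamp < cutoff:
            continue                   # outside the window: never enters the db
        db[label].update(set(tokens))  # count each token once per message
        db["n" + label] += 1
    return db


now = 30 * DAY
corpus = [
    (10 * DAY, "spam", ["cheap", "pills"]),    # 20 days old, skipped
    (20 * DAY, "spam", ["cheap", "watches"]),  # 10 days old, kept
    (29 * DAY, "ham", ["meeting", "agenda"]),  # 1 day old, kept
]
db = rebuild(corpus, now)
```

Because nothing is ever deleted from an existing database, no expiry logic
is needed at all; old messages simply do not appear in the next build.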

Obviously, for this to work, you need a large message volume and a large
number of people marking stuff as ham vs. spam.  It's probably not a feasible
approach for small-to-medium SpamAssassin installations.

Regards,

David.
