On Fri, 20 Mar 2015 17:09:29 -0400 "Kevin A. McGrail" <kmcgr...@pccc.com> wrote:
> And I've heard arguments for and against removing the poisoning
> information. YMMV.

I think it seldom pays to be too clever with Bayes. If (and this is a big if) you have a large enough sample of mail, in our experience it's better just to shovel it all into Bayes than to be selective about what you present to Bayes. The Bayes algorithms are usually pretty good at picking out the signal from the noise.

Bayes expiry is a tricky thing. To do expiry in a way that can be justified mathematically, you really should expire messages, not individual tokens. Otherwise, you're skewing the probabilities. Doing it properly is unwieldy because you have to remember all the messages (or at least, all the tokens in the messages) going back over your expiry window.

What we do is twice a day, we build a brand new Bayes database from scratch containing messages we've seen in the last 14 days. The database contains tokens from about 5.1 million spams and 4.5 million hams, totalling about 18 million tokens.

Obviously, for this to work, you need a large message volume and a large number of people marking stuff as ham vs. spam. It's probably not a feasible approach for small-to-medium SpamAssassin installations.

Regards,

David.
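
A minimal sketch of message-level expiry done as a rebuild from scratch, assuming each message is stored with its arrival time, its ham/spam label, and its set of tokens. The `Message` type and the `rebuild` function are hypothetical names used for illustration and are not part of SpamAssassin; the 14-day window is the one mentioned above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one classified message."""
    received: datetime
    is_spam: bool
    tokens: frozenset  # distinct tokens seen in the message


def rebuild(messages, now, window=timedelta(days=14)):
    """Build fresh counts from every message inside the expiry window."""
    cutoff = now - window
    spam_tokens, ham_tokens = Counter(), Counter()
    nspam = nham = 0
    for msg in messages:
        if msg.received < cutoff:
            continue  # the whole message expires, never individual tokens
        if msg.is_spam:
            nspam += 1
            spam_tokens.update(msg.tokens)  # each token counted once per message
        else:
            nham += 1
            ham_tokens.update(msg.tokens)
    return spam_tokens, ham_tokens, nspam, nham


# Example: the 20-day-old spam falls outside the window and drops out entirely.
now = datetime(2015, 3, 20)
msgs = [
    Message(now - timedelta(days=20), True, frozenset({"viagra", "free"})),
    Message(now - timedelta(days=3), True, frozenset({"free", "offer"})),
    Message(now - timedelta(days=1), False, frozenset({"meeting", "free"})),
]
spam_tokens, ham_tokens, nspam, nham = rebuild(msgs, now)
```

Because the message totals and the per-token counts are derived from the same set of messages, the two stay consistent with each other. Expiring tokens one at a time would lower a token's count without lowering the message totals it is compared against.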