[SAtalk] 2.60: Bayes false positives - better training?

Jay Levitt Sat, 16 Aug 2003 12:21:56 -0700

I know that the calculations have changed in 2.60 to be more accurate, and that one result of this is that messages tend toward 0%, 50%, and 99% instead of the spread they used to have. I also realize that the scores have not yet been adjusted to account for this.

That said, I seem to get many more false positives with 2.60 than I did with 2.55. A quick check of my inbox, with ~9000 messages in it, shows that only 7 messages scored Bayes > 50% under all previous versions over the past few months, versus 62 under 2.60 since June 29. Of those 62, 23 hit BAYES_99. (None of these are spam.)
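For anyone wanting to repeat this kind of check, here is a minimal Python sketch of one way to do the count. It assumes each message carries an X-Spam-Status header whose tests= list contains a rule named BAYES_nn, and it treats the nn bucket number as a rough stand-in for the Bayes percentage. The function names and the threshold parameter are illustrative, not part of SpamAssassin.

```python
import re
from email import message_from_string

# Matches rule names of the assumed form BAYES_nn (e.g. BAYES_60, BAYES_99).
BAYES_RULE = re.compile(r"\bBAYES_(\d{2})\b")


def bayes_bucket(raw_message):
    """Return the nn from a BAYES_nn rule in X-Spam-Status, or None."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = BAYES_RULE.search(status)
    return int(match.group(1)) if match else None


def summarize(raw_messages, threshold=50):
    """Count messages above the threshold bucket, and BAYES_99 hits."""
    buckets = [bayes_bucket(m) for m in raw_messages]
    above = sum(1 for b in buckets if b is not None and b > threshold)
    top = sum(1 for b in buckets if b == 99)
    return {"total": len(buckets), "above": above, "bayes_99": top}
```

Passing the raw text of every message in a folder to summarize() gives the total, the number above the threshold, and the number of BAYES_99 hits, which are the three figures quoted above.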

I'm using a site-wide Bayes database, since I'm the only user. Every time I upgrade SA, I wipe out the Bayes database and re-train SA on my inbox, on some spam-free archives of my mailing lists, on the "probably spam" folder (5 < score < 10) that I've weeded of ham, and on archives from my spamtrap mailboxes; about 30,000 messages total. In the most recent upgrade, of course, the new expiration algorithm meant that most of the recently-learned words were instantly discarded when I rebuilt the database. Every night, SA relearns the probably-spam folder as spam, and relearns as ham the "Ham" folder where I move any messages miscategorized as spam. Spamtrap mailboxes run spamassassin -r, which doesn't actually report anywhere but which I believe learns the message as spam. Aside from the Ham folder and auto-learning, there's no regular relearning of ham.
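To make the nightly routine concrete, here is a rough Python sketch of what such a job might look like. It assumes the folders are stored in mbox format and that sa-learn is on the PATH; the folder paths are placeholders, and the helper names are made up for illustration. It defaults to a dry run that only returns the commands it would execute.

```python
import subprocess

# Placeholder paths: the real folder locations are not given in this post.
NIGHTLY_FOLDERS = [
    ("spam", "/path/to/probably-spam"),
    ("ham", "/path/to/Ham"),
]


def sa_learn_command(kind, mbox_path):
    """Build the sa-learn argument list for one mbox folder."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--mbox", mbox_path]


def run_nightly(folders=NIGHTLY_FOLDERS, dry_run=True):
    """Return the commands; run them too when dry_run is False."""
    commands = [sa_learn_command(kind, path) for kind, path in folders]
    if not dry_run:
        for cmd in commands:
            # check=True raises if sa-learn exits with an error status.
            subprocess.run(cmd, check=True)
    return commands
```

Calling run_nightly() with the defaults just lists the two commands, which is a cheap way to confirm which folder is being learned as spam and which as ham before anything touches the database.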

How can I do a better job of training bayes?

Jay Levitt
