I know that the Bayes calculations changed in 2.60 to be more accurate, and that one result is that message probabilities now cluster around 0%, 50%, and 99% instead of being spread out as they used to be.  I also realize that the rule scores have not yet been adjusted to account for this.
 
That said, I seem to get many more false positives with 2.60 than I did with 2.55.  A quick check of my inbox, with ~9000 messages in it, shows that there were 7 messages with Bayes > 50% in all previous versions over the past few months, and 62 with 2.60 since June 29 - and 23 of those had BAYES_99.  (None of these are spam.)
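
A check like this can be scripted.  The sketch below is illustrative only, not the exact method used for the numbers above.  It assumes the inbox is an mbox file, that each message carries an X-Spam-Status header with a tests= list, and that the Bayes rules are named BAYES_nn, with nn being the probability bucket.  The function names are made up for the example.

```python
import mailbox
import re
import sys

# Assumption: Bayes rules appear in the header as BAYES_nn (e.g. BAYES_60, BAYES_99).
BAYES_RULE = re.compile(r"\bBAYES_(\d{2})\b")


def bayes_bucket(status_header):
    """Return the numeric suffix of the first BAYES_nn rule in a header value, or None."""
    if not status_header:
        return None
    match = BAYES_RULE.search(status_header)
    return int(match.group(1)) if match else None


def count_high_bayes(status_headers, threshold=50):
    """Return (messages with bucket > threshold, messages that hit BAYES_99)."""
    above = 0
    at_99 = 0
    for header in status_headers:
        bucket = bayes_bucket(header)
        if bucket is None:
            continue  # no Bayes rule fired, or the message was never scanned
        if bucket > threshold:
            above += 1
        if bucket == 99:
            at_99 += 1
    return above, at_99


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python count_bayes.py /path/to/inbox.mbox
    box = mailbox.mbox(sys.argv[1])
    headers = [str(msg.get("X-Spam-Status", "")) for msg in box]
    above, at_99 = count_high_bayes(headers)
    print(f"{above} messages above BAYES_50, {at_99} of them BAYES_99")
```

Running it separately over mail scanned by each SA version would give the before/after comparison; the script itself does not separate messages by version or date.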
 
I'm using a site-wide bayes file, since I'm the only user.  Every time I upgrade SA, I wipe out the bayes database and re-train SA on my inbox, on some spam-free archives of my mailing lists, on the "probably spam" folder (5<score<10) that I've weeded of ham, and on archives from my spamtrap mailboxes; about 30,000 messages total.  In the most recent upgrade, of course, the new expiration algorithm meant that most of the recently-learned words were instantly discarded when I rebuilt the database.

Every night, SA relearns the probably-spam folder as spam, and it relearns as ham the "Ham" folder, where I move any ham that was miscategorized as spam.  Spamtrap mailboxes run spamassassin -r, which doesn't actually report anywhere in my setup but which I believe learns the message as spam.  Aside from the ham folder and auto-learning, there's no regular relearning of ham.
 
How can I do a better job of training bayes?
 
Jay Levitt
