I know that the Bayes calculations changed in 2.60
to be more accurate, and that one result of this is that messages tend toward
0%, 50%, and 99% instead of the spread they used to have. I also realize
that the scores have not yet been adjusted to account for this.
That said, I seem to get many more false positives
with 2.60 than I did with 2.55. A quick check of my inbox, with ~9000
messages in it, shows that there were 7 messages with Bayes > 50% in all
previous versions over the past few months, and 62 with 2.60 since June 29 -
and 23 of those had BAYES_99. (None of these are spam.)
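A count like the one above can be reproduced with a short script. This is only a minimal sketch under stated assumptions: that the inbox is in mbox format, and that each message still carries an X-Spam-Status header whose tests= list names the BAYES_NN rule that fired. The function names and the default threshold are made up for illustration, and the rule suffix is treated as an approximate probability.

```python
import mailbox
import re
from collections import Counter

# Matches rule names such as BAYES_50 or BAYES_99 in a tests= list.
BAYES_RULE = re.compile(r"\bBAYES_(\d{2})\b")


def bayes_bucket(status_header):
    """Return the numeric suffix of the first BAYES_NN rule, or None."""
    match = BAYES_RULE.search(str(status_header or ""))
    return int(match.group(1)) if match else None


def count_bayes_hits(mbox_path, threshold=50):
    """Count messages whose BAYES_NN suffix is strictly above threshold."""
    counts = Counter()
    for message in mailbox.mbox(mbox_path):
        # Assumption: the header is named X-Spam-Status, as SA writes it.
        bucket = bayes_bucket(message.get("X-Spam-Status"))
        if bucket is not None and bucket > threshold:
            counts[f"BAYES_{bucket:02d}"] += 1
    return counts
```

Calling count_bayes_hits on an inbox file returns a per-rule tally, so the BAYES_99 hits can be read off separately from the rest.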
I'm using a site-wide bayes file, since I'm the
only user. Every time I upgrade SA, I wipe out the bayes database and
re-train SA on my inbox, on some spam-free archives of my mailing lists, on the
"probably spam" folder (5<score<10) that I've weeded of ham, and on
archives from my spamtrap mailboxes; about 30,000 messages total. In the
most recent upgrade, of course, the new expiration algorithm meant that most of
the recently-learned words were instantly discarded when I rebuilt the
database. Every night, SA relearns the probably-spam folder, and every
night it relearns the "Ham" folder where I move any ham that was miscategorized as spam.
Spamtrap mailboxes run spamassassin -r, which doesn't actually report anywhere
but which I believe learns the message as spam. Aside from the ham
folder and auto-learning, there's no regular relearning of ham.
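The nightly relearning described above amounts to roughly the following. This is a sketch, not the actual cron job: the folder paths are hypothetical, and it assumes both folders are mbox files so that sa-learn's --mbox option applies. The default is a dry run that only prints the commands.

```python
import subprocess

# Hypothetical folder locations; the real paths are not given here.
PROBABLY_SPAM_MBOX = "mail/probably-spam"
HAM_MBOX = "mail/Ham"


def nightly_commands(spam_mbox=PROBABLY_SPAM_MBOX, ham_mbox=HAM_MBOX):
    """Build the sa-learn invocations for one nightly pass."""
    return [
        ["sa-learn", "--spam", "--mbox", spam_mbox],  # weeded probably-spam folder
        ["sa-learn", "--ham", "--mbox", ham_mbox],    # mail rescued from the spam folders
    ]


def run_nightly(dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in nightly_commands():
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_nightly(dry_run=True)
```

Laid out this way, the imbalance is easy to see: the spam side is fed a whole folder every night, while the ham side only sees the few messages that were rescued.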
How can I do a better job of training bayes?
Jay Levitt