I am seeing Bayes false-positive misclassifications and am trying to tune things to improve the situation. I am using SpamAssassin to classify mailing list messages, so a lot of mail from a variety of sources feeds SA. And a lot of spam, of course.
Periodically, though not very often (every year or so), the Bayes engine seems to go bad. It classifies too much mail as BAYES_95 or BAYES_99. The messages that trigger this always seem normal. I do feed all of those corrections back through 'sa-learn --ham', but even so things seem to stay stuck in the bad mode. In the past I have cleared the database and started over, which does solve the problem for another year or so. It is stuck again now.

To mitigate the problem I have significantly reduced the scores for BAYES_95 and BAYES_99 so that they have less overall impact, and I have raised required_hits so that false positives are reduced. But then I miss out on the goodness! And it doesn't seem to be recovering. This time I decided I would investigate further (instead of clearing and restarting) and perhaps learn something.

There is a bunch of data at the end of this message, but here is the key part (sa-learn --dump magic):

    0.000 0 1557 0 non-token data: nspam
    0.000 0 57088 0 non-token data: nham
    0.000 0 179777 0 non-token data: ntokens

I am quite surprised to see the "nspam" number reported so very low! Something seems wrong. (Of course it is best if nham and nspam are about the same.) Trust me that there is plenty of spam coming through. The Bayes dump magic for my own personal email, which is in a good happy state, shows this:

    0.000 0 19848 0 non-token data: nspam
    0.000 0 63162 0 non-token data: nham
    0.000 0 167947 0 non-token data: ntokens

So something seems broken for the bad one to have such a low nspam. It definitely has a lot of valid email going through 'sa-learn --ham' every day. It hasn't been reset in recent memory, at least not for a year. So I feel that this is a clue as to the unhealthy state of things now.

I am hoping to learn why a Bayes db can go bad and how to keep it healthier. What is the best way to understand the health of the Bayes database?
I read through the "EXPIRATION" section of the sa-learn man page, but not having looked at the code, some of it was difficult for me to understand. I imagine that the low nspam number is due to tokens having been expired for some reason. In a default SpamAssassin installation without extra configuration, how long would a token hang around in the database, assuming that no new message token "refreshes" it?

Thanks,
Bob

Various details:

    $ ll -hog .spamassassin/bayes_*
    -rw------- 1 12K Feb 7 16:30 .spamassassin/bayes_journal
    -rw------- 1 75M Feb 7 16:29 .spamassassin/bayes_seen
    -rw------- 1 5.0M Feb 7 16:29 .spamassassin/bayes_toks

    $ sa-learn --dump magic
    0.000 0 3 0 non-token data: bayes db version
    0.000 0 1557 0 non-token data: nspam
    0.000 0 57088 0 non-token data: nham
    0.000 0 179777 0 non-token data: ntokens
    0.000 0 1360231447 0 non-token data: oldest atime
    0.000 0 1360277897 0 non-token data: newest atime
    0.000 0 0 0 non-token data: last journal sync atime
    0.000 0 1360274921 0 non-token data: last expiry atime
    0.000 0 43200 0 non-token data: last expire atime delta
    0.000 0 5817 0 non-token data: last expire reduction count

    $ crontab -l | grep sa-learn
    7 * * * * test -d $HOME && sa-learn --force-expire >/dev/null

    $ date -R
    Thu, 07 Feb 2013 16:38:50 -0700

    $ date -R -d@1360231447  # oldest atime
    Thu, 07 Feb 2013 03:04:07 -0700

    $ date -R -d@1360277897  # newest atime
    Thu, 07 Feb 2013 15:58:17 -0700

    $ date -R -d@1360274921  # last expiry atime
    Thu, 07 Feb 2013 15:08:41 -0700

Running on a Debian Stable Squeeze 6.0 system with SpamAssassin 3.3.1, with sa-update run routinely by cron and the SOUGHT ruleset added to the default update channel.
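
In case it helps anyone compare their own numbers, here is a rough sketch of a shell helper that boils the dump down to the figures I keep eyeballing by hand: the ham-to-spam ratio and the span between the oldest and newest token atime. The function name summarize_magic is my own invention (it is not part of SpamAssassin), and it assumes the counter is always the third whitespace-separated field, as in the output above.

```shell
# Summarise `sa-learn --dump magic` output read from stdin.
# summarize_magic is a hypothetical helper name, not a SpamAssassin tool.
# Assumption: the value of each "non-token data" row is field 3.
summarize_magic() {
    awk '
        /non-token data: nspam/        { nspam  = $3 }
        /non-token data: nham/         { nham   = $3 }
        /non-token data: oldest atime/ { oldest = $3 }
        /non-token data: newest atime/ { newest = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            # Skip the ratio when no spam has been learned (avoid divide by zero).
            if (nspam > 0)
                printf "ham:spam ratio = %.1f:1\n", nham / nspam
            # Age spread of the tokens currently in the database, in hours.
            printf "token atime window = %.1f hours\n", (newest - oldest) / 3600
        }
    '
}

# Usage: sa-learn --dump magic | summarize_magic
```

Fed the dump shown above, this reports a ham:spam ratio of 36.7:1 and a token atime window of 12.9 hours.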