Bayes false positive correction tuning

Bob Proulx Thu, 07 Feb 2013 16:13:48 -0800

I am having Bayes false positive misclassifications and am trying to
tune and improve this situation.  I am using SpamAssassin to classify
mailing list messages and so there is a lot of mail from a variety of
sources feeding SA.  And a lot of spam of course.


Periodically, not very often, every year or so, the Bayes engine seems
to go bad.  It classifies too much mail with BAYES_95 or BAYES_99.
The messages that trigger this always seem normal.  I do send all of
those corrections back to 'sa-learn --ham' but even doing so things
seem to stick in the bad mode.  I have in the past cleared the
database and started over which does solve the problem for another
year or so.  It is stuck again now.

To mitigate the problem I have significantly reduced the scores of
BAYES_95 and BAYES_99 so that they have less overall impact and I
raised the required_hits so that false positives are reduced.  But then
I miss out on the goodness!  And it doesn't seem to be recovering.
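
For illustration only, a mitigation of that kind might look like the
following in ~/.spamassassin/user_prefs.  The numbers are placeholders
chosen for the example and are not the values in use on this system.

```
# Illustrative values only (assumed, not taken from this installation).
# Lower the weight of the two strongest Bayes rules.
score BAYES_95 1.5
score BAYES_99 2.0
# Raise the spam threshold.  required_score is the current name of the
# option; required_hits is the older spelling of the same setting.
required_score 6.0
```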

This time I decided I would investigate further (instead of clearing
and restarting) and perhaps learn something.

There is a bunch of data at the end of this message, but here is the
key part of 'sa-learn --dump magic':

  0.000          0       1557          0  non-token data: nspam
  0.000          0      57088          0  non-token data: nham
  0.000          0     179777          0  non-token data: ntokens

I am quite surprised to see the "nspam" number reported as so very
low!  Something seems wrong.  (Of course it is best if the number
of nham and nspam can be about the same.)  Trust me that there is
plenty of spam coming through.  For comparison, the dump magic from
my own personal email SA Bayes database, in a good happy state, shows
this:

  0.000          0      19848          0  non-token data: nspam
  0.000          0      63162          0  non-token data: nham
  0.000          0     167947          0  non-token data: ntokens

And therefore something seems broken that the bad one would have such
a low number of nspam there.  It definitely has a lot of valid email
going through 'sa-learn --ham' every day.  It hasn't been reset in
recent memory, not at least for a year.  So I feel that this is a clue
as to the unhealthy state of things now.
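
To put a number on that imbalance, here is a small sketch that pulls
nspam and nham out of 'sa-learn --dump magic' output and prints how
many ham messages have been learned per spam message.  The function
name bayes_ratio is made up for this example, and the input is the two
dumps quoted above rather than a live database.

```shell
#!/bin/sh
# bayes_ratio (hypothetical helper): read `sa-learn --dump magic` output
# on stdin and report the learned ham and spam message counts.
bayes_ratio() {
    awk '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham  = $3 }
        END {
            if (nspam > 0)
                printf "nspam=%d nham=%d ham_per_spam=%.1f\n", nspam, nham, nham / nspam
            else
                printf "nspam=%d nham=%d ham_per_spam=n/a\n", nspam, nham
        }
    '
}

# Counts copied from the mailing list database dump above.
list_db=$(bayes_ratio <<'EOF'
  0.000          0       1557          0  non-token data: nspam
  0.000          0      57088          0  non-token data: nham
  0.000          0     179777          0  non-token data: ntokens
EOF
)

# Counts copied from the personal database dump above.
personal_db=$(bayes_ratio <<'EOF'
  0.000          0      19848          0  non-token data: nspam
  0.000          0      63162          0  non-token data: nham
  0.000          0     167947          0  non-token data: ntokens
EOF
)

echo "mailing list db: $list_db"
echo "personal db:     $personal_db"
```

Against a live database the same function could be fed directly with
'sa-learn --dump magic | bayes_ratio'.  With the numbers above, the
mailing list database has roughly 37 ham per spam while the personal
one has about 3.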

I am hoping to learn why a Bayes db can go bad and how to keep it
healthier.  What is the best way to understand the health of the Bayes
database?

I read through the "EXPIRATION" section of the sa-learn man page but
not having looked at the code some of it was difficult for me to
understand.  But I imagine that the low number of nspam is due to
tokens having been expired for some reason.

In a default SpamAssassin installation without configuration how long
would a token hang around in the database assuming that no new message
token "refreshes" it?

Thanks,
Bob


Various details:

$ ll -hog .spamassassin/bayes_*
-rw------- 1  12K Feb  7 16:30 .spamassassin/bayes_journal
-rw------- 1  75M Feb  7 16:29 .spamassassin/bayes_seen
-rw------- 1 5.0M Feb  7 16:29 .spamassassin/bayes_toks

$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       1557          0  non-token data: nspam
0.000          0      57088          0  non-token data: nham
0.000          0     179777          0  non-token data: ntokens
0.000          0 1360231447          0  non-token data: oldest atime
0.000          0 1360277897          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1360274921          0  non-token data: last expiry atime
0.000          0      43200          0  non-token data: last expire atime delta
0.000          0       5817          0  non-token data: last expire reduction count

$ crontab -l | grep sa-learn
7 * * * *       test -d $HOME && sa-learn --force-expire >/dev/null

$ date -R
Thu, 07 Feb 2013 16:38:50 -0700

$ date -R -d@1360231447  # oldest atime
Thu, 07 Feb 2013 03:04:07 -0700

$ date -R -d@1360277897  # newest atime
Thu, 07 Feb 2013 15:58:17 -0700

$ date -R -d@1360274921  # last expiry atime
Thu, 07 Feb 2013 15:08:41 -0700
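
As a rough check on those timestamps, the arithmetic below works out
how much time the token access times in the database actually span and
sets that next to the "last expire atime delta".  It only uses the
numbers from the dump above.  The variable names are made up for this
example, and it does not assume anything about how the expiry code
picks its cutoff.

```shell
#!/bin/sh
# Values copied from the `sa-learn --dump magic` output above.
oldest=1360231447     # oldest atime
newest=1360277897     # newest atime
delta=43200           # last expire atime delta, in seconds

# Span between the oldest and newest token access times.
window=$((newest - oldest))
window_h=$((window / 3600))
window_m=$((window % 3600 / 60))

# The expiry delta expressed in whole hours.
delta_h=$((delta / 3600))

echo "token atime window: ${window} s (${window_h}h ${window_m}m)"
echo "last expire delta:  ${delta} s (${delta_h}h)"
```

So every token in this database was last touched within about 12 hours
and 54 minutes of the newest one, which is only a little more than the
12 hour delta used by the last expiry run.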

Running on a Debian Stable Squeeze 6.0 system with SpamAssassin 3.3.1
and with sa-update running routinely by cron with the SOUGHT ruleset
added to the default update channel.
