On Mon, 21 May 2007, Fletcher Mattox wrote:
Hi,
After years of stability, my bayes db is doing poorly. When I first
noticed it, it was classifying lots of ham as BAYES_99, so I cleared the db
and started over. Now it finds *very* few ham.
0.000 0 3 0 non-token data: bayes db version
0.000 0 14779 0 non-token data: nspam
0.000 0 86 0 non-token data: nham
0.000 0 231925 0 non-token data: ntokens
0.000 0 1177142672 0 non-token data: oldest atime
0.000 0 1179789654 0 non-token data: newest atime
0.000 0 1179789681 0 non-token data: last journal sync atime
0.000 0 1179761284 0 non-token data: last expiry atime
0.000 0 43200 0 non-token data: last expire atime delta
0.000 0 90881 0 non-token data: last expire reduction count
I've seen people report large spam/ham ratios on this list, but this
seems extreme, >170:1. So I added about 500 ham (I am sure of the
quality) to the db with "sa-learn --ham", hoping that would help.
But it is still behaving poorly: over 20% of my ham is BAYES_99.
(Normally less than 1% of my ham is BAYES_99.)
Does anyone know why my system can't find any ham? It's a fairly typical
university site of about 10000 messages/day with a 50/50 ham/spam ratio,
so I know it is receiving plenty of ham. Running 3.2.0 if it matters.
Do you have custom values for bayes_auto_learn_threshold_nonspam and
bayes_auto_learn_threshold_spam? If you do, the values may be loose enough
that far more spam than ham gets auto-learned.
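For comparison, the stock settings look roughly like this. The values below are the defaults as I remember them from the Mail::SpamAssassin::Conf documentation, so check the docs for your version before relying on them:

```
# local.cf fragment: auto-learn defaults (from memory, verify against your docs)
bayes_auto_learn 1
# messages scoring below this are auto-learned as ham
bayes_auto_learn_threshold_nonspam 0.1
# messages scoring above this are auto-learned as spam
bayes_auto_learn_threshold_spam 12.0
```

Setting bayes_auto_learn to 0 turns auto-learning off entirely.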
I don't run with auto-learn anymore. I tried using it site-wide, and even
with extremely loose values things still got out of whack. Spam was
learned as ham and vice versa. It was a nightmare to keep track of and
correct. With the traffic I see, site-wide auto-learn will have learned
the default minimum number of messages Bayes needs before it starts
filtering within six hours.
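
As a quick sanity check on the spam/ham ratio quoted above, here is a minimal Python sketch that pulls the counters out of the dump lines and computes the ratio. The sample text is copied from the dump in this thread; the helper names parse_magic and spam_ham_ratio are made up for illustration and are not part of SpamAssassin.

```python
import re

# Sample lines copied from the dump quoted in this thread.
DUMP = """\
0.000 0 14779 0 non-token data: nspam
0.000 0 86 0 non-token data: nham
0.000 0 231925 0 non-token data: ntokens
"""

# Assumed line layout: four whitespace-separated fields, then
# "non-token data: <name>". The third field carries the value.
MAGIC_LINE = re.compile(
    r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s+(.+?)\s*$"
)


def parse_magic(text):
    """Return {counter name: integer value} for each 'non-token data' line."""
    counts = {}
    for line in text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def spam_ham_ratio(counts):
    """Ratio of learned spam to learned ham messages."""
    return counts["nspam"] / counts["nham"]


counts = parse_magic(DUMP)
ratio = spam_ham_ratio(counts)
print(f"nspam={counts['nspam']} nham={counts['nham']} ratio={ratio:.1f}:1")
```

With the numbers above this comes out a little under 172:1, which matches the ">170:1" figure in the original message.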