On Mon, 21 May 2007, Fletcher Mattox wrote:

Hi,

After years of stability, my bayes db is doing poorly.  When I first
noticed it, it was classifying lots of ham as BAYES_99, so I cleared the
db and started over.  Now it finds *very* few ham.

0.000          0          3          0  non-token data: bayes db version
0.000          0      14779          0  non-token data: nspam
0.000          0         86          0  non-token data: nham
0.000          0     231925          0  non-token data: ntokens
0.000          0 1177142672          0  non-token data: oldest atime
0.000          0 1179789654          0  non-token data: newest atime
0.000          0 1179789681          0  non-token data: last journal sync atime
0.000          0 1179761284          0  non-token data: last expiry atime
0.000          0      43200          0  non-token data: last expire atime delta
0.000          0      90881          0  non-token data: last expire reduction count

I've seen people report large spam/ham ratios on this list, but this
seems extreme,  >170:1.  So I added about 500 ham (I am sure of the
quality) to the db with "sa-learn --ham", hoping that would help.
But it is still behaving poorly: over 20% of my ham is BAYES_99.
(Normally less than 1% of my ham is BAYES_99.)
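
The ratio quoted above can be checked directly from the dump. The following is a minimal Python sketch, assuming the listing is `sa-learn --dump magic` output in the column layout shown; the helper names `parse_magic` and `spam_ham_ratio` are made up for illustration and are not part of SpamAssassin.

```python
import re

# Sample lines copied from the dump in the post (assumed column layout:
# probability, spam count, ham count/value, atime, label).
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      14779          0  non-token data: nspam
0.000          0         86          0  non-token data: nham
0.000          0     231925          0  non-token data: ntokens
"""

# The third column holds the value for each "non-token data" entry.
LINE_RE = re.compile(r"\s*\S+\s+\d+\s+(\d+)\s+\d+\s+non-token data:\s*(.+?)\s*$")


def parse_magic(text):
    """Return {label: value} for every 'non-token data' line in the dump."""
    fields = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            fields[match.group(2)] = int(match.group(1))
    return fields


def spam_ham_ratio(fields):
    """Ratio of learned spam to learned ham; infinite if no ham was learned."""
    nham = fields["nham"]
    if nham == 0:
        return float("inf")
    return fields["nspam"] / nham


fields = parse_magic(DUMP)
ratio = spam_ham_ratio(fields)
print(f"nspam={fields['nspam']} nham={fields['nham']} ratio={ratio:.1f}:1")
```

With the counts from the post this prints a ratio of 171.8:1, consistent with the ">170:1" figure.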

Does anyone know why my system can't find any ham?  It's a fairly typical
university site of about 10000 messages/day with a 50/50 ham/spam ratio,
so I know it is receiving plenty of ham.  Running 3.2.0 if it matters.

Do you have custom values for bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam? If you do, they may be loose enough that far more spam than ham gets auto-learned.

I don't run with auto learn anymore. I tried using it site wide, and even with extremely loose values things still got out of whack: spam was learned as ham and vice versa. It was a nightmare to track and correct. With the traffic I see, site-wide auto learn will have learned the default number of messages bayes needs to start filtering within six hours.
