Re: Bayes problem: very large spam/ham ratio

Fletcher Mattox Tue, 22 May 2007 15:30:27 -0700

Duane Hill writes:
>On Mon, 21 May 2007, Fletcher Mattox wrote:
>
>> Hi,
>>
>> After years of stability, my bayes db is doing poorly.  When I first
>> noticed it, it was classifying lots of ham as BAYES_99, so I cleared the db
>> and started over.  Now it finds *very* few ham.
>>
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0      14779          0  non-token data: nspam
>> 0.000          0         86          0  non-token data: nham
>> 0.000          0     231925          0  non-token data: ntokens
>> 0.000          0 1177142672          0  non-token data: oldest atime
>> 0.000          0 1179789654          0  non-token data: newest atime
>> 0.000          0 1179789681          0  non-token data: last journal sync atime
>> 0.000          0 1179761284          0  non-token data: last expiry atime
>> 0.000          0      43200          0  non-token data: last expire atime delta
>> 0.000          0      90881          0  non-token data: last expire reduction count
>>
>> I've seen people report large spam/ham ratios on this list, but this
>> seems extreme,  >170:1.  So I added about 500 ham (I am sure of the
>> quality) to the db with "sa-learn --ham", hoping that would help.
>> But it is still behaving poorly, over 20% of my ham is BAYES_99.
>> (Normally less than 1% of my ham is BAYES_99.)
>>
>> Does anyone know why my system can't find any ham?  It's a fairly typical
>> university site of about 10000 messages/day with a 50/50 ham/spam ratio,
>> so I know it is receiving plenty of ham.  Running 3.2.0 if it matters.
>
>Do you have custom values for bayes_auto_learn_threshold_nonspam and 
>bayes_auto_learn_threshold_spam? If you do, the values may be loose enough 
>to where more spam is detected than ham.


No, I have not changed the thresholds (-1 and 12, respectively).
SA just isn't autolearning ham as easily as it used to.  For
example, I've been running in debug mode all day.  So far, here
are the summaries:

      3 autolearn=ham
   1820 autolearn=no
   1898 autolearn=spam
    403 autolearn=unavailable
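
For anyone who wants to produce the same kind of tally from their own logs, here is a rough sketch. It assumes each scanned message leaves one log line containing `autolearn=<value>`; the function name `tally_autolearn` is made up for illustration and is not part of SpamAssassin.

```python
import re
from collections import Counter

# Assumption: every scanned message produces a log line that
# contains "autolearn=<value>" (ham, spam, no, unavailable, ...).
AUTOLEARN_RE = re.compile(r"autolearn=(\w+)")


def tally_autolearn(lines):
    """Count how often each autolearn outcome appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = AUTOLEARN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (any iterable of lines) returns a `Counter` keyed by outcome, which can be printed in the same layout as the summary above.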

If I sort the debug scores in ascending order, here are the first ten
messages out of thousands:

May 22 05:15:32 smtp spamd[6369]: learn: auto-learn? yes, ham (-6.327 < -1)
May 22 01:18:22 smtp spamd[22597]: learn: auto-learn? yes, ham (-4.299 < -1)
May 22 12:05:31 smtp spamd[16552]: learn: auto-learn? yes, ham (-4.299 < -1)
May 22 04:57:45 smtp spamd[6369]: learn: auto-learn? yes, spam (12 > 12)
May 22 13:16:01 smtp spamd[21713]: learn: auto-learn? yes, spam (12.009 > 12)
May 22 03:37:43 smtp spamd[1066]: learn: auto-learn? yes, spam (12.049 > 12)
May 22 05:42:27 smtp spamd[9085]: learn: auto-learn? yes, spam (12.057 > 12)
May 22 04:36:11 smtp spamd[4877]: learn: auto-learn? yes, spam (12.066 > 12)
May 22 05:35:27 smtp spamd[9085]: learn: auto-learn? yes, spam (12.066 > 12)
May 22 05:25:51 smtp spamd[6369]: learn: auto-learn? yes, spam (12.092 > 12)
        [ thousands of spam lines deleted ]

The first three lines are the only autolearned ham.  That's it.  All day.
There were thousands of hams which were not learned.  Notice the quantum
leap between -4.299 and 12 in the score used by auto-learn (which is
not the same as the final spamassassin score).  Why are there no scores
in this range?!?  It is as if something has added 12 to 16 points to all
but three messages.  My current efforts involve learning why this is
happening.  If you have any clues, I am all ears.
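
The gap can be checked mechanically. The sketch below parses only the `auto-learn? yes, ...` debug lines in the format shown above, sorts the scores, and reports the widest jump between neighbouring values. The helper names `parse_autolearn_scores` and `largest_gap` are hypothetical, and lines in any other format are simply skipped.

```python
import re

# Matches debug lines such as:
#   learn: auto-learn? yes, ham (-6.327 < -1)
#   learn: auto-learn? yes, spam (12.009 > 12)
LEARN_RE = re.compile(
    r"auto-learn\? yes, (ham|spam) "
    r"\((-?\d+(?:\.\d+)?) [<>] (-?\d+(?:\.\d+)?)\)"
)


def parse_autolearn_scores(lines):
    """Return (score, verdict) pairs sorted by ascending score."""
    results = []
    for line in lines:
        match = LEARN_RE.search(line)
        if match:
            results.append((float(match.group(2)), match.group(1)))
    return sorted(results)


def largest_gap(scores):
    """Return the (low, high) neighbouring scores with the widest gap."""
    values = sorted(score for score, _ in scores)
    best = None
    for low, high in zip(values, values[1:]):
        if best is None or high - low > best[1] - best[0]:
            best = (low, high)
    return best
```

Run against the ten lines quoted above, `largest_gap` returns the pair `(-4.299, 12.0)`, the jump described in the previous paragraph.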

>I don't run with auto learn anymore. I tried using it site wide and even 
>with extremely loose values, things still got out of whack. Spam was 
>learned as ham and vice versa. It was a nightmare to keep track and 
>correct. With the traffic I see, auto learn site wide will have learned 
>the default amount for bayes to start filtering within six hours.

Yeah, I'll do this, too, if I have to. But autolearning worked fine
for years until last week. And I can't think of any changes I made
to the mail system during that period.  Weird.

Fletcher
