Re: Bayes problem: very large spam/ham ratio

Fletcher Mattox Tue, 22 May 2007 17:49:25 -0700

Andrzej Adam Filip writes:
>Fletcher Mattox wrote:
>> Hi,
>> 
>> After years of stability, my bayes db is doing poorly.  When I first
>> noticed it, it was classifying lots of ham as BAYES_99, so I cleared the db
>> and started over.  Now it finds *very* few ham.
>> 
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0      14779          0  non-token data: nspam
>> 0.000          0         86          0  non-token data: nham
>> 0.000          0     231925          0  non-token data: ntokens
>> 0.000          0 1177142672          0  non-token data: oldest atime
>> 0.000          0 1179789654          0  non-token data: newest atime
>> 0.000          0 1179789681          0  non-token data: last journal sync atime
>> 0.000          0 1179761284          0  non-token data: last expiry atime
>> 0.000          0      43200          0  non-token data: last expire atime delta
>> 0.000          0      90881          0  non-token data: last expire reduction count
>> 
>> I've seen people report large spam/ham ratios on this list, but this
>> seems extreme,  >170:1.  So I added about 500 ham (I am sure of the
>> quality) to the db with "sa-learn --ham", hoping that would help.
>> But it is still behaving poorly: over 20% of my ham is BAYES_99.
>> (Normally less than 1% of my ham is BAYES_99.)
>> 
>> Does anyone know why my system can't find any ham?  It's a fairly typical
>> university site of about 10000 messages/day with a 50/50 ham/spam ratio,
>> so I know it is receiving plenty of ham.  Running 3.2.0 if it matters.
>
>1) Does your MTA (mail server) use DNSBL lists to block spam?
>   Which lists does it use? [abuse sources, DUL]
>2) Do you use greylisting?
>   [in combination with CBL.abuseat.org or a list containing it]
>
>Spamassassin is an effective but costly tool for spam defense.
>It should be used as *the second* line of spam defenses after deploying
>less effective but much less costly defenses such as DNSBL lookups at
>MTA level. Such a deployment scheme should reduce the spam/ham ratio seen
>by spamassassin.


Actually, SA is my third or fourth line of defense, including both
greylisting and DNSBL lists.  While I did not explicitly state this in my
original mail, you could have deduced it from my "50/50 ham/spam ratio".
That share of ham is way too high for an unprotected mail server these days.
It was 10/90 ham/spam before greylisting (our first line).

Fletcher
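
For anyone who wants to check the ">170:1" figure against the dump quoted
above, here is a minimal sketch in Python.  It assumes the counters have been
copied out of the "sa-learn --dump magic" output with the ">> " quote prefix
removed; the function name parse_magic and the SAMPLE variable are made up for
this illustration and are not part of SpamAssassin.

```python
def parse_magic(dump_text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in dump_text.splitlines():
        # Four whitespace-separated columns, then the free-text label.
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[4].startswith("non-token data:"):
            label = fields[4].split(":", 1)[1].strip()
            values[label] = int(fields[2])
    return values


# Counter lines copied from the dump in the quoted message.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      14779          0  non-token data: nspam
0.000          0         86          0  non-token data: nham
0.000          0     231925          0  non-token data: ntokens
"""

counts = parse_magic(SAMPLE)
ratio = counts["nspam"] / counts["nham"]
print(f"spam:ham = {ratio:.1f}:1")  # prints "spam:ham = 171.8:1"
```

With nspam = 14779 and nham = 86 this gives roughly 171.8:1, which matches
the figure in the original message.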
