Re: Bayes advanced questions

Matt Kettler Wed, 10 May 2006 23:09:06 -0700

Michael Monnerie wrote:
> On Wednesday, 10 May 2006 23:41 Matt Kettler wrote:
>   
>> Particularly on servers with a site-wide DB used against broadly
>> diverse spread of mail, increasing the token limit will improve
>> accuracy.
>>
>> However, this comes at the expense of increased storage needs and
>> slower performance. (In particular, expiry takes a LOT longer with
>> larger DBs)
>>     
>
> DB Files are about 60MB together, so not really big (I just got a 
> pricelist with the new 750GB SATA drive from Seagate *g*).
>
> And tonight's expiry for server #1:
> bayes: synced databases from journal in 11 seconds: 1968 unique entries 
> (3059 total entries)
>


That's the journal sync, not the expiry part. The expiry part takes much
longer.
> So it's not too long also. Could possibly be longer on a server that 
> gets some million mails per day, of course.
>
>   
>> score used is the score the message would have got if:
>>      bayes was disabled
>>      the AWL was disabled
>>      no userconf (ie:black/whitelists) rules were enabled.
>>     
>
> That's good info which should be in the man page.
>   

It is. In SA 3.1.x it's in the docs for the AutoLearnThreshold plugin:

http://spamassassin.apache.org/full/3.1.x/dist/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

>   
>> Since that message scored 8.7, and derives 3.5 of its points from
>> BAYES_99, it does not surprise me at all the message was not learned.
>>
>> Also, EVEN if the learning score is over the threshold, SA will not
>> learn a message as spam unless:
>>      there are at least 3.0 points of header rules
>>      there are at least 3.0 points of body rules
>>      Existing learning would not place the message in a low bayes
>> category (ie: don't learn as spam if the message would have hit
>> BAYES_00 otherwise)
>>     
>
> This is written in the man page, except the last line with the BAYES_00 
> wasn't clear to me from there. Is this valid just for BAYES_00 and 
> BAYES_99, or also BAYES_05 and BAYES_95? 
>   
I looked into the code for SA 3.1.0's PerMsgStatus.pm and
Plugin/AutoLearnThreshold.pm.

The limitation is actually applied by totalling the scores of the bayes
rules that hit, not by looking at the actual bayes probability.

Learning as ham will be inhibited if the score of the "learn" rules (ie:
bayes) totals more than +1.0.
Learning as spam will be inhibited if the score of the "learn" rules (ie:
bayes) totals less than -1.0.

Note: by "learn" rules, I mean rules declared with the "learn" tflag,
which at this time is just bayes.

So in SA 3.1.0, going by the rule scores, a message that existing
training already ranks as BAYES_00 or BAYES_05 will not be learned as spam.
A ranking of BAYES_60 or higher will inhibit ham learning.
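
To make the order of the checks concrete, here is a rough sketch in Python
(not SA's actual Perl). The function name, the parameter names and the exact
comparison operators are made up for illustration. The 12.0 and 0.1
thresholds are assumed values for bayes_auto_learn_threshold_spam and
bayes_auto_learn_threshold_nonspam, so check your own config. Only the
3.0-point header/body minimums and the +1.0/-1.0 "learn" rule limits come
from the description above.

```python
from typing import Optional


def autolearn_decision(
    learning_score: float,   # score with bayes, AWL and userconf rules excluded
    header_points: float,    # points contributed by header rules
    body_points: float,      # points contributed by body rules
    learn_points: float,     # total score of rules with the "learn" tflag (bayes)
    spam_threshold: float = 12.0,  # assumed value, not taken from this thread
    ham_threshold: float = 0.1,    # assumed value, not taken from this thread
) -> Optional[str]:
    """Return "spam", "ham", or None when the message is not auto-learned."""
    if learning_score >= spam_threshold:
        # Spam candidates need at least 3.0 points of header rules
        # and at least 3.0 points of body rules.
        if header_points < 3.0 or body_points < 3.0:
            return None
        # Existing training already says "ham": do not learn as spam.
        if learn_points < -1.0:
            return None
        return "spam"

    if learning_score < ham_threshold:
        # Existing training already says "spam": do not learn as ham.
        if learn_points > 1.0:
            return None
        return "ham"

    # Between the two thresholds nothing is learned.
    return None


if __name__ == "__main__":
    # The message from this thread: 8.7 total, 3.5 of it from BAYES_99,
    # so the learning score is roughly 5.2. Header/body points are made up.
    print(autolearn_decision(8.7 - 3.5, 3.0, 3.0, 3.5))
```

With those assumed thresholds the message discussed here falls between the
two limits, so it is not learned either way, whatever bayes says about it.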
>   
>>> Since it's already BAYES_99, you could say
>>> "don't bother, I'll be fine" *g* but bayes needs to be trained
>>> permanently, because tokens time out...
>>>       
>> Also realize that just because the message got BAYES_99 doesn't mean
>> there are no tokens in it that can be learned from. Spam mutates. New
>> phrases and words creep in. These need to be learned from, even if
>> the current message is already BAYES_99.
>>     
>
> Yes, this is very valuable info for others also I believe.
>
> Thanks for your help on this,
> mfg zmi
>
