Michael Monnerie wrote:
> Dear SA users, I've had an offlist comparison of bayes DBs, and we found 
> some interesting differences. We're trying to find out why bayes on 
> server #1 produces better scores:
> 
> Server #1 local.cf (SA 3.1.1):

> Server #1 bayes dump:
> 0.000          0      93053          0  non-token data: nspam
> 0.000          0      53428          0  non-token data: nham
> 0.000          0    1261864          0  non-token data: ntokens

<snip>
> 
> Server #2 bayes dump:
> 0.000          0     155791          0  non-token data: nspam
> 0.000          0      80523          0  non-token data: nham
> 0.000          0     129852          0  non-token data: ntokens
> 
> From the numbers I would say that server #2 has learned more spam+ham, 
> but has only about 1/10th of the tokens. That server is also far less 
> accurate with bayes than server #1. Could the ntokens be the reason? 

Yes, ntokens could be the reason. Another aspect to consider is the training
input. If either server gets manually trained it will have a big leg-up in
accuracy over one that does not.

Particularly on servers with a site-wide DB used against a broadly diverse
spread of mail, increasing the token limit will improve accuracy.

However, this comes at the expense of increased storage needs and slower
performance. (In particular, expiry takes a LOT longer with larger DBs.)
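
For example, on a busy site-wide setup you might put something like this in
local.cf (the 1,000,000 figure is purely illustrative; tune it to your mail
volume and hardware):

    # allow a larger Bayes token database (example value)
    bayes_expiry_max_db_size   1000000
    # optionally take expiry out of the mail path and run it from cron instead
    bayes_auto_expire          0

If you do disable auto-expiry like that, remember to run expiry periodically
yourself, e.g. from cron:

    sa-learn --force-expire --sync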

> With the new spam of the last weeks that tries to poison bayes, could it 
> maybe be effective against the default of 150,000 tokens?

Possibly, but unlikely. Realistically, SA is largely immune to poisoning due to
the chi-squared combining. The size of the DB should matter very little in terms
of poisoning resistance.

> 
> 
> Another tip for all: with server #1's setting of 
> bayes_auto_learn_threshold_spam     8.00
> you would expect this message to be autolearned:
> 
>> X-Spam-Status: Yes, hits=8.7 required=5.0 tests=BAYES_99=3.5, 
>> HTML_MESSAGE=0.001,HTML_MIME_NO_HTML_TAG=0,HTML_TAG_EXIST_TBODY=0.282, 
>> MIME_HTML_ONLY=0.389,RELAY_DE=0.01,REPLY_TO_EMPTY=0.512, 
>> SARE_FORGED_EBAY=4 autolearn=no bayes=1.0000
> 
> But it is autolearn=no.

I would not expect that message to be autolearned. The score used in checking
thresholds is NOT the same as the final message score. The score used is the
score the message would have got if:
        bayes was disabled
        the AWL was disabled
        no userconf (ie:black/whitelists) rules were enabled.

Since that message scored 8.7, and derives 3.5 of its points from BAYES_99, it
does not surprise me at all that the message was not learned.
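
As a rough worked example (the real learning score isn't shown in the headers,
so this is only an estimate): take BAYES_99 back out of the 8.7 total and you
are left with roughly 8.7 - 3.5 = 5.2 points, well under the 8.00 autolearn
threshold configured on server #1.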

Also, EVEN if the learning score is over the threshold, SA will not learn a
message as spam unless:
        there are at least 3.0 points of header rules
        there are at least 3.0 points of body rules
        existing learning would not place the message in a low bayes category
        (ie: don't learn as spam if the message would have hit BAYES_00 otherwise)
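
If you want to see exactly why a particular message wasn't autolearned, running
it through spamassassin with debugging enabled and watching the learn output
usually tells you (command is illustrative, 3.1-style debug areas assumed):

    spamassassin -D learn < message.eml 2>&1 | grep -i learn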




> This shows that manually re-feeding spam can be 
> effective for your Bayes, because this sure-is-spam message would not have 
> been learned automatically. 

Very true. You can definitely get great improvements in accuracy by training
manually.
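
For example (the mailbox paths are just placeholders; pick the format options
that match how your corpus is stored):

    sa-learn --spam --mbox /path/to/missed-spam.mbox
    sa-learn --ham  --mbox /path/to/false-positives.mbox
    sa-learn --sync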

I personally view the autolearner as a supplement to my own training. It is not
a perfect system.


> Since it's already BAYES_99, you could say 
> "don't bother, I'll be fine" *g* but bayes needs to be trained 
> continuously, because tokens time out...


Also realize that just because the message got BAYES_99 doesn't mean there are
no tokens in it that can be learned from. Spam mutates. New phrases and words
creep in. These need to be learned from, even if the current message is already
BAYES_99.

> 
> And why was SARE_FORGED_EBAY set down to 4? It was so nice at 100+...

If I had to guess, FPs (false positives).


> Also, we set bayes_expiry_max_db_size to 50000 and ran
> sa-learn --force-expire --sync
> but we still see these numbers:
>>  0.000          0     242424          0  non-token data: nspam
>>  0.000          0     313252          0  non-token data: nham
>>  0.000          0     134001          0  non-token data: ntokens
> 
> Why are there still 134k tokens?

As per the docs for bayes_expiry_max_db_size, SA will never reduce the bayes DB
below 100,000 tokens, no matter how small you set this value.

-----
bayes_expiry_max_db_size (default: 150000)
    What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or 100,000
tokens, whichever has a larger value. 150,000 tokens is roughly equivalent to a
8Mb database file.
-----

So SA was likely aiming for 100,000. However, SA performs expiry by picking a
"cut-off" atime and dropping all the tokens older than that. It keeps moving the
cut-off in steps, each step expiring more tokens, until the remaining count
would fall under the target; it then uses the previous, less aggressive cut-off.

This effectively prevents bayes from expiring out your whole bayes DB at once if
all the tokens have the same atime.

This stepping approach likely explains the 134k tokens: there are probably more
than 34k tokens all sharing the same atime right at the cut-off mark, and had SA
moved the cut-off one step further, you'd have ended up with fewer than 100k
tokens.
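
If you want to confirm what the last expiry run actually did, the magic dump
shows it (field names quoted from memory; your version may label them slightly
differently):

    sa-learn --dump magic

Look at the "last expire atime delta" and "last expire reduction count" lines;
they show which cut-off SA settled on and how many tokens it really dropped.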



