Aaron Grewell wrote:
> Hi Matt, I'm interested in how your setup compares to mine.  I also find
> Bayes very useful, but I haven't gotten it to work as well as what
> you've described.
> 
> 
>>Interesting.. For me, BAYES_99 is right between SURBL and 
>>URIBL in terms of 
>>hits. (And has 98.91% of URIBL's total hits) I find it completely 
>>indispensable.
>>
> 
> 
> Are you using a single site-wide database, or is this a per-user setup?

Single site-wide. I use MailScanner, which does not support per-user Bayes
databases, but I'm not really looking for that anyway.
> 
> 
>>I rarely train manually, except at initial setup where I feed 
>>it a good 
>>base learning. (the autolearner can sometimes go awry if you 
>>don't train 
>>some mail manually before letting it go.)
>>
> 
> 
> The trouble I had with the autolearner was that some spammers would send
> innocuous mail through to raise their scores until Bayes decided they
> were ok, then start spamming.  That was a couple of versions back, does
> that sort of thing no longer work?


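Erm, that really shouldn't affect the Bayes autolearner, which learns message
tokens rather than sender history. Perhaps you are thinking of the AWL
(auto-whitelist), which does average a sender's past scores? I don't run the
AWL for this very reason.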

>>On a day to day basis I mostly feed automatically with a cronjob that 
>>collects mail via spamtraps and hamtraps. I have that coupled with 
>>autolearning that's set a bit differently than the defaults. (IMNSHO, 
>>having a ham learning threshold that's positive is suicide, 
>>but I also have 
>>a large number of small negative-score rules so I can keep my 
>>threshold at 
>>-0.01 and actually autolearn some ham).
>>
> 
> 
> I'd love to make my Bayesian database more effective, is there a doc
> somewhere that describes how you tuned it to your environment?

Not really.. but it's not hard.

Spamtraps and hamtraps:
-----------------------
1) create a secret "hamtrap" email account. Subscribe this account to
newsletters and news feeds that your users typically subscribe to. Do not post
this address anywhere, and don't use "hamtrap" as the account name; it's too
obvious.

2) create a "spamtrap" account, or several of them. Carefully seed these
addresses in the body of some Usenet and mailing list postings so that address
harvesters pick them up.

3) create a cron-job that auto-feeds the above mail to sa-learn.

Simple example fragment of the script I use (it keeps a rotating archive of the
past 5 learning sessions):

#!/bin/sh
# Learn the spamtrap mailbox, then archive it (keeps the last 5 sessions).
cd /var/spool/training/ || exit 1
mkdir -p spam

if [ -f /var/spool/mail/spamtrap ]; then
 echo "learning spam mailbox - spamtrap"
 mv /var/spool/mail/spamtrap . || exit 1
 /usr/bin/sa-learn --spam --mbox spamtrap
 # rotate: alearn1 is the newest (uncompressed), alearn2-5 are gzipped
 rm -f spam/spamtrap.alearn5.gz
 for n in 4 3 2; do
  [ -f spam/spamtrap.alearn$n.gz ] &&
   mv spam/spamtrap.alearn$n.gz spam/spamtrap.alearn$((n + 1)).gz
 done
 if [ -f spam/spamtrap.alearn1 ]; then
  gzip spam/spamtrap.alearn1
  mv spam/spamtrap.alearn1.gz spam/spamtrap.alearn2.gz
 fi
 mv spamtrap spam/spamtrap.alearn1
fi

4) Carefully monitor the data being fed for a while (two weeks or so) to make
sure there's no pollution. After it's established you can monitor it less often.
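
The fragment in step 3 only covers the spamtrap. As a rough sketch (not the
full script I run), the same learn-and-rotate logic can be wrapped in one
function and called for both traps. The names TRAIN_DIR, MAIL_DIR, SA_LEARN and
learn_trap are made up for this illustration, and it assumes the hamtrap
mailbox is simply called "hamtrap", which you should not do in practice:

```shell
#!/bin/sh
# Sketch only: variable and function names are illustrative assumptions.
TRAIN_DIR=${TRAIN_DIR:-/var/spool/training}
MAIL_DIR=${MAIL_DIR:-/var/spool/mail}
SA_LEARN=${SA_LEARN:-/usr/bin/sa-learn}

# learn_trap CLASS MAILBOX   (CLASS is "spam" or "ham")
learn_trap() {
    class=$1
    box=$2
    [ -f "$MAIL_DIR/$box" ] || return 0        # nothing to learn this run
    mkdir -p "$TRAIN_DIR/$class" || return 1
    mv "$MAIL_DIR/$box" "$TRAIN_DIR/$box" || return 1
    "$SA_LEARN" "--$class" --mbox "$TRAIN_DIR/$box"

    # rotate: alearn1 is newest (uncompressed), alearn2-5 are gzipped
    arch="$TRAIN_DIR/$class/$box.alearn"
    rm -f "${arch}5.gz"
    for n in 4 3 2; do
        if [ -f "$arch$n.gz" ]; then
            mv "$arch$n.gz" "$arch$((n + 1)).gz"
        fi
    done
    if [ -f "${arch}1" ]; then
        gzip "${arch}1" && mv "${arch}1.gz" "${arch}2.gz"
    fi
    mv "$TRAIN_DIR/$box" "${arch}1"
}

learn_trap spam spamtrap
learn_trap ham hamtrap
```

Run it from cron at whatever interval suits your mail volume. Setting
SA_LEARN=true in the environment gives a dry run that rotates the archives
without touching the Bayes database.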

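For the monitoring in step 4, something like the following can help. It is only
a sketch: review_trap is a made-up helper name, and it just counts the From:
and Subject: lines in one archived session so that anything out of place (spam
in the hamtrap in particular) stands out:

```shell
#!/bin/sh
# Sketch only: review_trap is an illustrative name, not part of SpamAssassin.
# Prints each distinct From:/Subject: line with a count, most frequent first.
review_trap() {
    file=$1
    case $file in
        *.gz) gzip -dc "$file" ;;   # older sessions are gzipped
        *)    cat "$file" ;;        # newest session is plain mbox
    esac | grep -E '^(From|Subject): ' | sort | uniq -c | sort -rn
}
```

For example, review_trap /var/spool/training/spam/spamtrap.alearn1 summarizes
the most recent spamtrap session. It matches any line starting with "From: " or
"Subject: ", so quoted headers in message bodies get counted too. That is
fine for a quick eyeball check.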

Autolearn adjustment:
---------------------

1) add bayes_auto_learn_threshold_nonspam -0.01 to your local.cf

2) create a "bayes_hamlearning.cf" file. Create several simple body text rules
with "catch phrases" from your normal nonspam. Assign these rules very small
negative scores (-0.01 to -0.1). This is generally easier in a corporate
environment, but it can be done in an academic one too. For example:

body LOCAL_THESIS       /\bThesis\b/i
score LOCAL_THESIS  -0.01

You have to keep the scores small, as you don't want to use these to whitelist
spam mail. You merely want to make mail that would otherwise score 0 earn a
small negative score if it's got some of these phrases in it. It's not perfect,
but it's better than blindly learning everything under 0.5. I feel learning as
ham should be earned, not a default for not hitting any rules at all.
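
If you have more than a handful of phrases, rules like LOCAL_THESIS can be
generated from a plain list. This is a sketch under a few assumptions:
make_ham_rules and the LOCAL_HAM_ prefix are names invented here, every rule
gets the same -0.01 score, and each phrase is plain words with no regex
metacharacters (nothing is escaped):

```shell
#!/bin/sh
# Sketch only: make_ham_rules and LOCAL_HAM_ are illustrative names.
# Reads one phrase per line on stdin, writes body/score rule pairs to stdout.
make_ham_rules() {
    while IFS= read -r phrase; do
        [ -n "$phrase" ] || continue           # skip blank lines
        # Rule name: uppercase the phrase, replace anything else with "_"
        name=LOCAL_HAM_$(printf '%s' "$phrase" |
            tr '[:lower:]' '[:upper:]' | tr -c 'A-Z0-9' '_')
        printf 'body %s /\\b%s\\b/i\n' "$name" "$phrase"
        printf 'score %s -0.01\n\n' "$name"
    done
}
```

Feed it a phrase file and redirect the output to bayes_hamlearning.cf in your
site rules directory, then run spamassassin --lint to check that the generated
file parses before putting it into production.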

The problem is that this requires some customization. It can't be a default
setup of SA, as the "catch phrases" vary from place to place, and if there were
a default set of them spammers would be sure to always include them, making
them pointless. You'd effectively have the same thing as the current default:
by avoiding spam rules and existing bayes tokens, spammers can get a message
learned as ham.







