Re: Autolearning from rules rather than score

Karsten Bräckelmann Wed, 28 Jan 2009 15:09:29 -0800

On Wed, 2009-01-28 at 22:36 +0000, RW wrote:
> On Wed, 28 Jan 2009 22:02:59 +0100
> Karsten Bräckelmann <guent...@rudersport.de> wrote:

> > > On Wed, 28 Jan 2009, RW wrote:
> > > 
> > > > I was wondering if it's possible to control autolearning based
> > > > on rules.
> > 
> > No. And even tweaking the various thresholds will not help, since
> > auto-learning is based on the score *without* Bayes, etc.
> > 
> > > > I'm scoring DSPAM into Spamassassin, and since DSPAM autolearns 
> > 
> > By that you mean... Using the DSPAM plugin for SA? And the rule you
> > want to base auto-learning upon is the DSPAM plugin one?
> 
> No, is there any point?

Err, then I don't understand the "auto-learning from rules" in your
Subject. What do you mean by that?

> I just pass it through dspam and then score like this:
[...]
> I combine this with some sieve rules that file into Junk and Junk.high
> folders at the scores 5 and 30. Junk.high is effectively discarded. I
> check the Junk folder and move everything to the training folders,
> along with any spam that gets through. Additionally a sieve rule
> autofiles anything over 30 that dspam didn't get into the learn-spam
> folder. 
> 
> That means that every single mail misclassified by dspam's
> autolearning will get reclassified, but it doesn't imply the same for
> Bayes unless Bayes autolearns in line with dspam.  

You have special handling for mails dspam missed. So you are right, it
won't do the same for SA Bayes, unless you get in some equivalent
special handling for SA...

> However, thinking about it a bit more, I think that the only real
> problem is that ham that scores between 0.1 and 5.0 won't be learned
> as ham, and I can fix that by moving the autolearn threshold up to
> 4.9.

Eek!  No, this is wrong and gives me the creeps.

As I've mentioned before (hey, see your quote :), certain rules like
Bayes will NOT be taken into account for the threshold. Also, scores
used for auto-learning evaluation are using a non-Bayes score set. See
the docs.

http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

That means that a mail scoring your threshold PLUS the BAYES_00 score
can be learned as ham. Possibly even much higher, since auto-learn uses
score set 0 or 1... Think about this for a moment.
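
For reference, the change being discussed would look roughly like the
fragment below. It is only a sketch of the proposal, not a
recommendation; the option names and defaults are the ones documented
for the AutoLearnThreshold plugin, and the file is assumed to be your
local.cf.

```
# local.cf: sketch of the proposed change, NOT a recommendation.
# Documented defaults of the AutoLearnThreshold plugin:
#   bayes_auto_learn_threshold_nonspam  0.1
#   bayes_auto_learn_threshold_spam     12.0
bayes_auto_learn_threshold_nonspam 4.9

# The value compared against 4.9 is the auto-learn score. It is taken
# from a non-Bayes score set and leaves out the BAYES_* rules, so it
# can differ from the score the message is finally tagged with.
```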

AFAIK, there is no clean way of tricking SA into learning *everything*
above and below a given threshold.

Also, a certain gray area is better not learned automatically.
Seriously. False learning will *immediately* have an impact on further
results, whereas learning after a manual review is slower but does not
bootstrap even more FNs and FPs out of its own ass.

> BTW am I correct in assuming that my dspam header rules
> in /usr/local/etc/mail/spamassassin/local.cf will contribute to
> autolearning?

Yes. Unless you set tflags noautolearn for your rules. See the above
doc, and section Rule Definitions here:
  http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Conf.html
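
A minimal sketch of what that could look like. The rule name, the
header pattern and the score are made up for illustration (I have not
seen your actual rules), and X-DSPAM-Result is assumed to be the header
your dspam setup adds. Only the tflags line is the point here.

```
# Hypothetical dspam header rule; name, pattern and score are examples.
header   L_DSPAM_SPAM  X-DSPAM-Result =~ /^Spam/
describe L_DSPAM_SPAM  dspam classified this message as spam
score    L_DSPAM_SPAM  2.0

# Keep this rule's score out of the auto-learn decision.
tflags   L_DSPAM_SPAM  noautolearn
```

Without the tflags line, the rule counts toward the auto-learn score
like any other header rule.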

  guenther

-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
