On Thu, 29 Jan 2009 00:08:57 +0100
Karsten Bräckelmann <guent...@rudersport.de> wrote:

> On Wed, 2009-01-28 at 22:36 +0000, RW wrote:
> > On Wed, 28 Jan 2009 22:02:59 +0100
> > Karsten Bräckelmann <guent...@rudersport.de> wrote:

> > > By that you mean... Using the DSPAM plugin for SA? And the rule
> > > you want to base auto-learning upon is the DSPAM plugin one?
> > 
> > No, is there any point?
> 
> Err, then I don't understand the "auto-learning from rules" in your
> Subject. What do you mean by that?

I meant have Bayes learn from the DSPAM header rules that I quoted.
What does the plugin actually do that simply piping mail through DSPAM
before SA doesn't?

> > However, thinking about it a bit more, I think that the only real
> > problem is that ham that scores between 0.1 and 5.0
> > won't be learned as ham, and I can fix that by moving the autolearn
> > threshold up to 4.9.
> 
> Eek!  No, this is wrong and gives me the creeps.

That's probably because you're not seeing the big picture yet.

> As I've mentioned before (hey, see your quote :), certain rules like
> Bayes will NOT be taken into account for the threshold. Also, scores
> used for auto-learning evaluation are using a non-Bayes score set. See
> the docs.
>   
> http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
> 
> That means that a mail scoring your threshold PLUS the BAYES_00 score
> can be learned as ham. Possibly even much higher, since auto-learn
> uses score set 0 or 1... Think about this for a moment.

That's perfectly OK, because it will necessarily get caught as spam and
I'll reclassify it manually as a matter of course.
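
As a rough sketch, the threshold change discussed above would be a single
line of configuration. This assumes the stock AutoLearnThreshold plugin,
whose nonspam threshold defaults to 0.1, and a site-wide local.cf (the
file location is an assumption, not something stated in this thread):

```
# Assumed to live in local.cf; adjust for your installation.
# Default is 0.1. Raising it lets ham that scores just under the 5.0
# spam threshold be auto-learned as ham.
bayes_auto_learn_threshold_nonspam 4.9
```

Note that the score compared against this threshold excludes the BAYES_*
rules, which is the concern raised in the quoted text.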

> AFAIK, there is no clean way of tricking SA into learning *everything*
> above and below a given threshold.
> 
> Also, a certain gray area is better not learned automatically.
> Seriously. False learning *immediately* will have an impact on further
> results. Whereas learning after a manual re-view is slower, but not
> affected by bootstrapping even more FNs and FPs out of its own ass.

Don't forget that the -2.5 score of DSPAM does contribute to
autolearning but BAYES_* doesn't, so it's not going to run away in the
FP direction. And anyway, very few emails score in the region -15 to
+20 with the rules I quoted. It's just that I think they may be
particularly important to learn.

For the most part I keep the Bayes rules just for the meta rule:

meta     DS_HAM_FULL  DS_HAM && (BAYES_00 || BAYES_05)
score    DS_HAM_FULL  -15.0

Which gives almost all ham a score of around -20, but doesn't prevent
SA from catching spams that DSPAM misses. 

DSPAM produces yes/no results, and works on an "innocent until proven
guilty" basis, so it occasionally classifies as ham when it's really
just uncertain. If I simply scored DS_HAM at -20, I'd effectively
prevent SA from deciding anything other than which Junk folder the spam
goes into.
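
To make the arithmetic concrete, here is a sketch of the two alternatives.
It assumes that DS_HAM is the rule carrying the -2.5 DSPAM score mentioned
above, and that BAYES_00 is worth roughly -2.6 as in the stock 3.2.x
Bayes+network score set; neither value is confirmed in this message.

```
# Current approach: DSPAM alone is a modest nudge; the large bonus
# only applies when SA's own Bayes agrees that the mail is ham.
score    DS_HAM       -2.5
meta     DS_HAM_FULL  DS_HAM && (BAYES_00 || BAYES_05)
score    DS_HAM_FULL  -15.0
# DSPAM ham + BAYES_00:    -2.5 + -15.0 + (about -2.6) = about -20
# DSPAM ham, Bayes unsure: -2.5 only, so other rules can still reach 5.0

# Rejected alternative: DSPAM's verdict alone would outweigh nearly
# everything else SA could add.
# score  DS_HAM       -20.0
```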

DSPAM is a *much* better statistical filter overall, but I've noticed
that however good spammers get at obfuscating spam, simple Bayesian
filters remain highly effective at identifying almost all of my ham
beyond reasonable doubt; unidentified email clusters around 0.5, not
0.0 as it does with DSPAM. This makes DSPAM and SA's Bayesian filter
complementary. If the latter actually identifies spam, it's no more
than a bonus.
