On Thu, 29 Jan 2009 00:08:57 +0100 Karsten Bräckelmann <guent...@rudersport.de> wrote:
> On Wed, 2009-01-28 at 22:36 +0000, RW wrote:
> > On Wed, 28 Jan 2009 22:02:59 +0100
> > Karsten Bräckelmann <guent...@rudersport.de> wrote:
> > > By that you mean... Using the DSPAM plugin for SA? And the rule
> > > you want to base auto-learning upon is the DSPAM plugin one?
> >
> > No, is there any point?
>
> Err, then I don't understand the "auto-learning from rules" in your
> Subject. What do you mean by that?

I meant have Bayes learn from the DSPAM header rules that I quoted.

What does the plugin actually do that simply piping mail through DSPAM
before SA doesn't?

> > However, thinking about it a bit more, I think that the only real
> > problem is that ham that scores between 0.1 and 5.0
> > won't be learned as ham, and I can fix that by moving the autolearn
> > threshold up to 4.9.
>
> Eek! No, this is wrong and gives me the creeps.

That's probably because you're not seeing the big picture yet.

> As I've mentioned before (hey, see your quote :), certain rules like
> Bayes will NOT be taken into account for the threshold. Also, scores
> used for auto-learning evaluation are using a non-Bayes score set. See
> the docs.
>
> http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
>
> That means that a mail scoring your threshold PLUS the BAYES_00 score
> can be learned as ham. Possibly even much higher, since auto-learn
> uses score set 0 or 1... Think about this for a moment.

That's perfectly OK, because it will necessarily get caught as spam and
I'll reclassify it manually as a matter of course.

> AFAIK, there is no clean way of tricking SA into learning *everything*
> above and below a given threshold.
>
> Also, a certain gray area is better not learned automatically.
> Seriously. False learning *immediately* will have an impact on further
> results. Whereas learning after a manual re-view is slower, but not
> affected by bootstrapping even more FNs and FPs out of its own ass.

Don't forget that the -2.5 score of DSPAM does contribute to
autolearning but BAYES_* doesn't, so it's not going to run away in the
FP direction. And anyway, very few emails score in the region -15 to +20
with the rules I quoted. It's just that I think they may be
particularly important to learn.

For the most part I keep the Bayes rules just for the meta rule:

  meta DS_HAM_FULL DS_HAM && (BAYES_00 || BAYES_05)
  score DS_HAM_FULL -15.0

This gives almost all ham a score of around -20, but doesn't prevent SA
from catching spam that DSPAM misses. DSPAM produces yes/no results and
works on an "innocent until proven guilty" basis, so it occasionally
classifies a message as ham when it's really just uncertain. If I simply
scored DS_HAM at -20, I'd effectively prevent SA from deciding anything
other than which Junk folder the spam goes into.

DSPAM is a *much* better statistical filter overall, but I've noticed
that however good spammers get at obfuscating spam, simple Bayesian
filters remain highly effective at identifying almost all of my ham
beyond reasonable doubt: unidentified email clusters around 0.5, not
0.0 as it does with DSPAM. This makes DSPAM and SA's Bayesian filter
complementary. If the latter actually identifies spam, it's no more
than a bonus.
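
To make the arithmetic behind the meta rule concrete, here is a rough Python sketch of how the scores combine. Only the -2.5 for DS_HAM and the -15.0 for DS_HAM_FULL come from the rules above. The BAYES_* scores, the 5.0 spam threshold, the 9.0 points from other rules, and the helper names total_score and is_spam are assumptions made purely for illustration. This is not how SA computes scores internally.

```python
# Only DS_HAM (-2.5) and DS_HAM_FULL (-15.0) are taken from the post.
# Every other number below is an illustrative assumption.
SCORES = {
    "DS_HAM": -2.5,
    "DS_HAM_FULL": -15.0,
    "BAYES_00": -2.6,  # assumed
    "BAYES_05": -1.1,  # assumed
    "BAYES_50": 0.0,   # assumed
}
SPAM_THRESHOLD = 5.0  # assumed spam cut-off

# Bayes rules that the meta rule accepts as confirmation of ham.
HAM_BAYES = ("BAYES_00", "BAYES_05")


def total_score(dspam_says_ham, bayes_rule, other_points, scores=SCORES):
    """Sum the points for one message.

    other_points stands in for whatever the remaining rules contribute.
    """
    total = other_points + scores[bayes_rule]
    if dspam_says_ham:
        total += scores["DS_HAM"]
        # Mirrors: meta DS_HAM_FULL DS_HAM && (BAYES_00 || BAYES_05)
        if bayes_rule in HAM_BAYES:
            total += scores["DS_HAM_FULL"]
    return total


def is_spam(score):
    return score >= SPAM_THRESHOLD


if __name__ == "__main__":
    # Ham that both filters agree on: all three negative scores apply.
    print("agreed ham:", total_score(True, "BAYES_00", 0.0))

    # Spam that DSPAM let through while Bayes is unsure: only -2.5 applies,
    # so the other rules can still push it over the threshold.
    missed = total_score(True, "BAYES_50", 9.0)
    print("missed spam:", missed, "spam?", is_spam(missed))

    # The blunt alternative: score DS_HAM at -20 and drop the meta rule.
    blunt = dict(SCORES, DS_HAM=-20.0, DS_HAM_FULL=0.0)
    swamped = total_score(True, "BAYES_50", 9.0, blunt)
    print("blunt scoring:", swamped, "spam?", is_spam(swamped))
```

With these assumed numbers, the agreed ham lands at about -20, the missed spam still crosses the threshold, and the blunt -20 variant buries the same spam well below it.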