Re: Bye Bye Bayes

Karsten Bräckelmann Tue, 03 Mar 2009 15:00:57 -0800

On Tue, 2009-03-03 at 23:16 +0100, mouss wrote:
> I finally disabled Bayes, because I think it doesn't bring me what I want:


Works really well for me. Quick guesstimate is 99% of my spam hits
BAYES_80 or higher, most of them 95+. Ham typically scores 00, IIRC
almost always below 50. And no, these numbers are not made up. :)

This is per-user, err, human, though. Multiple addresses, single me.


> - train on error doesn't seem enough, and I can understand it

Agreed. Of course I do that, but read on...

> - train on everything isn't reasonable. even I wouldn't do that,
> because while I can see spam and feed sa, I don't check all my mail to
> be sure the messages I didn't see are ham.

Definitely. I do NOT even bother to scan, let alone train on, mailing list
posts, bugzilla bulk, etc. These are filtered early. It's pretty much Inbox
(direct, personal mail) or spam here.

I developed some special habits for training long ago.

First of all, auto-learn *is* enabled. For ham. The auto-learn spam
threshold is set so high that it never triggers, effectively disabled. I do
train all my non-auto-learned ham -- occasionally. That's like once or
twice a year... I'm too lazy, because I paranoidly review the ham before
learning. Got a ham backup folder for that, populated automatically.
Auto-learning ham generally performs just great for me.

Then I do learn spam manually. Aided by mail-filters. For example, all
16+ scoring spam with low-ish Bayes scores below 80 are getting dumped
to a copy folder, for quick review, training and flaming.
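A rough sketch of that copy-folder rule, in Python rather than whatever
mail filter is actually in use (the post does not say). It assumes the
message carries SpamAssassin's usual X-Spam-Status header with "score="
and "tests=" fields. The function names and the choice to skip mail with
no BAYES_nn hit at all are assumptions made for illustration.

```python
import re
from email import message_from_string

SCORE_MIN = 16.0   # "16+ scoring" from the post
BAYES_MAX = 80     # "Bayes scores below 80" from the post


def parse_status(header):
    """Return (score, bayes_bucket) from an X-Spam-Status value.

    Either element is None if the field is missing from the header.
    """
    score = re.search(r"score=(-?\d+(?:\.\d+)?)", header)
    bayes = re.search(r"\bBAYES_(\d+)\b", header)
    return (float(score.group(1)) if score else None,
            int(bayes.group(1)) if bayes else None)


def wants_review(raw_message):
    """True if the message should be copied to the review folder."""
    msg = message_from_string(raw_message)
    # Collapse folded header lines into a single line before matching.
    header = " ".join(msg.get("X-Spam-Status", "").split())
    score, bayes = parse_status(header)
    if score is None or bayes is None:
        return False  # assumption: no Bayes hit means nothing to review
    return score >= SCORE_MIN and bayes < BAYES_MAX
```

A message tagged "score=17.3 ... tests=BAYES_50,..." would be copied,
while one hitting BAYES_99 would not, whatever its total score.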

Every now and then I do train lower scoring spam, too. Funnily enough,
these usually score high on Bayes anyway; they are just missing other
hits for a solid 15.

FWIW, I am likely to eventually implement *my* flavor of auto-learning,
so that low scorers get trained automatically as they come in.


Why do I do it that way?  Easy. There's no way I can keep up with learning
800 spams a day. Don't get that many hams. Remember, ham == Inbox here,
no mailing lists, etc.  So I don't bother training the lion's share that
easily triggers 95+ anyway.

It's basically an attempt to limit learning spam, to not bias my Bayes
beyond necessity. Has performed really well for me for years.


> - it's too fragile in my opinion. and I got to this conclusion a long
> time ago when testing dspam. By fragile, I mean that it depends too much
> on how/when/... you train it

I mentioned I plan to implement my selective auto-learning flavor. This
is related to exactly this.

Some days I wake up and find a bunch of "new" spam that slipped through.
Bummer. After learning these, I'll often never see one of their type
again -- once known to Bayes, they end up in the header-logging folder...

This implies that I plan to switch to "train 10+ scoring spam with
low-ish Bayes automatically". I am prepared to UN-learn FPs. Have never
ever seen one with that score anyway. :)  And the remaining < 0.5% that
scores below 10 is easy enough to train manually.
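
The planned rule could look roughly like this. This is only a sketch in
Python, not the actual implementation, which does not exist yet. The
"below 80" cut-off for "low-ish Bayes" is borrowed from the copy-folder
rule earlier in this mail and is an assumption here, as are the function
names and the "skip" case for spam Bayes already knows well. The commands
are only built, never run; sa-learn's --spam and --forget options are the
ones used.

```python
AUTO_SCORE_MIN = 10.0   # "10+ scoring" from the post
BAYES_LOWISH = 80       # assumption: same cut-off as the review folder


def training_action(score, bayes):
    """Decide what to do with a message already tagged as spam.

    score: total SpamAssassin score
    bayes: the nn of the BAYES_nn rule that hit
    """
    if score < AUTO_SCORE_MIN:
        return "manual"   # rare low scorers: review and train by hand
    if bayes < BAYES_LOWISH:
        return "auto"     # solid spam that Bayes does not know yet
    return "skip"         # Bayes already scores it high, no training


def sa_learn_argv(path, undo=False):
    """Build the sa-learn command for one message file.

    undo=True builds the command to forget a wrongly learned message.
    """
    return ["sa-learn", "--forget" if undo else "--spam", path]
```

For an FP, sa_learn_argv(path, undo=True) gives the command to UN-learn
the message.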


> - in a site wide setup, it's hard to come up with a "serious" system
> (get feedback but stay safe against dumb users)
> 
> - in a per user setup, you get the storage cost. but that's not all:
> you're just ignoring the problem. lusers can't/don't train bayes...
> 
> of course, if I'm writing this, it's to get opinions.

There you go. :)  Hope you like it.  Hey, I just leaked my s3creet
training strategy! ;)

  guenther  -- hrm, I just know there was something I wanted to add,
               but just forgot...

-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
