On Tue, 2009-03-03 at 23:16 +0100, mouss wrote:
> I finally disabled Bayes, because I think it doesn't bring me what I want:

Works really well for me. Quick guesstimate is 99% of my spam hits
BAYES_80 or higher, most of them 95+. Ham typically scores BAYES_00,
IIRC almost always below 50. And no, these numbers are not made up. :)
This is per-user, err, per-human, though. Multiple addresses, single me.

> - train on error doesn't seem enough, and I can understand it

Agreed. Of course I do that, but read on...

> - train on everything isn't reasonable. even myself wouldn't do that,
> because while I can see spam and feed sa, I don't check all my mail to
> be sure the messages I didn't see are ham.

Definitely. I do NOT even bother to scan, let alone train, mailing list
posts, bugzilla bulk, etc. These are filtered early. It's pretty much
Inbox (direct, personal mail) or spam here.

I developed some special habits for training long ago. First of all,
auto-learn *is* enabled. For ham. The auto-learn spam threshold is set
way up so it never triggers, effectively disabled. (Rough config sketch
further down.)

I do train all my non-auto-learned ham -- occasionally. That's like
once or twice a year... I'm too lazy. 'Cause I do paranoidly review the
ham before learning. Got a ham backup folder for that, populated
automatically. Auto-learning ham generally performs just great for me.

Then I do learn spam manually, aided by mail filters. For example, all
16+ scoring spam with a low-ish Bayes score (below BAYES_80) gets
dumped to a copy folder, for quick review, training and flaming. Every
now and then I do train lower scoring spam, too. Funnily enough, these
usually tend to score high on Bayes anyway; it's just other hits that
are missing for a solid 15.

FWIW, I am likely to eventually implement *my* flavor of auto-learning,
so that low scorers get learned automatically as they come in.

Why do I do it that way? Easy. There's no way I can keep up with
learning 800 spams a day. Don't get that many hams. Remember, ham ==
Inbox here, no mailing lists, etc. So I don't bother training the
lion's share that easily triggers 95+ anyway. It's basically an attempt
to limit learning spam, to not bias my Bayes beyond necessity. Has
performed really well for me for years.
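For the curious, the auto-learn side of that boils down to a few lines
of SpamAssassin config (local.cf). Just a sketch -- the exact numbers
are illustrative, not necessarily what I run:

  # Bayes and auto-learning enabled
  use_bayes                           1
  bayes_auto_learn                    1

  # auto-learn ham at (roughly) the stock threshold
  bayes_auto_learn_threshold_nonspam  0.1

  # crank the spam threshold way up, so spam never gets auto-learned
  bayes_auto_learn_threshold_spam     9999

That's really all there is to the "auto-learn for ham only" part.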
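The manual side is plain old sa-learn on the review folders. The folder
names below are made up for the example (maildir-style, adjust for your
layout), and --forget is there for the hypothetical FP I have yet to
see:

  # feed the reviewed spam copy folder to Bayes
  sa-learn --spam ~/Maildir/.spam-review/cur

  # occasionally feed the reviewed ham backup folder, too
  sa-learn --ham  ~/Maildir/.ham-backup/cur

  # and should a learned message ever turn out to be ham after all,
  # un-learn it before re-learning it correctly
  sa-learn --forget ~/Maildir/.oops/cur/<message>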
> - it's too fragile in my opinion. and I got to this conclusion a lot
> time ago when testing dspam. By fragile, I mean that it depends too much
> on how/when/... you train it

I mentioned I plan to implement my selective auto-learning flavor. This
is related to exactly that. Some days I wake up and find a bunch of
"new" spam that slipped through. Bummer. After learning these, I'll
often never see one of their type again -- known to Bayes, they end up
in the header-logging folder...

Which means I plan to switch to "train 10+ scoring spam with low-ish
Bayes automatically". I am prepared to UN-learn FPs. Have never ever
seen one with that score anyway. :) And the remaining < 0.5% that
scores below 10 is easy enough to train manually.

> - in a site wide setup, it's hard to come up with a "serious" system
> (get feedback but stay safe against dumb users)
>
> - in a per user setup, you get the storage cost. but that's not all:
> you're just ignoring the problem. lusers can't/don't train bayes...
>
> of course, if I'm writing this, it's to get opinions.

There you go. :) Hope you like it. Hey, I just leaked my s3creet
training strategy! ;)

  guenther

-- 
hrm, I just know there was something I wanted to add, but just forgot...
-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}