On Tue, 2014-07-01 at 20:36 -0500, Steve Bergman wrote:
> On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:
> >
> > That's pretty bad practice. Fundamentally, you are implementing a custom
> > auto-learn flavor, overruling the SA configurable auto-learn behavior
>
> SA's autolearn behavior doesn't make much sense. I have no confidence in it.

The auto-learning feature is NOT meant to be a fully automated training
system. It's an aid for the user to eliminate the need to care about the
extremes, while focusing on the close-calls.

There are options to tweak to your specific needs, and there even is no
single "SA autolearn behavior" as you stated, but different flavors. And
an option to turn it off.

Frankly, it appears you don't understand what auto-learning is.

> This method shields the user from the worst of the spam, while giving
> them full control of what gets relearned as spam.

Wrong.

It is not "this" (your) method, that shields the user from the worst of
the spam. That's SA. Not your style of auto-training.

And unless you disabled Bayes auto-learning in SA (dunno, might have
been mentioned deep in the thread), the user does not have full control
of what gets relearned as spam.

> > and ignoring all safety concepts implemented by SA.
>
> What safety concepts? autolearn is a complete joke. Even the docs
> explain that it's only there as a last resort method of kinda sorta
> training the spam filter.

You are doing (custom) auto-learning as ham of any message with a score
less than required_score of 5.0. *That* is a joke. (Besides, you *are*
doing auto-learning, which you just claimed to be a complete joke.)

At this point I won't get into details. It should suffice to highlight
that a default ham auto-learning threshold of 0.1 is part of the safety
concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)

> > So if a user in a hurry simply deletes some spam, it will remain ham, as
> > far as Bayes is concerned.
>
> Same as with Thunderbird, I think.

I never checked the TB internal Bayes implementation and auto-learn
strategy, but I'd be surprised if they do train on black/white, without
any gray area in between.

You stated it. Please back up your claim.

> And it's working very well for them.
> If they act irresponsibly, they'll get more spam.
> It takes no longer to
> highlight the spam and click "Junk" than it does to highlight the spam
> and click "Delete".

While I am aware I'm not the average user -- there's a "delete" action
key on my keyboard. There's no "junk" equivalent. Yes, I avoid using the
mouse if keyboard interaction is more productive...

> I've pretty much decided at this point that if the users don't do what I
> tell them to do, repeatedly, then what results is not my responsibility.
>
> And it's not.

Do you hate your users or your job?  (Sorry, snide-remark I couldn't
resist. Feel free to ignore.)

> The alternative is to not mark incoming mail as ham, and allow the SA
> Bayesian filter to remain inactive forever.

No. I can only guess, but it appears there are some mis-interpretations
in that conclusion.

The SA Bayesian classifier to "remain inactive forever" can only refer
to insufficient initial training. Manual training. Of at least 200 ham
and spam each (by default, you can lower that to 0). You will easily get
that by manual training of existing messages. And even default
auto-learning would eventually cross the ham number. Less than forever.

More importantly, SA still marks (classifies) incoming mail as ham. Just
because its overall score is less than 5.0. It just does not *learn* all
of them as ham. Because there's a chance it might not actually be ham,
but a FN.

That area, between (default) auto-learning as ham and classifying as
spam is the gray area, where actual user input is of much value. For
both, learning spam AND ham, for that matter. In particular, because
generally (and as SA principle), a FP is *much* worse than a FN.

Your approach of force learning those as ham, is biasing your Bayes DB.
At the very least temporarily (unless a fresh spam campaign has been
re-trained by your users on Monday). At worst, until you clear it.

Btw, is that per-user, or are you gambling a site-wide Bayes DB?

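To make the gray area concrete, here is a minimal Python sketch of the
three score zones, assuming the stock defaults (required_score 5.0,
bayes_auto_learn_threshold_nonspam 0.1, bayes_auto_learn_threshold_spam
12.0). The function name and the action strings are hypothetical, made
up for illustration. It is also a simplification: real auto-learning
decides on a separate score computed without the Bayes rules and applies
further conditions before learning spam, none of which is modeled here.

```python
# Illustrative only, not SpamAssassin code. Threshold values are the
# documented defaults; everything else is a hypothetical helper.
REQUIRED_SCORE = 5.0       # required_score
AUTOLEARN_HAM_MAX = 0.1    # bayes_auto_learn_threshold_nonspam
AUTOLEARN_SPAM_MIN = 12.0  # bayes_auto_learn_threshold_spam


def zone(score: float):
    """Return (classification, learning action) for a message score."""
    # Classification only depends on required_score.
    classification = "spam" if score >= REQUIRED_SCORE else "ham"
    # Learning only happens at the extremes; the rest is the gray area.
    if score < AUTOLEARN_HAM_MAX:
        action = "auto-learn as ham"
    elif score >= AUTOLEARN_SPAM_MIN:
        action = "auto-learn as spam"
    else:
        action = "gray area: no auto-learning, manual training only"
    return classification, action


if __name__ == "__main__":
    for s in (-1.2, 3.0, 7.5, 15.0):
        print(s, *zone(s))
```

A message scoring 3.0 is classified as ham but not learned as ham.
Learning everything below 5.0 as ham removes exactly that middle zone.
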
> I opted to give the users the choice of being responsible for sorting,
> and reaping the benefits of that if they do. And yes, I know that some
> are not going to.
>
> I'd be interested if you have a better solution in mind.

Do not auto-learn as ham every message that scores below required_score.

Introduce train-on-error for your users, with an extended manual
training option. Specific ham and spam folders, where moving or copying
mail into them trains the Bayes classifier. Kind of optional for the
user, unless they feel there's too much mis-classification.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}