Re: Bayes, Manual and Auto Learning Strategies

Karsten Bräckelmann Tue, 01 Jul 2014 19:54:06 -0700

On Tue, 2014-07-01 at 20:36 -0500, Steve Bergman wrote:
> On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:
> >
> > That's pretty bad practice. Fundamentally, you are implementing a custom
> > auto-learn flavor, overruling the SA configurable auto-learn behavior
> 
> SA's autolearn behavior doesn't make much sense. I have no confidence in it.


The auto-learning feature is NOT meant to be a fully automated training
system. It's an aid for the user to eliminate the need to care about the
extremes, while focusing on the close-calls. There are options to tweak
to your specific needs, and there even is no single "SA autolearn
behavior" as you stated, but different flavors. And an option to turn it
off.

Frankly, it appears you don't understand what auto-learning is.

> This method shields the user from the worst of the spam, while giving 
> them full control of what gets relearned as spam.

Wrong. It is not "this" (your) method, that shields the user from the
worst of the spam. That's SA. Not your style of auto-training.

And unless you disabled Bayes auto-learning in SA (dunno, might have
been mentioned deep in the thread), the user does not have full control
of what gets relearned as spam.


> > and ignoring all safety concepts implemented by SA.
> 
> What safety concepts? autolearn is a complete joke. Even the docs 
> explain that it's only there as a last resort method of kinda sorta 
> training the spam filter.

You are doing (custom) auto-learning as ham of any message with a score
less than required_score of 5.0. *That* is a joke.

(Besides, you *are* doing auto-learning, which you just claimed to be a
complete joke.)

At this point I won't get into details. It should suffice to highlight
that a default ham auto-learning threshold of 0.1 is part of the safety
concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)


> > So if a user in a hurry simply deletes some spam, it will remain ham, as
> > far as Bayes is concerned.
> 
> Same as with Thunderbird, I think.

I never checked the TB internal Bayes implementation and auto-learn
strategy, but I'd be surprised if they do train on black/white, without
any gray area in between.

You stated it. Please back up your claim.


> And it's working very well for them. 
> If they act irresponsibly, they'll get more spam. It takes no longer to 
> highlight the spam and click "Junk" than it does to highlight the spam 
> and click "Delete".

While I am aware I'm not the average user -- there's a "delete" action
key on my keyboard. There's no "junk" equivalent. Yes, I avoid using the
mouse if keyboard interaction is more productive...


> I've pretty much decided at this point that if the users don't do what I 
> tell them to do, repeatedly, then what results is not my responsibility.
> 
> And it's not.

Do you hate your users or your job? (Sorry, snide-remark I couldn't
resist. Feel free to ignore.)

> The alternative is to not mark incoming mail as ham, and allow the SA 
> Bayesian filter to remain inactive forever.

No. I can only guess, but it appears there are some mis-interpretations
in that conclusion.

The SA Bayesian classifier to "remain inactive forever" can only refer
to insufficient initial training. Manual training. Of at least 200 ham
and spam each (by default, you can lower that to 0). You will easily get
that by manual training of existing messages. And even default auto-
learning would eventually cross the ham number. Less than forever.

More importantly, SA still marks (classifies) incoming mail as ham. Just
because its overall score is less than 5.0. It just does not *learn* all
of them as ham. Because there's a chance it might not actually be ham,
but a FN.

That area, between (default) auto-learning as ham and classifying as
spam is the gray area, where actual user input is of much value. For
both, learning spam AND ham, for that matter. In particular, because
generally (and as SA principle), a FP is *much* worse than a FN.


Your approach of force learning those as ham, is biasing your Bayes DB.
At the very least temporarily (unless a fresh spam campaign has been
re-trained by your users on Monday). At worst, until you clear it.

Btw, is that per-user, or are you gambling a site-wide Bayes DB?


> I opted to give the users the choice of being responsible for sorting, 
> and reaping the benefits of that if they do. And yes, I know that some 
> are not going to.
> 
> I'd be interested if you have a better solution in mind.

Do not auto-learn ham every message that scores below required_score.

Introduce train-on-error for your users, with an extended manual
training option. Specific ham and spam folders, where moving or copying
mail into trains the Bayes classifier. Kind of optional for the user,
unless they feel there's too much mis-classification.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: Bayes, Manual and Auto Learning Strategies

Reply via email to