Re: sa-learn spam and Bayes_50

RW Tue, 27 Oct 2009 09:33:17 -0700

On Tue, 27 Oct 2009 15:01:39 +0100
Sam <liste-spamassas...@ingescom.com> wrote:


> RW wrote:
> > On Tue, 27 Oct 2009 13:33:14 +0100

> > If you find it surprising that that can happen, you don't understand
> > how Bayes works. It's a learning system that's intended to classify
> > mail it hasn't seen based on mail it has seen. 
> >
> >   
> I agree with you for unseen mail. But after learning a message with
> sa-learn, I thought its Bayes score should rise above BAYES_50 when
> that same message is tested again.

Most mails contain a number of hapaxes, one-off tokens that are never
seen again. If you train on a mail and then retest, hapaxes and other
rare tokens often skew the result to produce a positive match; this is
why sometimes a retest will score BAYES_99, but an almost identical spam
will hit BAYES_50.
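
To make the hapax effect concrete, here is a minimal Python sketch. It
is not SpamAssassin's code: it uses Robinson-style smoothing per token
and a simple Graham-style naive product to combine them, whereas
SpamAssassin's real combiner is chi-squared based. The function names,
the smoothing constants (s=1.0, x=0.5) and all token counts are made up
for illustration.

```python
from math import prod

def token_prob(spam_count, ham_count, nspam, nham, s=1.0, x=0.5):
    """Robinson-style smoothed spam probability for one token.

    s and x are illustrative values, not SpamAssassin's constants.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: fall back to the neutral prior
    spam_freq = spam_count / nspam
    ham_freq = ham_count / nham
    raw = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * raw) / (s + n)

def combine(probs):
    """Graham-style naive combination of per-token probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Hypothetical database: 1000 spam and 1000 ham already learned.
NSPAM, NHAM = 1000, 1000

# 20 common tokens seen equally often in spam and ham: each is neutral.
# Their counts are left unchanged after training, for simplicity.
common = [token_prob(50, 50, NSPAM, NHAM) for _ in range(20)]

# 5 one-off tokens: unseen before training, seen once in spam after
# training on this very message.
hapax_before = [token_prob(0, 0, NSPAM, NHAM) for _ in range(5)]
hapax_after = [token_prob(1, 0, NSPAM + 1, NHAM) for _ in range(5)]

before = combine(common + hapax_before)
after = combine(common + hapax_after)
print(f"before training: {before:.4f}")
print(f"after training:  {after:.4f}")
```

With these made-up numbers the retest jumps from 0.5 to about 0.996
purely because of the five hapaxes. A near-identical spam carrying
different one-off tokens would get the "before" score, since none of
its rare tokens are in the database.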

On some retests the hapaxes don't dominate and the probability stays
close to 0.5. Like many such filters, BAYES clusters strongly around 0,
0.5 and 1. If it allowed you to retrain to exhaustion (which it
doesn't), you would probably see several BAYES_50 results followed by a
step change to BAYES_99.


Check that you haven't set "bayes_use_hapaxes 0". Otherwise, if you are
seeing a lot of trained mails hit BAYES_50 on retesting (and I mean 10%
or so), you may have a mistrained database. If you only see a few,
forget about it.
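
One rough way to put a number on "a lot" is sketched below in Python.
It assumes you have already retested your trained mails and collected
the value of each X-Spam-Status header as a string. The function name
and the sample header values are hypothetical, and the 10% figure is
only the ballpark mentioned above, not a hard threshold.

```python
import re

# Match the rule name as a whole word inside the tests= list.
BAYES_50 = re.compile(r"\bBAYES_50\b")

def bayes50_fraction(status_headers):
    """Return the fraction of X-Spam-Status values that list BAYES_50."""
    headers = list(status_headers)
    if not headers:
        return 0.0
    hits = sum(1 for h in headers if BAYES_50.search(h))
    return hits / len(headers)

# Made-up header values, for illustration only.
sample = [
    "Yes, score=7.1 required=5.0 tests=BAYES_99,HTML_MESSAGE autolearn=no",
    "Yes, score=5.3 required=5.0 tests=BAYES_50,HTML_MESSAGE autolearn=no",
    "Yes, score=8.0 required=5.0 tests=BAYES_99 autolearn=no",
    "Yes, score=6.4 required=5.0 tests=BAYES_99,URIBL_BLACK autolearn=no",
]

fraction = bayes50_fraction(sample)
print(f"{fraction:.0%} of retested mails hit BAYES_50")
```

A result well under 10% on a reasonably large set of trained mails
would fall into the "forget about it" case.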
