On Tue, 27 Oct 2009 15:01:39 +0100 Sam <liste-spamassas...@ingescom.com> wrote:
> RW wrote:
> > On Tue, 27 Oct 2009 13:33:14 +0100
> > If you find it surprising that that can happen, you don't understand
> > how Bayes works. It's a learning system that's intended to classify
> > mail it hasn't seen based on mail it has seen.
> >
> I agree with you for non-seen mail. But after learning with sa-learn
> I thought bayes should increase over Bayes_50 for the same learned
> message.

Most mails contain a number of hapaxes, one-off tokens that are never
seen again. If you train on a mail and then retest it, hapaxes and other
rare tokens often skew the result towards a positive match. This is why
a retest will sometimes score BAYES_99 while an almost identical spam
hits BAYES_50.

On some retests the hapaxes don't dominate and the probability stays
close to 0.5. Like many such filters, BAYES clusters strongly around 0,
0.5 and 1. If it allowed you to retrain to exhaustion (which it doesn't)
you would probably see several BAYES_50 results followed by a step
change to BAYES_99.

Check that you haven't set "bayes_use_hapaxes 0". Otherwise, if you are
seeing a lot of trained mails hit BAYES_50 on retesting (and I mean 10%
or so), you may have a mistrained database. If you only see a few,
forget about it.
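
If it helps to see the hapax effect in numbers, here is a toy sketch in
Python. It is not SpamAssassin's code: the ToyBayes class, the token
names, and the smoothing constants (s=1.0, x=0.5) are all made up for
illustration, and the combining step is a generic Robinson/Fisher-style
chi-square combiner that I am assuming is close enough in spirit. It
trains a small database on balanced mail, scores a spam that carries
four hapaxes, learns that spam, and scores it again, then scores a
near-identical spam whose hapaxes are different.

```python
import math
from collections import Counter


class ToyBayes:
    """Hypothetical miniature Bayes database, for illustration only."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam mails containing it
        self.ham = Counter()    # token -> number of ham mails containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        # Count each token once per mail.
        (self.spam if is_spam else self.ham).update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def token_prob(self, tok, s=1.0, x=0.5):
        # Smoothed per-token spam probability; s and x are assumed values.
        sc, hc = self.spam[tok], self.ham[tok]
        n = sc + hc
        if n == 0:
            return x            # never-seen token carries no evidence
        sr = sc / self.nspam if self.nspam else 0.0
        hr = hc / self.nham if self.nham else 0.0
        p = sr / (sr + hr)
        return (s * x + n * p) / (s + n)

    def score(self, tokens):
        probs = [self.token_prob(t) for t in set(tokens)]
        if not probs:
            return 0.5
        df = 2 * len(probs)
        spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), df)
        hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), df)
        return (1.0 + spamminess - hamminess) / 2.0


def chi2q(x2, df):
    """Chi-square survival function for an even number of degrees of freedom."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


db = ToyBayes()
common = ["hello", "please", "click"]
for _ in range(50):             # common words appear equally in spam and ham
    db.learn(common, is_spam=True)
    db.learn(common, is_spam=False)

message = common + ["xq7z", "k9vv", "zz81", "pl0x"]   # four hapaxes
before = db.score(message)      # nothing in the mail is informative yet
db.learn(message, is_spam=True)
after = db.score(message)       # the hapaxes are now known spam tokens

variant = common + ["aa11", "bb22", "cc33", "dd44"]   # same spam, new hapaxes
unseen = db.score(variant)

print(f"before training: {before:.3f}")
print(f"retest:          {after:.3f}")
print(f"near-identical:  {unseen:.3f}")
```

In this toy the retest moves well away from 0.5 purely because of the
four one-off tokens, while the near-identical spam gains nothing from
the training. With fewer hapaxes in the mail, or with hapaxes disabled,
the retest stays much nearer 0.5, which is the behaviour being asked
about.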