On Wed, 6 Aug 2003, Daniel Carrera yowled:
> On Wed, Aug 06, 2003 at 09:10:36PM +0300, Harri Pesonen wrote:
>>
>> This has probably been asked a zillion times, but why so low scores?
>
> I think that it's just to pick safe defaults. Bayes is only reliable
> after it's been well-trained.

The GA probably chose those scores because stuff that hit high BAYES
scores also tended to hit so many other rules that it wasn't necessary
to give the scores a big hit to push them above 5.0.

Bear in mind that the GA does *not* aim for `maximise spam score and
minimise nonspam score'. It aims for `maximise %age of spam with score
>5.0' and `maximise %age of nonspam with score <5.0', giving strong
preference to the latter.

Looking at the soratios:

OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
  12.424  45.4324   0.0287  0.999   1.00   4.03  BAYES_90
  28.841   0.0107  39.6665  0.000   0.99  -5.40  BAYES_01
  11.173   0.0015  15.3686  0.000   0.95  -5.30  BAYES_10
   4.145  15.1681   0.0052  1.000   0.95   5.30  BAYES_80
   9.062   0.0015  12.4644  0.000   0.95  -5.30  BAYES_00
   2.299   8.4186   0.0006  1.000   0.94   5.20  BAYES_99
   5.878   0.0077   8.0825  0.001   0.94  -4.70  BAYES_20
   2.991  10.8226   0.0500  0.995   0.93   2.59  BAYES_70
   4.375   0.1164   5.9740  0.019   0.88  -1.07  BAYES_30
   2.405   8.0740   0.2766  0.967   0.84   2.00  BAYES_60
   0.000   0.0000   0.0000  0.500   0.00   0.00  BAYES_56
   0.000   0.0000   0.0000  0.500   0.00   0.00  BAYES_50
   0.000   0.0000   0.0000  0.500   0.00   0.00  BAYES_44
   0.000   0.0000   0.0000  0.500   0.00   0.00  BAYES_40

Note that BAYES_90 and BAYES_99 actually hit some nonspam; BAYES_90
actually hit more nonspam than did BAYES_80 (although BAYES_80 catches
much less spam than does BAYES_90).

Therefore, the GA was driven to push the high-confidence Bayes scores
down because they were occasionally wrong for legitimate email, and
giving BAYES_90 a ludicrously high score was pushing that legit email
into the spam range. The GA works *hard* to prevent that.

>> I have noticed that SA has missed a couple of mails, score about 4.8,
>> even though Bayes gave them 90% or 99% probability.
>
> I noticed that too. After I found that my SA was well-trained enough to
> have a very high accuracy I raised the values for BAYES.

Look at rules/STATISTICS-set2.txt (set2 being for the bayes and no-net
run).

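To make the trade-off concrete, here is a rough Python sketch. so_ratio()
reproduces the S/O column from the SPAM% and HAM% figures above (spam hits
over all hits, which matches the listed values; returning 0.5 for rules
with no hits is my reading of the table). cost() is only a toy stand-in
for what the GA optimises: the function name, the fp_weight of 100 and the
four sample messages are invented for illustration and are not
SpamAssassin's real parameters or data.

```python
THRESHOLD = 5.0  # the spam threshold discussed above


def so_ratio(spam_pct, ham_pct):
    """Share of a rule's hits that are spam (assumed 0.5 when it never hits)."""
    total = spam_pct + ham_pct
    return 0.5 if total == 0 else spam_pct / total


def cost(messages, bayes_score, fp_weight=100.0):
    """Toy objective, lower is better.

    messages: list of (is_spam, score_from_other_rules, hits_bayes_rule).
    A missed spam costs 1; a flagged legitimate mail costs fp_weight
    (hypothetical value standing in for "strong preference").
    """
    total_cost = 0.0
    for is_spam, other_score, hits_bayes in messages:
        total = other_score + (bayes_score if hits_bayes else 0.0)
        flagged = total > THRESHOLD
        if is_spam and not flagged:
            total_cost += 1.0        # false negative
        elif not is_spam and flagged:
            total_cost += fp_weight  # false positive
    return total_cost


# Made-up corpus: three spams and one legitimate mail, all hitting the
# same high-confidence Bayes rule.
messages = [
    (True, 6.0, True),   # spam already over the threshold without Bayes
    (True, 3.0, True),   # spam that needs a modest push
    (True, 0.8, True),   # spam that hits little else
    (False, 0.5, True),  # legitimate mail that Bayes got wrong
]

print(round(so_ratio(45.4324, 0.0287), 3))  # BAYES_90 row
print(cost(messages, 4.03))                 # the score the GA picked
print(cost(messages, 8.0))                  # a much higher score
```

With these made-up messages, a score of 4.03 leaves one spam at 4.83 (a
miss, cost 1), while a score of 8.0 catches that spam but drags the
legitimate mail to 8.5 (a false positive, cost 100). Under that kind of
weighting the lower score wins.
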
Bayes isn't as brilliant as you think; it does occasionally make a
mistake, and the GA pushed its score down accordingly.

-- 
`That sound you hear is configure wailing, "MY PRECIOUSSSSSSSS!" as it
 overwrites Multilib with Primary.' --- Phil Edwards