On Thu, 20 Nov 2003, Smart,Dan wrote:

> Is there a reason that the Bayes scoring is NOT a normal distribution from
> 50% to 100%, and negative from 0% to 50%?

Yes, check the [SAtalk] list archives; this may well be a FAQ.

Short answer: all scores, including those for the Bayes rules, are
generated by a genetic algorithm ("the GA"), which cares little for making
the scores fit an ideal curve (normal distribution) or for satisfying the
consistency hobgoblins. The GA adjusts scores until false negatives (FNs)
and false positives (FPs) are minimized within the time and accuracy
constraints it's given.
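
To make the idea concrete, here is a toy sketch in Python of an
evolutionary search over rule scores. It is NOT the code SA actually uses
to generate its scores; the rule names, the false-positive weight, the
population size and the mutation step are all made up for illustration.
The only thing it shares with the real thing is the goal: move the scores
around until the FP/FN cost on a labelled corpus stops improving, with no
regard for how "tidy" the resulting numbers look.

```python
import random

THRESHOLD = 5.0  # assumed spam threshold for this sketch


def errors(scores, corpus, fp_weight=10.0):
    """Weighted misclassification cost of one score set.

    corpus is a list of (rules_hit, is_spam) pairs. fp_weight is a
    hypothetical penalty that makes a false positive cost more than a
    false negative.
    """
    cost = 0.0
    for rules_hit, is_spam in corpus:
        flagged = sum(scores[r] for r in rules_hit) >= THRESHOLD
        if flagged and not is_spam:
            cost += fp_weight  # ham flagged as spam (FP)
        elif is_spam and not flagged:
            cost += 1.0  # spam that got through (FN)
    return cost


def evolve(rules, corpus, generations=200, pop_size=20, seed=0):
    """Toy evolutionary search: keep the better half, mutate copies of it."""
    rng = random.Random(seed)
    population = [
        {r: rng.uniform(-1.0, 5.0) for r in rules} for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=lambda s: errors(s, corpus))
        survivors = population[: pop_size // 2]
        children = []
        for parent in survivors:
            child = dict(parent)
            rule = rng.choice(rules)
            child[rule] += rng.gauss(0.0, 0.5)  # nudge one score at random
            children.append(child)
        population = survivors + children
    return min(population, key=lambda s: errors(s, corpus))
```

Because the survivors are carried over unchanged, the best cost never gets
worse from one generation to the next. Nothing in the loop rewards scores
for being evenly spaced, which is why the output need not follow any
particular curve.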

People occasionally speculate on why the GA scores the way it does.
Ultimately it doesn't really matter: SA works best on the test corpus
with the scores set the way they are, and changing them makes SA perform
worse.

Besides, if SA were more effective with a nice clean normal distribution
of scores for Bayes, don't you think Jason, et al. would ship it that
way? :)

If you are going to adjust the scores yourself, your best bet is to run
the GA against your own (large) corpus of ham and spam so the scores are
tuned to the mail your site sees. Adjusting them by hand is almost
guaranteed to make SA perform worse.
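
If you do want to check a hand-edited score set before trusting it, the
comparison itself is simple. The sketch below is an assumption-laden
illustration, not an SA tool: the rule names, both score sets and the
four-message corpus are invented, and a real check needs a large corpus.
It just counts FPs and FNs for each score set at a given threshold.

```python
def fp_fn_counts(scores, corpus, threshold=5.0):
    """Return (false_positives, false_negatives) for one score set.

    corpus is a list of (rules_hit, is_spam) pairs; rules missing from
    scores count as 0.0.
    """
    fp = fn = 0
    for rules_hit, is_spam in corpus:
        flagged = sum(scores.get(r, 0.0) for r in rules_hit) >= threshold
        if flagged and not is_spam:
            fp += 1
        elif is_spam and not flagged:
            fn += 1
    return fp, fn


# Hypothetical data: every name and number below is made up.
corpus = [
    (["X_BAYES_HIGH", "X_HTML"], True),
    (["X_BAYES_HIGH"], True),
    (["X_HTML"], False),
    (["X_BAYES_MID", "X_HTML"], False),
]
generated = {"X_BAYES_HIGH": 5.4, "X_BAYES_MID": 0.1, "X_HTML": 1.2}
hand_edited = {"X_BAYES_HIGH": 4.0, "X_BAYES_MID": 2.0, "X_HTML": 3.5}

for name, scores in (("generated", generated), ("hand-edited", hand_edited)):
    print(name, fp_fn_counts(scores, corpus))
```

In this contrived case the "tidier" hand-edited set both misses a spam and
flags a ham, while the lopsided generated set does neither. The point is
only that the error counts, not the shape of the scores, are what to look
at.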

None of this is meant to discourage you. On the contrary, I think SA is
more effective globally if sites generate their own scores with the GA, if
only because spammers can't be sure which score sets are in use. Any
attempt to weasel around the generic SA scores will probably get flagged
by someone else's local tuning.


-- Bob

Spamassassin-talk mailing list
