At 11:23 PM 9/24/03 -0500, Jeremy M. Dolan wrote:
> There's no further information on the auto-learning in the sa-learn or
> spamassassin man pages, or on the web page. What gives?

The Mail::SpamAssassin::Conf man page (man Mail::SpamAssassin::Conf) should cover autolearning: it documents the options that control it.

> Also: are you aware that 80%-probable spam is assigned a significantly
> higher default score (5.3) than 99%-probable (4.0)? Genetic algorithm
> or no, that doesn't seem statistically healthy. If giving
> 90-99%-probable spam an EQUAL or higher score than 80-90%-probable
> spam receives is causing false positives, wouldn't that point to a
> flaw in the Bayes filtering theory or implementation?

This is a common oversimplification of the system. People tend to think about one rule at a time and assume that a stronger Bayes probability must earn a higher number of points. However, once reality enters the picture, that is no longer a sensible viewpoint, and it's not how the GA works.

You see, rules are not scored based on their own individual behavior; they are scored based on their interactions with every other rule in the ruleset. The GA tunes the scores so that the combination of rules each message hits puts as many messages as possible in the right pile.

A spam message with such a high Bayes score is also likely to trigger a large number of other spam rules, and thus does not require a massive score boost from the BAYES rule. However, nonspam messages that Bayes wildly misclassifies, such as dirty jokes or legitimate financial newsletters, are likely to wind up deep in spam territory as well, not merely at the 80% mark. Admittedly such messages are rare, but they need a bit more slack.
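
To make that concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the six-message corpus, the "points from other rules" values, and the 5.0 threshold. It also uses a plain exhaustive grid search rather than SA's actual genetic algorithm. The only point is that when two Bayes scores are chosen jointly to minimize misclassifications, the best choices can put the 80% rule above the 99% rule.

```python
from itertools import product

THRESHOLD = 5.0  # assumed cutoff: a total at or above this is called spam

# Hypothetical corpus: (is_spam, Bayes rule hit, points from all other rules)
CORPUS = [
    (True, "BAYES_99", 6.0),   # obvious spam: plenty of other rules fire
    (True, "BAYES_99", 7.5),
    (True, "BAYES_80", 1.0),   # borderline spam: few other rules fire
    (True, "BAYES_80", 0.5),
    (False, "BAYES_99", 1.5),  # rare nonspam that Bayes gets badly wrong
    (False, None, 0.0),        # ordinary nonspam, no Bayes hit
]


def errors(scores):
    """Count misclassified messages for a given set of Bayes rule scores."""
    wrong = 0
    for is_spam, bayes_rule, other_points in CORPUS:
        total = other_points + scores.get(bayes_rule, 0.0)
        if (total >= THRESHOLD) != is_spam:
            wrong += 1
    return wrong


def best_score_pairs(step=0.5, max_score=6.0):
    """Try every (BAYES_80, BAYES_99) pair on a grid; keep the best ones."""
    grid = [i * step for i in range(int(max_score / step) + 1)]
    results = [
        ((s80, s99), errors({"BAYES_80": s80, "BAYES_99": s99}))
        for s80, s99 in product(grid, grid)
    ]
    fewest = min(count for _, count in results)
    return fewest, [pair for pair, count in results if count == fewest]


fewest, pairs = best_score_pairs()
print("fewest errors:", fewest)
print("BAYES_80 > BAYES_99 in every best pair:",
      all(s80 > s99 for s80, s99 in pairs))
```

In this made-up corpus the borderline spam needs a large BAYES_80 score to get over the threshold, while the misclassified nonspam needs a small BAYES_99 score to stay under it, and the obvious spam is caught either way.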

Thus, Bayes has always been nonlinear in score, and its predecessor, spam_phrases, also had nonlinear scores. This has never been a bug and never will be; it merely reflects the fact that how SA works is significantly more complex than it may seem at a casual glance. SA's scoring reflects the interactions of all of its hundreds of rules when given real-world input.

Never underestimate the complexity of a system with literally hundreds of variables trying to characterize things which are based on human behavior.



_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk