Re: [SAtalk] BAYES_XX scores

Matt Kettler Fri, 11 Jul 2003 13:01:32 -0700

At 12:56 PM 7/11/2003 -0500, Genchev, Sergei wrote:

 I use Bayes tests and do not use network tests with SA 2.55. What puzzles
me is the default scores for my situation.

 Is there any reason that the BAYES_80 score (5.3) is bigger than the BAYES_90
score (4.027) and even the BAYES_99 score (5.2)? BAYES_10 vs. BAYES_01 vs.
BAYES_00 also look strange.
 As I understand it, Bayes gives you a probability that an e-mail is spam.
Why then does a .9 probability get less weight than a .8 probability?
 If, with the current Bayes implementation, a score of .9 or higher is somehow
more doubtful than .8, then why do the same Bayes scores go steadily up when
network tests are used?
 Can anybody shed some light on this?

You're over-simplifying the system. The scores would likely be linear if the BAYES rules were the only rules in the entire ruleset.

However, that's not the case: there are hundreds of other rules in the ruleset. The score assigned to a rule is not just a function of that rule and how much spam it matches. It is really a function of the rule AND of which combinations of other rules match the same messages. This is the beauty of what the GA (genetic algorithm) does: it analyzes a very complex set of patterns and assigns scores which are a "best fit" to real-world data.

Emails which score very high in Bayes are also likely to be super-obvious to the default ruleset, so they will score high even without a high score assigned to BAYES_90. Emails coming in at BAYES_80, however, are more likely to be "sneaky" mails that don't match as many rules in the default ruleset, so the extra score might be necessary.
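As a toy illustration of that effect, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the six-message corpus, the rule names RULE_A and RULE_B, the integer score grid, and the exhaustive search that stands in for the GA. It is not SpamAssassin's scoring code and uses no mass-check data. It looks for the scores with the fewest misclassifications and, among those, the smallest total score.

```python
from itertools import product

# Assumed spam threshold, mirroring the usual default of 5.
THRESHOLD = 5
# BAYES_80 and BAYES_99 are real rule names; RULE_A and RULE_B are hypothetical.
RULES = ("BAYES_80", "BAYES_99", "RULE_A", "RULE_B")

# Invented corpus: (is_spam, set of rules the message hits).
CORPUS = [
    (True, {"BAYES_99", "RULE_A", "RULE_B"}),  # obvious spam, many rules hit
    (True, {"RULE_A", "RULE_B"}),              # obvious spam, no Bayes hit
    (True, {"BAYES_80"}),                      # "sneaky" spam, only Bayes hits
    (False, {"RULE_A"}),                       # ham that trips one rule
    (False, {"RULE_B"}),                       # ham that trips the other rule
    (False, set()),                            # clean ham
]


def errors(scores):
    """Count messages that the given rule scores would misclassify."""
    wrong = 0
    for is_spam, hits in CORPUS:
        total = sum(scores[rule] for rule in hits)
        if (total >= THRESHOLD) != is_spam:
            wrong += 1
    return wrong


def fit():
    """Try every integer score 0..6 per rule and keep the best combination.

    "Best" means fewest errors, then smallest total score, then the
    lexicographically smallest tuple so the result is deterministic.
    This brute-force search is only a stand-in for the real GA.
    """
    best_key, best_scores = None, None
    for combo in product(range(7), repeat=len(RULES)):
        scores = dict(zip(RULES, combo))
        key = (errors(scores), sum(combo), combo)
        if best_key is None or key < best_key:
            best_key, best_scores = key, scores
    return best_scores


if __name__ == "__main__":
    print(fit())
```

On this made-up corpus the search gives BAYES_80 a score of 5 and BAYES_99 a score of 0: the BAYES_99 spam is already caught by RULE_A plus RULE_B, while the BAYES_80 spam has nothing else going for it. The real GA works on far more rules and messages, so the actual numbers come out differently, but the mechanism is the same.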

Really, you'd have to rsync out the mass-check data and spend about a week analyzing it all by hand to figure out the exact reasons why the GA laid out the scores that way.

I know I'm too lazy to do all that work by hand. Suffice it to say, it's not reasonable to expect simple linear score assignments from an inherently complex system of hundreds of inter-related rules whose scores are assigned by a "best fit" genetic algorithm trained on real-world data. Things which look "wrong" from the simplified view quickly turn out to be "right" in most cases once you start looking at the bigger picture.

It's not entirely wrong to question the score assignments of the GA, but you certainly need to do so from the perspective of SA as a whole system, not just an individual rule or subset of rules.

If you dig back in the archives, this exact same question has been asked many times about SPAM_PHRASES in older versions. (SPAM_PHRASES was really a lot like a super-simplified Bayes with a fixed token database.)


