Wow! Matt, this is an incredibly informative post. Thank you!

-----Original Message-----
From: Matt Kettler [mailto:[EMAIL PROTECTED]
Sent: Friday, November 07, 2003 12:43 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [SAtalk] scoring system and values...


At 10:29 AM 11/7/2003, Maarten J H van den Berg wrote:

>Sorry if this has been discussed in the past...

It's been discussed many times.. It's very common for people to have a very 
deep misunderstanding of how SA scoring works. Most people fall into the 
trap of over-simplifying the problem, and simply assuming that some rule or 
another "must" be a good spam rule, when in fact it's not.

>Of course this is open to debate, but then again that's all I want;
>possibly a debate about how accurate the scoring is right now...

That's fine.. but in the next round you're going to have to do a LOT more 
homework.. you're over-simplifying things by merely looking at the name of 
the rule... You're not looking at its performance levels, its impact on 
nonspam, or its interactions with other rules.

Questioning the accuracy of the scoring system isn't unreasonable.. but the 
scoring system is VASTLY more complicated than you can understand in a few 
hours of study. You need to have a good understanding of how it really 
works, and just how complicated the balance of the scoring system is before 
you can make reasonable judgements about accuracy.

You need to realize the SA scoring system is somewhat analogous to curve 
fitting an equation with 873 variables (there are 873 rules in SA 2.60's 
50_scores.cf). This is done as an approximation using a genetic algorithm 
to evolve a solution, since a direct solution would take too long to 
compute. Trying to get your mind completely around an equation with that 
many variables is not possible for most humans, including me, but I've 
learned to understand and respect how complex the problem is.
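
To give a feel for what "evolving a solution" means, here's a toy sketch in 
Python. It is NOT the actual SA tool: the three rules, the six-message 
corpus, the 5.0 threshold, the population size and the mutation settings are 
all made up for illustration, and the fitness is a plain count of misfiled 
messages.

```python
import random

# Hypothetical toy data: (rules a message hits, whether it is spam).
RULES = ["RULE_A", "RULE_B", "RULE_C"]
CORPUS = [
    ({"RULE_A", "RULE_B"}, True),
    ({"RULE_A", "RULE_C"}, True),
    ({"RULE_B", "RULE_C"}, True),
    ({"RULE_A"}, False),
    ({"RULE_C"}, False),
    (set(), False),
]
THRESHOLD = 5.0  # assumed cutoff: total score at or above this means "spam"


def errors(scores):
    """Fitness: number of messages the score set puts in the wrong pile."""
    wrong = 0
    for hits, is_spam in CORPUS:
        flagged = sum(scores[rule] for rule in hits) >= THRESHOLD
        if flagged != is_spam:
            wrong += 1
    return wrong


def evolve(generations=200, pop_size=20, seed=0):
    """Evolve one score per rule; returns the best score set found."""
    rng = random.Random(seed)
    pop = [{r: rng.uniform(0.0, 4.9) for r in RULES} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=errors)
        survivors = pop[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)
            # Crossover: take each score from one parent, then mutate slightly.
            child = {r: rng.choice((mom[r], dad[r])) for r in RULES}
            child = {r: max(0.0, s + rng.gauss(0.0, 0.3)) for r, s in child.items()}
            children.append(child)
        pop = survivors + children
    return min(pop, key=errors)


best = evolve()
print(errors(best), {r: round(s, 2) for r, s in best.items()})
```

Note that no single rule in the toy corpus can be scored on its own merits: 
each score only makes sense relative to the others. Now imagine that with 873 
rules instead of 3.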


>List 1:
>score ALL_CAP_PORN 0.650 0.669 0 0
>score PENIS_ENLARGE2 0.500 0.590 0 0.501
>score UPPERCASE_50_75 0.794 1.137 0 0
>score V+AG+A_ONLINE 1.100 1.101 3.151 4.056
>
>If it were up to me, I'd say that giving only half a point to a mail that
>scores PENIS_ENLARGE2 is...  well, ludicrous.  Let's not kid ourselves.
>IF there are people who participate on a genuine mailing list that
>discusses penis enlargement, let the burden fall on them to put those
>addresses in their whitelist, not the reverse.

OK, being that it's not up to you, let's look at the real-world performance 
of these rules from STATISTICS.txt

OVERALL%    SPAM%     HAM%      S/O   RANK   SCORE  NAME
   1.010   1.5010   0.0893    0.944   0.80    0.65  ALL_CAP_PORN
   2.962   4.5216   0.0418    0.991   0.93    0.50  PENIS_ENLARGE2
   0.580   0.8552   0.0645    0.930   0.77    0.79  UPPERCASE_50_75
   1.040   1.5930   0.0032    0.998   0.95    1.10  V+AG+A_ONLINE
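
If you want to see where the S/O column comes from, the short Python sketch 
below recomputes it from the SPAM% and HAM% columns. The formula 
spam% / (spam% + ham%) is my assumption about how the column is derived; 
it does reproduce all four rows above to three decimal places. The function 
name is made up for this example.

```python
# Rows copied from the table above: (name, SPAM%, HAM%, reported S/O).
ROWS = [
    ("ALL_CAP_PORN", 1.5010, 0.0893, 0.944),
    ("PENIS_ENLARGE2", 4.5216, 0.0418, 0.991),
    ("UPPERCASE_50_75", 0.8552, 0.0645, 0.930),
    ("V+AG+A_ONLINE", 1.5930, 0.0032, 0.998),
]


def s_over_o(spam_pct, ham_pct):
    """Assumed definition: share of a rule's (normalized) hits that are spam."""
    return spam_pct / (spam_pct + ham_pct)


for name, spam_pct, ham_pct, reported in ROWS:
    computed = s_over_o(spam_pct, ham_pct)
    print(f"{name:16s} computed={computed:.3f} reported={reported:.3f}")
```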

*yawn*.. none of these rules has a particularly impressive hit rate, so they 
aren't very significant in the grand scheme of SA. Hitting a meager 4.5% of 
spam isn't impressive, although it's not useless.

Some of them, such as ALL_CAP_PORN and UPPERCASE_50_75, have really bad 
quantities of nonspam hits. Anything with an S/O under .90 pretty much 
doesn't deserve a high score because more than 10% of the email that the 
rule matches is nonspam. In the case of these two, both have at least 20% 
of their hits being nonspam mail.. ouch.

Quite frankly, UPPERCASE_50_75 performs so badly it doesn't even meet the 
criteria to avoid being dropped from the ruleset, but is probably retained 
for completeness with the other rules. (in general spam rules need to have 
an S/O of .80 or higher to be deemed "worthwhile".. anything less isn't a 
very good indicator of spam and is just a waste of time).

In the case of the other two, you need to start looking at the larger 
ecosystem of the entire ruleset.. SA rules are not scored based on the 
merits of the rule alone.. the entire ruleset is scored together, and the 
scores of all the rules are tuned to try to get the most spam and nonspam 
placed in the proper piles.

Oftentimes the score of a rule is the result of its interaction with 
other rules. Take our PENIS_ENLARGE2 rule. This rule can quite possibly 
match some nonspam crude joke emails.. Other spam rules will likely match 
these as well, resulting in a high score.

Now, the GA is designed to treat false positives as 100 times worse than 
false negatives, so this is a very drastic situation for the GA. Faced with 
this problem, the proper thing for the GA to do is to try to reduce the 
score of the rule that affects the least amount of the spam pile.. well, 
given that PENIS_ENLARGE2 only matches 4.5% of spam, it's a good candidate 
for reduction.
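
Here's a rough Python sketch of that trade-off. Only the 100-to-1 weighting 
comes from the description above; the rule names (BROAD_RULE, NARROW_RULE, 
OTHER_RULE), the scores, the five-message corpus and the 5.0 threshold are 
hypothetical, picked so the arithmetic is easy to follow.

```python
FP_WEIGHT = 100   # a false positive counts 100 times as much as a false negative
THRESHOLD = 5.0   # assumed cutoff for calling a message spam

# Hypothetical corpus: (rules hit, is_spam). One nonspam crude joke trips two rules.
CORPUS = [
    ({"BROAD_RULE", "NARROW_RULE"}, False),  # the crude joke
    ({"BROAD_RULE", "OTHER_RULE"}, True),
    ({"BROAD_RULE", "OTHER_RULE"}, True),
    ({"BROAD_RULE", "OTHER_RULE"}, True),
    ({"NARROW_RULE", "OTHER_RULE"}, True),
]


def cost(scores):
    """Weighted error count: FP_WEIGHT per false positive, 1 per false negative."""
    false_pos = false_neg = 0
    for hits, is_spam in CORPUS:
        flagged = sum(scores[rule] for rule in hits) >= THRESHOLD
        if flagged and not is_spam:
            false_pos += 1
        elif not flagged and is_spam:
            false_neg += 1
    return FP_WEIGHT * false_pos + false_neg


before = {"BROAD_RULE": 3.0, "NARROW_RULE": 2.5, "OTHER_RULE": 2.5}
lower_narrow = dict(before, NARROW_RULE=0.5)  # cut the rule that hits little spam
lower_broad = dict(before, BROAD_RULE=1.0)    # cut the rule that hits lots of spam

print(cost(before), cost(lower_narrow), cost(lower_broad))
```

Either cut rescues the joke email, but cutting the narrow rule loses one spam 
while cutting the broad rule loses three, so the narrow rule takes the hit.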

-------------------------------------------------------
This SF.Net email sponsored by: ApacheCon 2003,
16-19 November in Las Vegas. Learn firsthand the latest
developments in Apache, PHP, Perl, XML, Java, MySQL,
WebDAV, and more! http://www.apachecon.com/
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk

