Re: [SAtalk] scoring system and values...

maarten van den Berg Fri, 07 Nov 2003 15:01:59 -0800

On Friday 07 November 2003 18:43, Matt Kettler wrote:
> At 10:29 AM 11/7/2003, Maarten J H van den Berg wrote:
> >Sorry if this has been discussed in the past...
>
> It's been discussed many times.. It's very common for people to have a very
> deep misunderstanding of how SA scoring works. Most people fall into the
> trap of over-simplifying the problem, and simply assuming that some rule or
> another "must" be a good spam rule, when in fact it's not.


Well...  I do of course understand that filtering spam entails more than 
"kill_anything_suspicious"  ;-)  The problem is deepened by the rule that 
above all you want no false positives, which is very good in and of itself.
The fact that you 'train' spamassassin by looking at a LOT of ham and spam and 
derive your rulesets from that is also (very!) commendable. 
But put yourself in my place. Upon looking at those rules I see a LOT of
inconsistencies. For instance, I found these rules that have a score of zero(!)
(and these are merely the top of a large iceberg)

score CASHCASHCASH 0
score ADDRESSES_ON_CD 0
score BLANK_LINES_90_100 0
score EJACULATION 0
score HERBAL_V+AG+A 0

One could argue that yelling CASH CASH CASH is a valid sales pitch in a normal 
mail. But hey, are we being realistic here ?  How could anything but spam 
have this property ?  For addresses_on_cd one could argue that it IS possible 
to have such a statement in a regular email (albeit that's already stretching 
it) but then I would retort that although possible it would stand to reason 
to give it at LEAST a score of 0.5 or so, but not _zero_!  And the third, 
well, it could be a misconfigured client, but still, is an email that is 90% 
<thin air> worthy of being treated as a valid email?  And the fourth...  of
course you will find "ejaculation" in many many forums but, again, give it at 
least some low figure but NOT equal zero...    
And...  well I won't even go into the fifth rule... come on ;-)

> >Of course this is open to debate, but then again that's all I want;
> >possibly a debate about how accurate the scoring is right now...
>
> That's fine.. but in the next round you're going to have to do a LOT more
> homework.. you're over-simplifying things by merely looking at the name of
> the rule... You're not looking at its performance levels, its impact on
> nonspam, or its interactions with other rules.

You do not know how much I crosschecked. But I have to admit I'm new to this 
list so yeah, I do understand your criticism.  But I have looked up a lot of
those rules, just to be sure what they check on _exactly_.

Besides, I WANT to learn, so if you can point me to older discussions about 
this I would definitely appreciate that. (maybe the approximate month, or a 
subject to look for...) I just haven't been able to find it yet.

> Questioning the accuracy of the scoring system isn't unreasonable.. but the
> scoring system is VASTLY more complicated than you can understand in a few
> hours of study. You need to have a good understanding of how it really
> works, and just how complicated the balance of the scoring system is before
> you can make reasonable judgements about accuracy.

Well, I'll grant you that much although I did study it a fair amount. But 
let's look at another aspect here too. There is not a single rule that scores 
higher than 4.999. That is plain wrong in my book; let's say we encounter the 
word "vicodin" (which is totally absent in the current rules by the way!). 
I would then say "let's score that 5.50 immediately and IF it is a regular 
email it must 'prove' that fact by having 'positive' points like known_mua or 
what have you. I'd say let the burden be on the one guy that IS discussing 
vicodin and let him have those addresses whitelisted...   That might be a 
bold statement but let's be realistic here: there is a WAR going on guys...
Giving "vicodin" the benefit of the doubt is, well, VERY doubtful at best...!

> You need to realize the SA scoring system is somewhat analogous to curve
> fitting an equation with 873 variables (there are 873 rules in SA 2.60's
> 50_scores.cf). This is done as an approximation using a genetic algorithm
> to evolve a solution, since a direct solution would take too long to
> compute. Trying to get your mind completely around an equation with that
> many variables is not possible for most humans, including me, but I've
> learned to understand and respect how complex the problem is.

Hum. Okay.  But keep in mind I DO NOT question 95% of the rules. Only, some 
just stick out like a sore thumb. Like, the nigerian spam thingy. Or, better, 
one I discovered during testing: the word v1agr4 (had to spell it this way 
for this list but I mean in the correct spelling here) in the body text is 
not recognized or tagged. Only if it is spelled with a capital V it gets 
tagged. That is not really okay, is it ? 
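
To illustrate the kind of behaviour I mean, here is a minimal Python sketch. It assumes (I have not checked the actual rule definition) that the rule's pattern is simply missing a case-insensitive flag. The word "vicodin" is only a stand-in, and both patterns are hypothetical, not taken from the SA ruleset.

```python
import re

# Hypothetical patterns; "vicodin" is a stand-in word, not the real rule text.
case_sensitive = re.compile(r"\bVicodin\b")                    # matches the capitalised form only
case_insensitive = re.compile(r"\bvicodin\b", re.IGNORECASE)   # matches any capitalisation

def hits(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None

for sample in ["Buy Vicodin now", "buy vicodin now", "BUY VICODIN NOW"]:
    print(f"{sample!r}: case-sensitive={hits(case_sensitive, sample)}, "
          f"case-insensitive={hits(case_insensitive, sample)}")
```

If that assumption holds, only the first sample trips the case-sensitive pattern, while the case-insensitive one catches all three.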

> >List 1:
> >score ALL_CAP_PORN 0.650 0.669 0 0
> >score PENIS_ENLARGE2 0.500 0.590 0 0.501
> >score UPPERCASE_50_75 0.794 1.137 0 0
> >score V+AG+A_ONLINE 1.100 1.101 3.151 4.056
> >
> >If it were up to me, I'd say that giving only half a point to a mail that
> >scores PENIS_ENLARGE2 is...  well, ludicrous.  Let's not kid ourselves.
> >IF there are people who participate on a genuine mailinglist that
> >discusses penis enlargement, let the burden fall on them to put those
> >addresses in their whitelist, not the reverse.
>
> OK, being that it's not up to you, let's look at the real-world performance
> of these rules from STATISTICS.txt

Not wanting to be a PITA ;-), I would almost start questioning the statistics 
file cause it seems not to reflect real-life situations. But hey, who am I ? 

> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>    1.010   1.5010   0.0893    0.944   0.80    0.65  ALL_CAP_PORN
>    2.962   4.5216   0.0418    0.991   0.93    0.50  PENIS_ENLARGE2
>    0.580   0.8552   0.0645    0.930   0.77    0.79  UPPERCASE_50_75
>    1.040   1.5930   0.0032    0.998   0.95    1.10  V+AG+A_ONLINE
>
> *yawn*.. none of these rules has particularly impressive hit rates, so they
> aren't very significant in the grand scheme of SA. A meager 4.5% of spam
> hits isn't impressive, although not useless.

That is correct and I respect your view. But if you look at it from the other 
side of the looking glass you might be asking yourself "How many people, 
percentage-wise, would be glad to receive said penis_enlarge2 email ?"
My guess is, not many. Not by a loooooooong shot... What, maybe 0,003 % ??

> Some of them, such as ALL_CAP_PORN and UPPERCASE_50_75 have really bad
> quantities of nonspam hits. Anything with a S/O under 90 pretty much
> doesn't deserve a high score because 10% of the email that the rule matches
> is nonspam. In the case of these two, both have at least 20% of their hits
> being nonspam mail.. ouch.

Well, looking at the figures - IF I read them correctly - the fourth and last,
infamous one hits 1.5930% of spam versus 0.0032% of ham, which gives an S/O of
0.998. That is
high by any standard, right ?  But the fact is, that rule only gets scored 
1.100.  What is that ? That's close to nothing...!
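
For reference, the S/O column in the quoted table appears to be SPAM% / (SPAM% + HAM%). That formula is my reading of the figures, not something stated in this thread, but it reproduces all four quoted values. A small Python sketch, using only the numbers from the table above:

```python
def s_over_o(spam_pct, ham_pct):
    """Fraction of a rule's hits that are spam: SPAM% / (SPAM% + HAM%)."""
    return spam_pct / (spam_pct + ham_pct)

# (SPAM%, HAM%) pairs copied from the STATISTICS.txt excerpt quoted above.
ROWS = {
    "ALL_CAP_PORN":    (1.5010, 0.0893),
    "PENIS_ENLARGE2":  (4.5216, 0.0418),
    "UPPERCASE_50_75": (0.8552, 0.0645),
    "V+AG+A_ONLINE":   (1.5930, 0.0032),
}

for name, (spam, ham) in ROWS.items():
    print(f"{name:16s} S/O = {s_over_o(spam, ham):.3f}")
```

Rounded to three decimals this gives 0.944, 0.991, 0.930 and 0.998, matching the S/O column.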

> Quite frankly, UPPERCASE_50_75 performs so badly it doesn't even meet the
> criteria to avoid being dropped from the ruleset, but is probably retained
> for completeness with the other rules. (in general spam rules need to have
> an S/O of .80 or higher to be deemed "worthwhile".. anything less isn't a
> very good indicator of spam and is just a waste of time).

I'm not questioning the .80 threshold rule. What I AM questioning is the 
scoring for rules that fall inside (and sometimes fall WELL inside) those 
constraints (the constraints which,  I repeat, I do not question).

Okay, so then I would have to concede that although not being spam per se, 
UPPERCASE_50_75 is "bad email" as opposed to spam (I'm used to the 
netiquette).  Indeed it is not SpamAssassin's place to judge netiquette, but 
still I have a somewhat hard time accepting that. But okay, granted,  being 
anxiously precise I'd have to agree that that is not spam (as such).

> In the case of the other two, you need to start looking at the larger
> ecosystem of the entire ruleset.. SA rules are not scored based on the
> merits of the rule alone.. the entire ruleset is scored together, and the
> scores of all the rules are tuned to try to get the most spam and nonspam
> placed in the proper piles.

Of course. I know. The reason I started writing this in the first place is 
just _because_ I see so many messages that are SO full of spam signs, yet 
invariably score 4.90...  And thus, they fall right through...  :-(( 
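
A rough Python sketch of what happens to such a message, assuming the usual model in which the scores of all matching rules are summed and compared against a threshold of 5.0. The threshold value and the `>=` comparison are my assumptions here. The scores are the first of the four values on each line of List 1.

```python
REQUIRED_HITS = 5.0  # assumed spam threshold, not stated in this thread

# First score column from List 1 above.
SCORES = {
    "ALL_CAP_PORN":    0.650,
    "PENIS_ENLARGE2":  0.500,
    "UPPERCASE_50_75": 0.794,
    "V+AG+A_ONLINE":   1.100,
}

def total_score(rules_hit, scores=SCORES):
    """Sum the scores of every rule the message matched (unknown rules count as 0)."""
    return sum(scores.get(rule, 0.0) for rule in rules_hit)

def is_spam(rules_hit, threshold=REQUIRED_HITS):
    """A message is tagged only when its total reaches the threshold."""
    return total_score(rules_hit) >= threshold

message_hits = list(SCORES)  # a message that trips all four rules at once
print(round(total_score(message_hits), 3), is_spam(message_hits))
```

Even a message that hits all four of these rules totals only 3.044 under these assumptions, so it stays below the threshold unless other rules add to it.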

> Often times the score of a rule is the result of its interaction with
> other rules. Take our PENIS_ENLARGE2 rule. This rule can quite possibly
> match some nonspam crude joke emails.. Other spam rules will likely match
> these as well, resulting in a high score.

In theory yeah. In practice, I find it rather lacking.  
You are of course right in one respect: if you really _demand_ that a joke
which contains all of the words vicodin, mortgage and penis, and is set in
all caps, is NOT marked as spam (cause it factually isn't), then you
are right. But at the same time you will notice you have lost the spam war...  
And is that bad joke really worth that ? 

> Now, the GA is designed to treat false positives as 100 times worse than
> false negatives, so this is a very drastic situation for the GA. Faced with
> this problem, the proper thing for the GA to do is to try to reduce the
> score of the rule that affects the least amount of the spam pile.. well,
> given that PENIS_ENLARGE2 only matches 4.5% of spam, it's a good candidate
> for reduction.
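
The trade-off described in the quote can be sketched with a toy cost function in Python. This is not the actual GA, only an illustration of the 100-to-1 weighting on a made-up three-message corpus. The threshold of 5.0, the rule names CRUDE_WORDS and FORGED_HEADERS, and all the scores used here are hypothetical.

```python
# Toy corpus of (rules hit, is_spam). Only PENIS_ENLARGE2 is a rule named in
# this thread; CRUDE_WORDS and FORGED_HEADERS are hypothetical.
CORPUS = [
    ({"PENIS_ENLARGE2", "CRUDE_WORDS"}, False),                   # crude joke (ham)
    ({"PENIS_ENLARGE2", "CRUDE_WORDS", "FORGED_HEADERS"}, True),  # spam
    ({"PENIS_ENLARGE2"}, True),                                   # spam
]

def cost(scores, corpus=CORPUS, threshold=5.0, fp_weight=100):
    """Weighted error count: a false positive costs fp_weight, a false negative 1."""
    total = 0
    for rules, is_spam in corpus:
        flagged = sum(scores[rule] for rule in rules) >= threshold
        if flagged and not is_spam:
            total += fp_weight   # ham wrongly tagged as spam
        elif is_spam and not flagged:
            total += 1           # spam that slipped through
    return total

high = {"PENIS_ENLARGE2": 3.0, "CRUDE_WORDS": 2.5, "FORGED_HEADERS": 3.0}
low = dict(high, PENIS_ENLARGE2=0.5)  # same scores, but the rule is reduced

print(cost(high), cost(low))
```

With the high score the joke is tagged and the cost is 101. With the reduced score the joke passes, the same single spam is still missed, and the cost drops to 1. Under this weighting, lowering the rule is the cheaper choice.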

I understand the goals of SA perfectly. But meanwhile, let me point this out:
There is another side to this coin.  Most people are much more aggravated and 
/ or embarrassed by sexually oriented spam than by marketing mails about
ehm... mortgages, to name something abundant.  In my perspective, it could be 
wise to focus somewhat more on the signs that really stand out as <evil spam> 
and not per se on the sheer volume of those.  Most customers I spoke to 
couldn't care less if they receive an extra mortgage offer, but DO tend to 
get angry when they get to see photos of gaping cunts and penis enlargers in 
a blinking cyan-font-colored html email. 
(Not that I myself am a prude -far from it- but you get the idea, right...)

Nice to exchange thoughts about this though. :-)

Kind regards,
Maarten

-- 
Yes of course I'm sure it's the red cable. I guarante[^%!/+)F#0c|'NO CARRIER



