Re: Score Hit Frequency in SA Corpus?

Bob Proulx Sun, 21 Sep 2008 17:40:10 -0700

Thanks Daryl and Matt,

Daryl C. W. O'Shea wrote:
> On the web, http://ruleqa.spamassassin.org/

Thanks!

> > What is the hit frequency in the corpus of SUBJ_ALL_CAPS scoring 2.1?
>
> OVERALL   SPAM%    HAM%    S/O  RANK  SCORE  NAME
>   1.116  1.5957  0.2705  0.855  0.51   2.08  SUBJ_ALL_CAPS

Am I reading that correctly?  It looks like an all-caps subject showed
up in 1.60% of the spam in the regression corpus and in only 0.27% of
the non-spam.  Gosh, that seems like a very weak indicator.
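
For reference, the S/O column in that row is consistent with being the
spam hit rate divided by the sum of the spam and ham hit rates.  That
formula is an assumption inferred from the numbers quoted above, not
taken from the SpamAssassin source, and the function name is made up
for illustration.  A minimal sketch:

```python
def so_ratio(spam_pct, ham_pct):
    """Share of a rule's hits that land on spam (assumed S/O formula)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never hit, S/O is undefined")
    return spam_pct / total


# SPAM% and HAM% for SUBJ_ALL_CAPS, copied from the row quoted above.
print(round(so_ratio(1.5957, 0.2705), 3))  # prints 0.855
```

So the rule rarely fires at all, but when it does fire, roughly 85% of
its (percentage-weighted) hits are on spam.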

Matt Kettler wrote:
> You can also grab them from the web image of SVN:
> http://svn.apache.org/repos/asf/spamassassin/branches/3.2/rules/

Cool stuff.

> However, bear in mind, scores are not assigned based on the S/O of the
> rule alone. The whole ruleset is scored collectively as one giant group,
> and tuned to get the best results.

Through the genetic algorithm, yes.  And I know the score on this rule
is just one component.  But this announcement message (congratulating
someone on an achievement) used all caps like crazy, along with other
things, and the sum total made it difficult to distinguish the message
from a typical spam message.

I didn't think the SA rule was undesirable in this case.  The message
was hard for my eye to distinguish from spam at a quick glance, so I
wanted to educate the sender to improve future announcement messages.
But then I wondered how many spam messages actually use all caps.  As
far as I know, that was more common in the old days than it is now.  I
figured the spam corpus would provide the data, but I couldn't figure
out how to find it, so I decided to ask.

> A rule with a high-ish score and a not-so-great S/O suggests that the
> rule's false positives commonly coincide with strongly negative-scoring
> rules. Based on that, the score assignment system will give it an
> "unfairly high" score, because doing so results in fewer FPs than
> assigning a higher score to some other rule that has a better S/O but
> whose nonspam hits are not compensated by negative-scoring rules.
> 
> The whole thing gets quite complicated, but once you realize that
> every rule's score is a function not only of its own hit rate but
> also of what other rules it gets grouped with, you start to get a feel
> for what's going on. Of course, strictly evaluating all combinations of
> all rules would be very computationally expensive, which is why we use
> a perceptron, which generates an estimate. (I believe it's a successive
> approximation type deal, but I'm not deeply familiar with its internal
> workings.)

All good background information.  Thanks for educating me.
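
A toy sketch of the point about compensating negative rules.  The rule
names, scores, and four-message ham corpus are all made up for
illustration, and this only counts false positives under two
hand-picked score assignments.  It is not the perceptron or anything
from the actual SpamAssassin scoring tools.  The 5.0 threshold is
SpamAssassin's default required score.

```python
THRESHOLD = 5.0  # SpamAssassin's default required score


def total_score(hits, scores):
    """A message's score is the sum of the scores of the rules it hit."""
    return sum(scores[rule] for rule in hits)


def false_positives(ham_messages, scores, threshold=THRESHOLD):
    """Count ham messages whose total score reaches the spam threshold."""
    return sum(1 for hits in ham_messages
               if total_score(hits, scores) >= threshold)


# Hypothetical ham corpus: each message is the set of rules it hit.
ham = [
    {"RULE_A", "GOOD_SENDER"},           # RULE_A ham hits always come with
    {"RULE_A", "GOOD_SENDER"},           # a negative-scoring rule
    {"RULE_A", "GOOD_SENDER", "OTHER"},
    {"RULE_B", "OTHER"},                 # RULE_B ham hit has no offset
]

# RULE_A hits more ham (worse S/O) but can still carry the higher score.
high_a = {"RULE_A": 3.0, "RULE_B": 1.0, "OTHER": 2.5, "GOOD_SENDER": -2.0}
high_b = {"RULE_A": 1.0, "RULE_B": 3.0, "OTHER": 2.5, "GOOD_SENDER": -2.0}

print(false_positives(ham, high_a))  # prints 0
print(false_positives(ham, high_b))  # prints 1
```

Here RULE_A hits three ham messages and RULE_B only one, yet giving the
higher score to RULE_A produces no false positives, because every one
of its ham hits is offset by GOOD_SENDER.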

Again, just to be pedantic, I didn't have a complaint about
SUBJ_ALL_CAPS.  I think it is okay.  But the above perhaps does explain
the score for DRUGS_STOCK_MIMEOLE.  That was my other message, and I do
think it scores those messages too aggressively, since it looks like it
hits on mail sent from a normal version of MS Outlook.  But that is
already logged in the tracker.

Thanks!
Bob
