Thanks Daryl and Matt,

Daryl C. W. O'Shea wrote:
> On the web, http://ruleqa.spamassassin.org/

Thanks!

> > What is the hit frequency in the corpus of SUBJ_ALL_CAPS scoring 2.1?
>
> OVERALL   SPAM%    HAM%    S/O   RANK  SCORE  NAME
>   1.116  1.5957  0.2705  0.855   0.51   2.08  SUBJ_ALL_CAPS

Am I reading that correctly to see that all caps showed up in 1.60% of
the spam in the regression corpus and in only 0.27% of the non-spam?
Gosh, that seems like a very small indicator.

Matt Kettler wrote:
> You can also grab them from the web image of SVN:
> http://svn.apache.org/repos/asf/spamassassin/branches/3.2/rules/

Cool stuff.

> However, bear in mind, scores are not assigned based on the S/O of the
> rule alone. The whole ruleset is scored collectively as one giant group,
> and tuned to get the best results.

Through the genetic algorithm, yes. And I know the score on this rule
is just a component. But in this announcement message (congratulating
someone on an achievement) they were using all caps like crazy along
with other things, and the sum total made it difficult to distinguish
the message from a typical spam message. I didn't think the SA rule
here was undesirable in this case. The message was hard for my eye to
distinguish at a quick glance, and so I wanted to educate the sender to
improve the announcement messages in the future.

But then I wondered how many spam messages actually send things in all
caps. That used to be more true in the old days but not so much these
days, as far as I know. I figured the spam corpus would provide the
data, but I couldn't figure out how to find it and so decided to ask.

> A rule with a high-ish score, and not so great S/O suggests this rule's
> false positives commonly coincide with strong negative scoring rules.
> Based on that, the score assignment system will give it a "unfairly
> high" score because it results in fewer FPs than assigning a higher
> score to some other rule that has a better S/O, but its nonspam hits are
> not compensated by negative scoring rule and would result in more FPs.
>
> The whole thing gets a lot complicated, but when you start to realize
> that every rule's score is not only a function of its own hit-rate, but
> also what other rules it gets grouped with you start to get a feel for
> what's going on. Of course, strictly evaluating all combinations of all
> rules would be very computationally expensive, which is why we use a
> perceptron which generates an estimate. (I believe it's an successive
> approximation type deal, but I'm not deeply familiar with its internal
> workings)

All good background information. Thanks for educating me.

Again, just to be pedantic, I didn't have a complaint about
SUBJ_ALL_CAPS. I think it is okay. But the above perhaps does explain
the score for DRUGS_STOCK_MIMEOLE. That was my other message, and I do
think it scores those messages too aggressively, since it looks like it
hits on a normal version of MS Outlook. But that is already logged in
the tracker.

Thanks!
Bob
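
For anyone checking the table row quoted above by hand, here is a small sketch of the arithmetic. It assumes S/O is computed as SPAM% / (SPAM% + HAM%), which is consistent with the quoted SUBJ_ALL_CAPS row; the function name is made up for illustration and is not part of SpamAssassin.

```python
def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's combined hit rate that comes from spam.

    Assumption: S/O = SPAM% / (SPAM% + HAM%), using the per-class hit
    percentages exactly as they appear in the quoted table.
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never hit in either corpus")
    return spam_pct / total


# Values copied from the quoted SUBJ_ALL_CAPS row.
spam_pct = 1.5957  # percent of spam messages the rule hit
ham_pct = 0.2705   # percent of non-spam messages the rule hit

s_o = spam_over_overall(spam_pct, ham_pct)
ratio = spam_pct / ham_pct

print(f"S/O = {s_o:.3f}")
print(f"spam:ham hit-rate ratio = {ratio:.1f}")
```

So the rule fires on only a small slice of mail, but when it does fire the hit rate in spam is roughly six times the hit rate in ham. As Matt notes, the assigned score still depends on how the rule's hits overlap with other rules, not on S/O alone.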