Michael Moncur wrote:

MM> When a new release comes out I like to be anal-retentive and go through the
MM> GA second-guessing its scores. This is my report for 2.30.

A valuable service we've come to count on.

MM> - RATWARE must be fixed, it was negative last time
MM> score RATWARE 4.563

I think ratware should be split into "ratty ratware" and "sometimes
ratware" rules, based on human prediction of which pieces of software are
sometimes used by legitimate bulk mailers.

MM> - This works well for me but users in some countries may want to change it
MM> score SUBJ_FULL_OF_8BITS 4.298

I suspect we probably want to do 2 things here:

1. Apply this rule *before* header decoding
2. Create a 65_scores_pl.cf, 65_scores_ru.cf, etc, etc, etc which resets
   the score for this rule to 0

MM> - This was 0.87 before. Less and less useful?
MM> score FROM_AND_TO_SAME -2.071

I think this should be set to +2 or so, to counteract the
being-in-your-own AWL problem which Dan Kohn mentioned recently.

MM> - Not as weird as all that, apparently
MM> score MSGID_CHARS_WEIRD -2.178

Looks like mail servers (Exchange and Netscape mail server) sometimes
create message ids which look like:

    Message-Id: <p05111701b8f970233263@[198.142.175.158]>

I don't know what the origin of the MSGID_CHARS_WEIRD rule was -- are there
other uses of [] inside message ids which are bad?

MM> - Disappointing, perhaps porn_word_test() needs tweaking
MM> score PORN_3 0.522

I think the rule needs to be adjusted to not trigger on 3 words' presence in
the message, since "asian" and "hardcore" can occur in legitimate messages.
Instead, it should trigger based on %age of words which are in the list, so
that longer messages aren't penalized.

MM> - Lots of missing dates in non-spam?
MM> score DATE_MISSING -2.140

In my own mail archive, there are a number of messages which I've had on
file for years and years, which have been migrated through multiple message
stores, which seem to have lost their Date: headers. Don't know how that
happened.
Some of these messages have gone
mbox->PST->Exchange->PST->Exchange->PST->Exchange->PST->mbox->cyrus

I think I'd be in favor of pushing the score up into +ve territory, since
incoming legitimate messages will be a lot more likely to have date headers.

MM> score ASCII_FORM_ENTRY -1.660

Looks like lots of false positives on the appended lines at the bottom of
Sourceforge mailing list messages. This score should probably be pumped up
a little.

MM> score ASKS_BILLING_ADDRESS -0.152

I think this is a good score for the rule probably.

MM> score DEAR_SOMEBODY -0.694

This one's been discussed heavily before.

MM> score EXCUSE_16 -0.721

Lots of disclaimers from lawyers, accountants, bankers, etc contain this
type of message in footers.

MM> score FORGED_HOTMAIL_RCVD -0.356

Well, this is a bad score. In the corpus on this run, there were *no*
instances of this rule in either spam or nonspam. This score should be
reset manually to probably ~2.

MM> score FROM_NAME_NO_SPACES -0.114

Willing to believe the GA on this one.

MM> score GREEN_EXCUSE_1 -2.019

Very odd score allocation -- it only appears in spam in the corpus, but
drew a -ve score. In those 74 messages, though, it always appears with tons
of other highly spammy indicators. I would suggest we manually reset this
to ~1.5

MM> score INTL_EXEC_GUILD -0.039

Ditto. Only 16 instances in the corpus though here.

MM> score LINES_OF_YELLING -0.036

Hmmm, rule need fixin?

MM> score MONEY_BACK -0.239
MM> score MONEY_MAKING -0.687

These are both somewhat odd -- they occur overwhelmingly more often in spam
than nonspam:

    [craig@belphegore masses]$ egrep 'MONEY_(MAKING|BACK)' freqs
     1995  1989     6  MONEY_BACK
      774   763    11  MONEY_MAKING

I suppose they also occur always in conjunction with other strong spam
signs, and so don't need a high score. The score should probably be set +ve
or the rules removed. Probably the former.
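
For anyone who wants to repeat the MONEY_BACK / MONEY_MAKING comparison on
other rules, here is a rough Python sketch that reads lines in the shape
quoted from the freqs file and reports what share of each rule's hits landed
on spam. The column order (total, spam, nonspam, rule name) is an assumption
inferred from the numbers above, where the first figure equals the sum of the
next two; parse_freqs and spam_share are made-up helper names, not part of
SpamAssassin.

```python
# Sketch only: the column order (total, spam, nonspam, rule) is an assumption
# inferred from the quoted figures, where column 1 == column 2 + column 3.
FREQS_SAMPLE = """\
1995 1989 6 MONEY_BACK
774 763 11 MONEY_MAKING
"""


def parse_freqs(text):
    """Return {rule: (spam_hits, nonspam_hits)} for each 4-field line."""
    hits = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank lines and anything not in the assumed shape
        _total, spam, nonspam, rule = fields
        hits[rule] = (int(spam), int(nonspam))
    return hits


def spam_share(spam, nonspam):
    """Fraction of a rule's raw hits that were on spam (0.0 if no hits)."""
    total = spam + nonspam
    return spam / total if total else 0.0


if __name__ == "__main__":
    for rule, (spam, nonspam) in parse_freqs(FREQS_SAMPLE).items():
        print(f"{rule}: {spam_share(spam, nonspam):.1%} of hits are spam")
```

Note that this works on raw hit counts only. It does not correct for how many
spam and nonspam messages are in the corpus, so it is a sanity check on a GA
score rather than a replacement for one.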

MM> score NO_REAL_NAME -1.068

    [craig@belphegore masses]$ fgrep NO_REAL freqs
    54280 48893  5387  NO_REAL_NAME

Turns out this happens a lot in nonspam. I'd be in favor of leaving the
rule in. Might actually be a sign of sysadmin bias here in the corpus.

MM> score SUBJ_ALL_CAPS -0.054
MM> score SUBJ_ENDS_IN_Q_MARK -0.135
MM> score SUBJ_REMOVE -0.823
MM> score SUSPICIOUS_RECIPS -0.213
MM> score WEB_BUGS -0.430
MM> score X_AUTH_WARNING -0.703
MM> score X_ESMTP -1.662
MM> score X_MSMAIL_PRIORITY_HIGH -0.886
MM> score X_NOT_PRESENT -1.920
MM> score MAILTO_TO_REMOVE -1.669

Blah blah blah, similar excuses for all of these, but it's lunchtime and
I've become hungry.

MM> All in all, I believe the GA is really smarter than I am this time. :)

*gasp*

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk