Michael Moncur wrote:
MM> When a new release comes out I like to be anal-retentive and go through the
MM> GA second-guessing its scores. This is my report for 2.30.
A valuable service we've come to count on.
MM> - RATWARE must be fixed, it was negative last time
MM> score RATWARE 4.563
I think ratware should be split into "ratty ratware" and "sometimes ratware"
rules, based on human prediction of which pieces of software are sometimes used
by legitimate bulk mailers.
MM> - This works well for me but users in some countries may want to change it
MM> score SUBJ_FULL_OF_8BITS 4.298
I suspect we want to do two things here:
1. Apply this rule *before* header decoding.
2. Create 65_scores_pl.cf, 65_scores_ru.cf, and so on, each of which resets the
score for this rule to 0.
MM> - This was 0.87 before. Less and less useful?
MM> score FROM_AND_TO_SAME -2.071
I think this should be set to +2 or so, to counteract the being-in-your-own AWL
problem which Dan Kohn mentioned recently.
MM> - Not as weird as all that, apparently
MM> score MSGID_CHARS_WEIRD -2.178
Looks like mail servers (Exchange and Netscape mail server) sometimes create
message ids which look like:
Message-Id: <p05111701b8f970233263@[198.142.175.158]>
I don't know what the origin of the MSGID_CHARS_WEIRD rule was -- are there
other uses of [] inside message ids which are bad?
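To make the question concrete: in the example above, the brackets wrap an IP address literal after the "@". Here is a rough Python sketch that separates that case from any other use of [] in a message id. It is not the actual MSGID_CHARS_WEIRD rule; the helper names and the regex are assumptions made up for illustration, and only IPv4 literals are handled.

```python
import re

# Hypothetical helpers, not the real MSGID_CHARS_WEIRD implementation.
# Matches a Message-Id whose right-hand side is a bracketed IPv4 literal,
# e.g. <p05111701b8f970233263@[198.142.175.158]>.
DOMAIN_LITERAL_RE = re.compile(r"^<[^<>@\[\]]+@\[\d{1,3}(?:\.\d{1,3}){3}\]>$")

def brackets_only_in_ip_literal(msgid: str) -> bool:
    """True if the id has the form <local@[a.b.c.d]> with no other brackets."""
    return bool(DOMAIN_LITERAL_RE.match(msgid.strip()))

def has_other_brackets(msgid: str) -> bool:
    """True if [] appear anywhere except as an IPv4 literal after '@'."""
    has_brackets = "[" in msgid or "]" in msgid
    return has_brackets and not brackets_only_in_ip_literal(msgid)
```

With a split like this, the rule could keep firing on has_other_brackets() while leaving the Exchange/Netscape style ids alone, assuming the other bracket uses really are spammy.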
MM> - Disappointing, perhaps porn_word_test() needs tweaking
MM> score PORN_3 0.522
I think the rule needs to be adjusted so that it doesn't trigger merely on the
presence of 3 listed words in the message, since "asian" and "hardcore" can
occur in legitimate messages. Instead, it should trigger based on the percentage
of words which are in the list, so that longer messages aren't penalized.
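A minimal sketch of the percentage idea, in Python rather than as a patch to porn_word_test(). The word list, the tokenizer and the 5% threshold are all placeholders chosen for illustration, not values taken from the real rule.

```python
import re

# Placeholder word list for illustration only; the real porn_word_test()
# list is not reproduced here.
WORD_LIST = {"asian", "hardcore", "xxx"}

def listed_word_fraction(text: str, word_list=WORD_LIST) -> float:
    """Fraction of the message's words that appear in word_list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in word_list)
    return hits / len(words)

def fraction_rule_fires(text: str, threshold: float = 0.05) -> bool:
    """Fire when listed words make up at least `threshold` of the message.

    The default threshold is an arbitrary placeholder, not a tuned value.
    """
    return listed_word_fraction(text) >= threshold
```

A long legitimate message that happens to mention two listed words falls well under the threshold, while a short message made up mostly of listed words still fires.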
MM> - Lots of missing dates in non-spam?
MM> score DATE_MISSING -2.140
In my own mail archive, there are a number of messages which I've had on file
for years and years, which have been migrated through multiple message stores,
which seem to have lost their Date: headers. Don't know how that happened.
Some of these messages have gone
mbox->PST->Exchange->PST->Exchange->PST->Exchange->PST->mbox->cyrus
I think I'd be in favor of pushing the score up into +ve territory, since
incoming legitimate messages will be a lot more likely to have date headers.
MM> score ASCII_FORM_ENTRY -1.660
Looks like lots of false positives on the appended lines at the bottom of
Sourceforge mailing list messages. This score should probably be pumped up a
little.
MM> score ASKS_BILLING_ADDRESS -0.152
I think this is probably a good score for the rule.
MM> score DEAR_SOMEBODY -0.694
This one's been discussed heavily before.
MM> score EXCUSE_16 -0.721
Lots of disclaimers from lawyers, accountants, bankers, etc contain this type of
message in footers.
MM> score FORGED_HOTMAIL_RCVD -0.356
Well, this is a bad score. In the corpus on this run, there were *no* instances
of this rule in either spam or nonspam. This score should probably be reset
manually to ~2.
MM> score FROM_NAME_NO_SPACES -0.114
Willing to believe the GA on this one.
MM> score GREEN_EXCUSE_1 -2.019
Very odd score allocation -- it only appears in spam in the corpus, but drew a
-ve score. In those 74 messages, though, it always appears with tons of other
highly spammy indicators. I would suggest we manually reset this to ~1.5.
MM> score INTL_EXEC_GUILD -0.039
Ditto, though there are only 16 instances in the corpus here.
MM> score LINES_OF_YELLING -0.036
Hmmm, rule need fixin?
MM> score MONEY_BACK -0.239
MM> score MONEY_MAKING -0.687
These are both somewhat odd -- they occur overwhelmingly more often in spam than
nonspam:
[craig@belphegore masses]$ egrep 'MONEY_(MAKING|BACK)' freqs
1995 1989 6 MONEY_BACK
774 763 11 MONEY_MAKING
I suppose they also always occur in conjunction with other strong spam signs,
and so don't need a high score. The score should probably be set +ve or the
rules removed. Probably the former.
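For anyone who wants to check lines like these quickly, here is a small Python sketch that turns a freqs line into the share of hits that landed in spam. The column order (overall, spam, nonspam, rule name) is an assumption inferred from the numbers quoted above, and the function names are made up for the example.

```python
def parse_freqs_line(line: str):
    """Parse one freqs line, assuming columns: OVERALL SPAM NONSPAM RULE."""
    overall, spam, nonspam, name = line.split()
    return name, int(overall), int(spam), int(nonspam)

def spam_share(line: str) -> float:
    """Fraction of a rule's hits that occurred in spam messages."""
    _, overall, spam, _ = parse_freqs_line(line)
    return spam / overall if overall else 0.0

if __name__ == "__main__":
    for freqs_line in ("1995 1989 6 MONEY_BACK", "774 763 11 MONEY_MAKING"):
        name = parse_freqs_line(freqs_line)[0]
        print(f"{name}: {spam_share(freqs_line):.3f} of hits are in spam")
```

These are raw hit counts, so the share is not normalized by how many spam and nonspam messages the corpus contains.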
MM> score NO_REAL_NAME -1.068
[craig@belphegore masses]$ fgrep NO_REAL freqs
54280 48893 5387 NO_REAL_NAME
Turns out this happens a lot in nonspam. I'd be in favor of leaving the rule
in. Might actually be a sign of sysadmin bias here in the corpus.
MM> score SUBJ_ALL_CAPS -0.054
MM> score SUBJ_ENDS_IN_Q_MARK -0.135
MM> score SUBJ_REMOVE -0.823
MM> score SUSPICIOUS_RECIPS -0.213
MM> score WEB_BUGS -0.430
MM> score X_AUTH_WARNING -0.703
MM> score X_ESMTP -1.662
MM> score X_MSMAIL_PRIORITY_HIGH -0.886
MM> score X_NOT_PRESENT -1.920
MM> score MAILTO_TO_REMOVE -1.669
Blah blah blah, similar excuses for all of these, but it's lunchtime and I've
become hungry.
MM> All in all, I believe the GA is really smarter than I am this time. :)
*gasp*
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk