Re: [SAtalk] Evaluation of 2.30 GA scores

Craig R Hughes Sat, 15 Jun 2002 17:58:12 -0700

Michael Moncur wrote:

MM> When a new release comes out I like to be anal-retentive and go through the
MM> GA second-guessing its scores. This is my report for 2.30.


A valuable service we've come to count on.

MM> - RATWARE must be fixed, it was negative last time
MM> score RATWARE                        4.563

I think ratware should be split into "ratty ratware" and "sometimes ratware"
rules, based on human prediction of which pieces of software are sometimes used
by legitimate bulk mailers.

MM> - This works well for me but users in some countries may want to change it
MM> score SUBJ_FULL_OF_8BITS             4.298

I suspect we probably want to do 2 things here:

1. Apply this rule *before* header decoding
2. Create 65_scores_pl.cf, 65_scores_ru.cf, etc., each of which resets the
score for this rule to 0
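
One reading of point 1 is that a subject whose charset is properly declared with an RFC 2047 encoded-word is plain ASCII on the wire, so a check run before decoding would only fire on undeclared raw 8-bit bytes. The sketch below illustrates that difference in Python; it is not SpamAssassin's code, and the helper names and sample subjects are made up for illustration.

```python
from email.header import decode_header


def eight_bit_fraction(data: bytes) -> float:
    """Return the fraction of bytes that have the high bit set."""
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)


def decode_subject_bytes(raw_subject: bytes) -> bytes:
    """Undo RFC 2047 encoded-words, returning the underlying charset bytes."""
    chunks = []
    # latin-1 maps every byte to exactly one code point, so nothing is lost.
    for value, _charset in decode_header(raw_subject.decode("latin-1")):
        if isinstance(value, str):
            value = value.encode("latin-1")
        chunks.append(value)
    return b"".join(chunks)


# Hypothetical subject with a declared charset: pure ASCII before decoding.
declared = b"=?koi8-r?B?8NLJ18XU?="
# Hypothetical subject sent as undeclared raw 8-bit bytes.
undeclared = b"\xf0\xd2\xc9\xd7\xc5\xd4"

print(eight_bit_fraction(declared))                        # before decoding
print(eight_bit_fraction(decode_subject_bytes(declared)))  # after decoding
print(eight_bit_fraction(undeclared))
```

Run before decoding, the check leaves the declared subject alone and still catches the undeclared one; run after decoding, it cannot tell the two apart.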

MM> - This was 0.87 before. Less and less useful?
MM> score FROM_AND_TO_SAME               -2.071

I think this should be set to +2 or so, to counteract the being-in-your-own AWL
problem which Dan Kohn mentioned recently.

MM> - Not as weird as all that, apparently
MM> score MSGID_CHARS_WEIRD              -2.178

Looks like mail servers (Exchange and Netscape mail server) sometimes create
message ids which look like:

Message-Id: <p05111701b8f970233263@[198.142.175.158]>

I don't know what the origin of the MSGID_CHARS_WEIRD rule was -- are there
other uses of [] inside message ids which are bad?
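
As a rough illustration of how the bracketed-IP form could be told apart from other uses of [], here is a Python sketch that accepts only a message id whose right-hand side is a dotted-quad address literal. The helper name and the pattern are assumptions for illustration; this is not the MSGID_CHARS_WEIRD rule itself, and it does not cover every literal form the mail standards allow.

```python
import re

# <local-part@[a.b.c.d]> with a dotted-quad inside the brackets.
_MSGID_IP_LITERAL = re.compile(r"^<[^@<>\s]+@\[(\d{1,3}(?:\.\d{1,3}){3})\]>$")


def has_ip_literal_domain(msgid: str) -> bool:
    """True if the message id's domain part is a bracketed IPv4 literal."""
    match = _MSGID_IP_LITERAL.match(msgid.strip())
    if not match:
        return False
    # Each octet must be a valid byte value.
    return all(int(octet) <= 255 for octet in match.group(1).split("."))


print(has_ip_literal_domain("<p05111701b8f970233263@[198.142.175.158]>"))
print(has_ip_literal_domain("<abc[1]@example.com>"))
```

A rule could exempt ids that pass this check and keep treating brackets anywhere else as weird.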

MM> - Disappointing, perhaps porn_word_test() needs tweaking
MM> score PORN_3                         0.522

I think the rule needs to be adjusted to not trigger on 3 words' presence in the
message, since "asian" and "hardcore" can occur in legitimate messages.
Instead, it should trigger based on %age of words which are in the list, so that
longer messages aren't penalized.
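
A minimal Python sketch of the percentage idea, assuming a simple lowercase tokenizer and a made-up 5% threshold; it is not how porn_word_test() is written, and the word list holds only the two words mentioned above.

```python
import re


def listed_word_fraction(text: str, word_list: set) -> float:
    """Fraction of the words in text that appear in word_list."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in word_list)
    return hits / len(words)


def should_trigger(text: str, word_list: set, threshold: float = 0.05) -> bool:
    """Trigger on the share of listed words, not on an absolute count."""
    return listed_word_fraction(text, word_list) >= threshold


WORDS = {"asian", "hardcore"}  # illustrative only

short_msg = "hardcore asian hardcore"
# 100 words, 3 of them listed: the same absolute count as short_msg.
long_msg = " ".join(["meeting"] * 97 + ["asian", "hardcore", "asian"])

print(should_trigger(short_msg, WORDS))
print(should_trigger(long_msg, WORDS))
```

A pure percentage would also fire on a one-word message, so a real rule would probably want a minimum hit count as well.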

MM> - Lots of missing dates in non-spam?
MM> score DATE_MISSING                   -2.140

In my own mail archive, there are a number of messages which I've had on file
for years and years, which have been migrated through multiple message stores,
which seem to have lost their Date: headers.  Don't know how that happened.
Some of these messages have gone through:

mbox->PST->Exchange->PST->Exchange->PST->Exchange->PST->mbox->cyrus

I think I'd be in favor of pushing the score up into +ve territory, since
incoming legitimate messages will be a lot more likely to have date headers.

MM> score ASCII_FORM_ENTRY               -1.660

Looks like lots of false positives on the appended lines at the bottom of
Sourceforge mailing list messages.  This score should probably be pumped up a
little.

MM> score ASKS_BILLING_ADDRESS           -0.152

I think this is probably a good score for the rule.

MM> score DEAR_SOMEBODY                  -0.694

This one's been discussed heavily before.

MM> score EXCUSE_16                      -0.721

Lots of disclaimers from lawyers, accountants, bankers, etc contain this type of
message in footers.

MM> score FORGED_HOTMAIL_RCVD            -0.356

Well, this is a bad score.  In the corpus on this run, there were *no* instances
of this rule in either spam or nonspam.  This score should probably be reset
manually to ~2.

MM> score FROM_NAME_NO_SPACES            -0.114

Willing to believe the GA on this one.

MM> score GREEN_EXCUSE_1                 -2.019

Very odd score allocation -- it only appears in spam in the corpus, but drew a
-ve score.  In those 74 messages, though, it always appears with tons of other
highly spammy indicators.  I would suggest we manually reset this to ~1.5.

MM> score INTL_EXEC_GUILD                -0.039

Ditto.  Only 16 instances in the corpus though here.

MM> score LINES_OF_YELLING               -0.036

Hmmm, rule need fixin?

MM> score MONEY_BACK                     -0.239
MM> score MONEY_MAKING                   -0.687

These are both somewhat odd -- they occur overwhelmingly more often in spam than
nonspam:

[craig@belphegore masses]$ egrep 'MONEY_(MAKING|BACK)' freqs
      1995 1989 6 MONEY_BACK
       774 763 11 MONEY_MAKING

I suppose they also always occur in conjunction with other strong spam signs,
and so don't need a high score.  The scores should probably be set +ve or the
rules removed.  Probably the former.
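
To put a number on "overwhelmingly", here is a small Python sketch that reads lines in the shape shown above. It assumes the columns are total hits, spam hits, nonspam hits, and rule name, which is inferred from the figures (1995 = 1989 + 6) rather than stated anywhere in this message. The share is over raw counts, with no adjustment for how much spam and nonspam the corpus holds.

```python
def parse_freqs_line(line: str):
    """Split one freqs line into (rule_name, spam_hits, nonspam_hits)."""
    # Assumed column order: total, spam, nonspam, rule name.
    _total, spam, nonspam, name = line.split()
    return name, int(spam), int(nonspam)


def spam_share(spam: int, nonspam: int) -> float:
    """Share of a rule's hits that landed on spam (0.0 if it never hit)."""
    total = spam + nonspam
    return spam / total if total else 0.0


lines = [
    "      1995 1989 6 MONEY_BACK",
    "       774 763 11 MONEY_MAKING",
]

for line in lines:
    name, spam, nonspam = parse_freqs_line(line)
    print(f"{name}: {spam_share(spam, nonspam):.1%} of hits are spam")
```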

MM> score NO_REAL_NAME                   -1.068

[craig@belphegore masses]$ fgrep NO_REAL freqs
     54280 48893 5387 NO_REAL_NAME

Turns out this happens a lot in nonspam.  I'd be in favor of leaving the rule
in.  Might actually be a sign of sysadmin bias here in the corpus.

MM> score SUBJ_ALL_CAPS                  -0.054
MM> score SUBJ_ENDS_IN_Q_MARK            -0.135
MM> score SUBJ_REMOVE                    -0.823
MM> score SUSPICIOUS_RECIPS              -0.213
MM> score WEB_BUGS                       -0.430
MM> score X_AUTH_WARNING                 -0.703
MM> score X_ESMTP                        -1.662
MM> score X_MSMAIL_PRIORITY_HIGH         -0.886
MM> score X_NOT_PRESENT                  -1.920
MM> score MAILTO_TO_REMOVE               -1.669

Blah blah blah, similar excuses for all of these, but it's lunchtime and I've
become hungry.

MM> All in all, I believe the GA is really smarter than I am this time. :)

*gasp*



