Last night, I wrote a small Perl script to gauge the "goodness" of
each rule and its score.  The intent is to identify rules that need to
be re-examined.

If a positively-scored rule matches a spam, its goodness goes up by
its score, but if it matches a non-spam, its goodness goes down by
its score.  The inverse is true for negatively-scored rules.  You can
weight false positives if you want (I didn't in the table below).
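
A minimal sketch of that arithmetic, in Python rather than the Perl of
the actual script (which isn't shown here).  The function name and the
fp_weight parameter are made up for illustration, and applying the
weight only to positively-scored rules that hit non-spam is an
assumption about what "weight false positives" would mean.

```python
def goodness(score, spam_hits, nonspam_hits, fp_weight=1.0):
    """Aggregate goodness of one rule over a corpus.

    score        -- the rule's score (positive = spam indicator)
    spam_hits    -- number of spam messages the rule matched
    nonspam_hits -- number of non-spam messages the rule matched
    fp_weight    -- hypothetical extra weight on false positives
    """
    if score >= 0:
        # Positive rule: spam hits add the score, non-spam hits subtract it.
        return score * (spam_hits - fp_weight * nonspam_hits)
    # Negative rule: the inverse. Non-spam hits add |score|, spam hits
    # subtract it.
    return -score * (nonspam_hits - spam_hits)


# The scores below are not quoted in this post; they are back-computed
# from two table rows as goodness / (spams - non-spams).
print(goodness(1.0, 57, 1274))    # X_AUTH_WARNING row
print(goodness(-100.0, 2, 752))   # USER_IN_WHITELIST row
```

With fp_weight left at 1.0, both branches reduce to
score * (spam_hits - nonspam_hits), which is why a rule that matches
spam and non-spam about equally often lands near 0 whatever its score.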

Some interesting points:

(1) I tried exempting all company-internal messages since that's what
    I do with my real mail (I don't exempt them when developing new
    rules, though, since some people might run internal mail through
    SA).  However, when I didn't exempt them, it changed the results
    for some negative tests, always in the same direction:

                             all messages         internal excluded
      FROM_AND_TO_SAME            bad                    good
      VERY_SUSP_CC_RECIPS         bad                    good
      VERY_SUSP_RECIPS            bad                    good
      X_NOT_PRESENT               bad                    good
      X_PRIORITY_HIGH             bad                    good

    A clearer way to put it is: the above tests don't work well if you
    run your local mail through them.  It makes sense too.

    FROM_AND_TO_SAME - I mail myself notes
    VERY_SUSP_RECIPS and VERY_SUSP_CC_RECIPS - people use large
      internal To and Cc all the time
    X_NOT_PRESENT - Unix types sending quick notes via mailx, some
      internal "one-off" tools/programs don't have any X- headers.
    X_PRIORITY_HIGH - I'm less certain about this one, I guess people
      aren't shy about using this internally?

    (The below table is everything, both internal and external.)

(2) Some "mailing lists are good" tests did not help me so much since
    I have been put on more than one semi-legit mailing list without
    someone asking me first.  This could be a problem as more spammers
    start sending spam that looks like mailing-list traffic.

(3) Why are these negative?

    MAILTO_WITH_SUBJ
    HTML_WITH_BGCOLOR
    SLIGHTLY_UNSAFE_JAVASCRIPT
    OPPORTUNITY
    ALL_CAPS_HEADER
    [ and more ]

    Seems like they've earned negative scores even though they are
    clearly spam detectors.  Rules that aren't effective but are not
    intended to detect legitimate mail should be scored at 0.0, not
    negatively.

(4) Some rules match so rarely, they might as well get deleted.  Maybe
    they match for someone else...

(5) Clearly, a few rules need to be fixed.  X_AUTH_WARNING especially.
    Even when I exempt internal mail, this one is still the worst of
    the lot.
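
The all-messages vs. internal-excluded comparison in point (1) can be
sketched as a per-message tally with an optional exemption test.  This
is only an illustration: the message fields ("hits", "is_spam",
"internal"), the toy corpus, and the score of 1.0 are all hypothetical
and not taken from my mail or from the real script.

```python
from collections import defaultdict


def tally(messages, scores, exempt=lambda msg: False):
    """Sum per-rule goodness over messages, skipping exempted ones."""
    totals = defaultdict(float)
    for msg in messages:
        if exempt(msg):
            continue
        for rule in msg["hits"]:
            s = scores[rule]
            # A spam hit adds the signed score, a non-spam hit subtracts
            # it, so negatively-scored rules get the inverse treatment.
            totals[rule] += s if msg["is_spam"] else -s
    return dict(totals)


# Hypothetical toy corpus: one external spam, two internal notes-to-self.
scores = {"FROM_AND_TO_SAME": 1.0}
messages = [
    {"hits": ["FROM_AND_TO_SAME"], "is_spam": True, "internal": False},
    {"hits": ["FROM_AND_TO_SAME"], "is_spam": False, "internal": True},
    {"hits": ["FROM_AND_TO_SAME"], "is_spam": False, "internal": True},
]

everything = tally(messages, scores)
external_only = tally(messages, scores, exempt=lambda m: m["internal"])
print(everything, external_only)
```

In this toy case the rule comes out negative over all messages and
positive once internal mail is exempted, the same direction of change
as in the small table under point (1).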

Notes:

There are a few experimental/modified rules in here that aren't yet in
the CVS tree:

  INVALID_DATE
  MIME_MISSING_BOUNDARY
  MIME_SUSPECT_NAME
  UNDESIRED_LANGUAGE_BODY
  X_PRECEDENCE_REF_PRESENT

And two that are gone:

  INVALID_DATE_NO_TZ
  INVALID_DATE_ODD_MONTH

These are sorted from worst to best.  Rules near the top need to be
re-examined, IMO.  Some may need to have their score sign-reversed.
Some may just need to be removed.  This is all very me-ish, of course.

Scores that are not GA-evolved have a "*" next to them.
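
For anyone who wants to produce a similar listing, here is a rough
sketch of the worst-to-best ordering and the "*" marker.  The four
rows are copied from the table; the tuple layout, the helper name, and
the column widths are assumptions and only approximate the layout
used here.

```python
# (rule, goodness, spams, non-spams, not_ga_evolved); None marks a
# column that is blank in the table.
rows = [
    ("USER_IN_WHITELIST", 75000, 2, 752, True),
    ("COPYRIGHT_CLAIMED", -130, 89, 6, False),
    ("IN_REP_TO", 6828, None, 1541, False),
    ("X_AUTH_WARNING", -1217, 57, 1274, False),
]


def format_row(name, goodness, spams, nonspams, not_ga_evolved):
    """Render one row; blank counts stay blank, '*' flags hand-set scores."""
    star = "*" if not_ga_evolved else " "
    spam_cell = "" if spams is None else str(spams)
    nonspam_cell = "" if nonspams is None else str(nonspams)
    line = f"{name:<32}{star}{goodness:>9}{spam_cell:>10}{nonspam_cell:>10}"
    return line.rstrip()


# Ascending goodness puts the worst rules at the top.
ranked = sorted(rows, key=lambda r: r[1])
for row in ranked:
    print(format_row(*row))
```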

Table:

rule                              goodness     spams non-spams
------------------------------------------------------------------------
X_AUTH_WARNING                       -1217        57      1274
COPYRIGHT_CLAIMED                     -130        89         6
WEB_BUGS                              -125       153         1
LINES_OF_YELLING_3                     -74        68        19
GAPPY_TEXT                             -61        61        11
MAILTO_WITH_SUBJ                       -47       154
HTML_WITH_BGCOLOR                      -42        77
SLIGHTLY_UNSAFE_JAVASCRIPT             -33        43         1
SIGNATURE_DELIM                        -32         1        44
X_NOT_PRESENT                          -32       401       465
OPPORTUNITY                            -21        21
FROM_AND_TO_SAME                       -19        74        96
PLEASE_READ                            -15         2        18
PGP_SIGNATURE                          -14        12         5
RELAYING_FRAME                         -12        23         1
X_MSMAIL_PRIORITY_HIGH                 -12        17         8
X_PRIORITY_HIGH                        -11        24        31
DEAR_SOMEBODY                          -11        66        42
TO_BE_REMOVED_REPLY                    -10         5
ALL_CAPS_HEADER                         -7        26
VERY_SUSP_CC_RECIPS                     -6        71        75
MIME_NULL_BLOCK                         -5                  36
VERY_SUSP_RECIPS                        -4        55        57
REPLY_TO_EMPTY                          -4         1         2
NO_QS_ASKED                             -3         5
NO_EXPERIENCE                           -3         3
RATWARE                                 -2         3
EXCUSE_6                                -2        19
TO_UNSUB_REPLY                          -1         1
JAVASCRIPT_URI                          -1         2         1
DOMAIN_SUBJECT                          -1         1         3
PORN_10                                  0        80        87
SUPERLONG_LINE                           0       128       127
SUSPICIOUS_RECIPS                        0        31        10
REAL_THING                               0         2
TO_LOCALPART_EQ_REAL                     0        12        12
MSG_ID_ADDED_BY_MTA                      0         1         1
BE_AMAZED                                0         2         2
ONCE_IN_LIFETIME                         0         1
HTTP_NUMBER_WORD                         0         1
HR_3113                                  0         1
AOL_USERS_LINK                           0         1
ONLINE_BIZ_OPS                  *        1         1
X_MAIL_ID_PRESENT                        1         1
ASCII_FORM_ENTRY                         1        34         6
MURKOWSKI_CRUFT                 *        1         1
NO_DISSAPOINTMENT                        1         1
FORGED_RCVD_FOUND               *        1         3
WEIRD_PORT                               1        29
COMMUNIGATE                              1         1
HTML_EMBEDS                              1         6
TRACE_BY_SSN                             2         1
UNNEEDED_HTML_ENCODING                   2         1
LOTS_OF_CC_LINES                         2         1
URGENT_BIZ                               2         1
FROM_NO_USER                             2         3
FREE_PRIORITY_MAIL                       2         1
X_PMFLAGS_PRESENT                        2         1
CASHCASHCASH                             2        17         1
X_ENC_PRESENT                            3         3
GREEN_EXCUSE_2                           3         1
SECTION_301                              3        26
AUTO_EMAIL_REMOVAL                       3         1
GREEN_EXCUSE_1                           3         1
IMPOTENCE                                3         1
DONT_DELETE                              3         4
DIFFERENT_REPLY_TO                       3       117        13
MYCASINOBUILDER                          3         1
MAILMAN_CONFIRM                 *        4                   1
UNSUB_SCRIPT                             4        10
NIGERIAN_SCAM_2                          4         1
WORK_AT_HOME                             4        30        18
PORN_9                                   4         2
FROM_MALFORMED                           4         2
UNDISC_RECIPS                            4         4
US_DOLLARS                               4         2
MIME_SUSPECT_NAME                        5        12         2
NEW_DOMAIN_EXTENSIONS                    5         2
THE_FOLLOWING_FORM                       5         4         1
SERIOUS_ONLY                             5         2
BILLION_DOLLARS                          5         5
PORN_12                                  5        11         2
MASS_EMAIL                               5         5
INCREASE_SALES                           5         2
WWW_REMOVEYOU_COM                        5         3
US_DOLLARS_2                             5        30
CHARSET_FARAWAY_BODY            *        6         3
COPY_ACCURATELY                          6         7
PENNIES_A_DAY                            6         2
TAKE_ACTION_NOW                          6         3
EJACULATION                              6         2
POST_IN_RCVD                             6         2
NOT_INTENDED                             7         4
EXCUSE_15                                7        20         1
SUBJ_REMOVE                              7       120         4
NUMERIC_HTTP_ADDR                        7         3
EXCUSE_13                                7         4
FOR_FREE                                 7        56        20
X_PRECEDENCE_REF_PRESENT                 7        11
LARGE_HEX                                7        11         4
PROFITS                                  7        17
SUBJ_2_CREDIT                            8         3
MSGID_CHARS_WEIRD                        9         7         1
RCVD_IN_VISI                    *       10        10
YOU_HAVE_BEEN_SELECTED                  10         4
KNOWN_BAD_DIALUPS                       10        50        42
STOCK_PICK                              10         4
WANTS_CREDIT_CARD                       10         7
PURE_PROFIT                             11         4
REPLY_REMOVE_SUBJECT                    11        39
URI_IS_POUND                            11        20         7
CALL_FREE                               11        87        76
READ_TO_END                             11         3
S_1618                                  11         5
HTTP_USERNAME_USED                      11        58
MICRO_CAP_WARNING                       11         3
MAILTO_LINK                             11       300        28
TO_MALFORMED                            12        39
EXCUSE_17                               13         5
LIMITED_TIME_ONLY                       13        16
NO_CATCH                                13         4
NONEXISTENT_CHARSET                     13         7
FRIEND_AT_PUBLIC                        14         8
SUBJ_ENDS_IN_Q_MARK                     14        55       349
FOR_JUST_SOME_AMT                       14        19
CBYI                                    15         6
TONER                                   15         5
TO_NO_USER                              15         8
PLING_PLING                             16        49         7
JODY                                    17         6
ONE_TIME_MAILING                        18         7
SPAM_REDIRECTOR                         20        10
PRINT_FORM_SIGNATURE                    20        11
CASINO                                  20        15         2
FAKED_IP_IN_RCVD                        20        19
ASKS_BILLING_ADDRESS                    21         8
FREE_MONEY                              21        21
ADVERT_CODE                             21        20
EMAIL_MARKETING                         21        30
GREAT_OFFER                             21        16
INVESTOR_SPEC_SHEET                     22        18
ADDRESSES_ON_CD                         22         6
FOR_INSTANT_ACCESS                      22         9
SEE_FOR_YOURSELF                        25        10
RISK_FREE                               25        12
SUBJ_MISSING                            26        15         4
EXCUSE_1                                27        14         2
COPY_DVDS                               27        11         1
PORN_14                                 27        82        13
EXCUSE_4                                27        32
PARA_A_2_C_OF_1618                      27        13
KIFF                                    28         9
X_EM_VER_PRESENT                        29        35
SENT_IN_COMPLIANCE                      29        17
SOCIAL_SEC_NUMBER                       29        13
X_X_PRESENT                             30        30
EXCUSE_10                               30        54
BULK_EMAIL                              30        27
ALL_NATURAL                             30        32         2
VJESTIKA                                30        11
FROM_STARTS_WITH_NUMS                   30        24
DEAR_FRIEND                             33        17         1
RCVD_IN_RELAYS_ORDB_ORG         *       34        17
YOUR_INCOME                             34         9
FULL_REFUND                             34        12
EXCUSE_12                               34        12
GENTLE_FEROCITY                         34        14
MONEY_MAKING                            34        14
PORN_11                                 35        48         3
MSGID_HAS_NO_AT                         35        28
REALLY_UNSAFE_JAVASCRIPT                36        11
BUGGY_CGI                               36         9
CHARSET_FARAWAY                         37        18
GAPPY_SUBJECT                           37        14
SUSPICIOUS_CC_RECIPS                    37        34        19
PORN_4                                  37        28         1
DIRECT_EMAIL                            38        17
FORGED_HOTMAIL_RCVD                     39        76         1
AMAZING                                 40        25
PRODUCED_AND_SENT_OUT                   41        10
MSGID_SPAMSIGN_1                        42        15
DOMAIN_BODY                             43         9
NO_COST                                 43        49         7
THIS_AINT_SPAM                          44        31         1
CLICK_TO_REMOVE_2                       44        17
SMTPD_IN_RCVD                           45        37         1
FROM_NAME_EQ_FROM_ADDR                  45        20         1
FORGED_GW05_RCVD                        48        17
SUBJ_HAS_Q_MARK                         49        82        34
PENIS_ENLARGE2                          51        14
LINES_OF_YELLING_2                      52       125        35
FREE_CONSULTATION                       52        13
HOME_EMPLOYMENT                         53        26
MONEY_BACK                              53        36
CHARSET_FARAWAY_HEADERS                 54        30
INCREASE_TRAFFIC                        54        21         1
CALL_NOW                                54        25
CHECK_OR_MONEY_ORDER                    57        17
STRONG_BUY                              57        15
MAY_BE_FORGED                           59        67        23
EXCUSE_7                                59        45         2
NO_REAL_NAME                            59       666       572
FORM_W_MAILTO_ACTION                    59        20
WE_HATE_SPAM                            60        20
JAVASCRIPT                              61        37         1
UNSUB_PAGE                              63        18
HTTP_ESCAPED_HOST                       64        36         1
AS_SEEN_ON                              64        30
EARN_PER_WEEK                           65        14
FROM_BTAMAIL                            67        21
PORN_3                                  69       167        52
PORN_13                                 71        17
RCVD_IN_RFCI                    *       72       158        14
MIME_MISSING_BOUNDARY                   73        73
FORGED_EUDORAMAIL_RCVD                  73        29
MISSING_HEADERS                         77        99        16
LINES_OF_YELLING                        77       259        88
INVALID_MSGID                           77        86        35
EXCUSE_14                               81        71
TO_EMPTY                                83        33
EXCUSE_16                               86        70         6
BILL_1618                               91        19
REMOVAL_INSTRUCTIONS                    94        25
US_DOLLARS_3                            99        37
STOCK_ALERT                             99        27
WE_HONOR_ALL                           104        23
GUARANTEE                              109        59         1
INVALID_DATE                           110        76         3
NO_MX_FOR_FROM                  *      120       166        99
BASE64_ENC_TEXT                        122        85
VIAGRA                                 123        53         1
MSGID_CHARS_SPAM                       124        83
ONE_HUNDRED_PC_GUAR                    127        29
PLING                                  127       318        83
MORTGAGE_RATES                         132        30
X_OSIRU_SPAMWARE_SITE           *      135        27
OPT_IN                                 136        65
FROM_NAME_NO_SPACES                    140       348        67
MIME_ODD_CASE                          141       141
MSG_ID_ADDED_BY_MTA_3                  142       170        51
ONE_HUNDRED_PC_FREE                    143        60
MAILTO_TO_SPAM_ADDR                    154       122         1
REMOVE_IN_QUOTES                       164        85
DATE_IN_FUTURE                         169        73
MAILTO_WITH_SUBJ_REMOVE                177        95
UNDESIRED_LANGUAGE_BODY         *      178        90         1
BUGZILLA_BUG                           188                  94
MAILTO_TO_REMOVE                       191       143
FROM_HAS_MIXED_NUMS                    203       105         1
HTTP_WITH_EMAIL_IN_URL                 215        53
ROUND_THE_WORLD                 *      216        72
NO_OBLIGATION                          232        91
SUBJ_FULL_OF_8BITS                     241        77
FORGED_YAHOO_RCVD                      243       125         3
SUBJ_ALL_CAPS                          260       148        13
MSG_ID_ADDED_BY_MTA_2                  324       139         4
SUBJ_HAS_UNIQ_ID                       325       161         1
RCVD_IN_ORBS                    *      346       346
X_OSIRU_SPAM_SRC                *      360       122         2
REMOVE_SUBJ                            379       163         1
FRONTPAGE                              415        87
FAKED_UNDISC_RECIPS                    415       121
UNIFIED_PATCH                   *      440                  88
FROM_ENDS_IN_NUMS                      453       450         9
INVALID_DATE_TZ_ABSURD                 484       228
CLICK_HERE_LINK                        520       292         1
REMOVE_PAGE                            633       181
NORMAL_HTTP_TO_IP                      676       227         3
EXCUSE_3                               802       294         2
SUBJ_HAS_SPACES                        816       298
RCVD_IN_OSIRUSOFT_COM           *      832       466        50
CLICK_BELOW                            905       598         2
BIG_FONT                               984       496        24
CTYPE_JUST_HTML                       1968       629         5
IN_REP_TO                             6828                1541
USER_IN_WHITELIST               *    75000         2       752
