Last night, I wrote a small Perl script to gauge the "goodness" of each rule and its score. The intent is to identify rules that need to be re-examined.
If a positively-scored rule matches a spam, its goodness goes up by its score, but if it matches a non-spam, its goodness goes down by its score. The inverse is true for negatively-scored rules. You can weight false positives if you want (I didn't in the table below).

Some interesting points:

(1) I tried exempting all company-internal messages, since that's what I do with my real mail (I don't exempt them when developing new rules, though, since some people might run internal mail through SA). However, when I didn't exempt them, the results for some negative tests changed, always in the same direction:

                         all messages   internal excluded
  FROM_AND_TO_SAME       bad            good
  VERY_SUSP_CC_RECIPS    bad            good
  VERY_SUSP_RECIPS       bad            good
  X_NOT_PRESENT          bad            good
  X_PRIORITY_HIGH        bad            good

A clearer way to put it: the above tests don't work well if you run your local mail through them. That makes sense, too:

  FROM_AND_TO_SAME - I mail myself notes.
  VERY_SUSP_RECIPS and VERY_SUSP_CC_RECIPS - people use large internal To and Cc lists all the time.
  X_NOT_PRESENT - Unix types sending quick notes via mailx, and some internal "one-off" tools/programs, don't add any X- headers.
  X_PRIORITY_HIGH - I'm less certain about this one; I guess people aren't shy about using it internally?

(The table below covers everything, both internal and external.)

(2) Some "mailing lists are good" tests did not help me much, since I have been put on more than one semi-legit mailing list without anyone asking me first. This could become a problem as more spammers start sending spam that looks like mailing-list mail.

(3) Why are these negative?

  MAILTO_WITH_SUBJ
  HTML_WITH_BGCOLOR
  SLIGHTLY_UNSAFE_JAVASCRIPT
  OPPORTUNITY
  ALL_CAPS_HEADER
  [ and more ]

It seems they've earned negative scores even though they are clearly spam detectors. Rules that aren't effective but are not intended to detect legitimate mail should be scored at 0.0, not negatively.

(4) Some rules match so rarely that they might as well be deleted.
Maybe they match for someone else...

(5) Clearly, a few rules need to be fixed, X_AUTH_WARNING especially. Even when I exempt internal mail, this one is still the worst of the lot.

Notes:

There are a few experimental/modified rules in here that aren't yet in the CVS tree:

  INVALID_DATE
  MIME_MISSING_BOUNDARY
  MIME_SUSPECT_NAME
  UNDESIRED_LANGUAGE_BODY
  X_PRECEDENCE_REF_PRESENT

And two that are gone:

  INVALID_DATE_NO_TZ
  INVALID_DATE_ODD_MONTH

These are sorted from worst to best. Rules near the top need to be reexamined, IMO. Some may need the sign of their score reversed. Some may just need to be removed. This is all very me-ish, of course.

Scores that are not GA-evolved have a "*" next to them. A row with only one count matched only spams or only non-spams.

Table:

rule                          goodness  spams   non-spams
---------------------------------------------------------
X_AUTH_WARNING                -1217     57      1274
COPYRIGHT_CLAIMED             -130      89      6
WEB_BUGS                      -125      153     1
LINES_OF_YELLING_3            -74       68      19
GAPPY_TEXT                    -61       61      11
MAILTO_WITH_SUBJ              -47       154
HTML_WITH_BGCOLOR             -42       77
SLIGHTLY_UNSAFE_JAVASCRIPT    -33       43      1
SIGNATURE_DELIM               -32       1       44
X_NOT_PRESENT                 -32       401     465
OPPORTUNITY                   -21       21
FROM_AND_TO_SAME              -19       74      96
PLEASE_READ                   -15       2       18
PGP_SIGNATURE                 -14       12      5
RELAYING_FRAME                -12       23      1
X_MSMAIL_PRIORITY_HIGH        -12       17      8
X_PRIORITY_HIGH               -11       24      31
DEAR_SOMEBODY                 -11       66      42
TO_BE_REMOVED_REPLY           -10       5
ALL_CAPS_HEADER               -7        26
VERY_SUSP_CC_RECIPS           -6        71      75
MIME_NULL_BLOCK               -5        36
VERY_SUSP_RECIPS              -4        55      57
REPLY_TO_EMPTY                -4        1       2
NO_QS_ASKED                   -3        5
NO_EXPERIENCE                 -3        3
RATWARE                       -2        3
EXCUSE_6                      -2        19
TO_UNSUB_REPLY                -1        1
JAVASCRIPT_URI                -1        2       1
DOMAIN_SUBJECT                -1        1       3
PORN_10                       0         80      87
SUPERLONG_LINE                0         128     127
SUSPICIOUS_RECIPS             0         31      10
REAL_THING                    0         2
TO_LOCALPART_EQ_REAL          0         12      12
MSG_ID_ADDED_BY_MTA           0         1       1
BE_AMAZED                     0         2       2
ONCE_IN_LIFETIME              0         1
HTTP_NUMBER_WORD              0         1
HR_3113                       0         1
AOL_USERS_LINK                0         1
ONLINE_BIZ_OPS              * 1         1
X_MAIL_ID_PRESENT             1         1
ASCII_FORM_ENTRY              1         34      6
MURKOWSKI_CRUFT             * 1         1
NO_DISSAPOINTMENT             1         1
FORGED_RCVD_FOUND           * 1         3
WEIRD_PORT                    1         29
COMMUNIGATE                   1         1
HTML_EMBEDS                   1         6
TRACE_BY_SSN                  2         1
UNNEEDED_HTML_ENCODING        2         1
LOTS_OF_CC_LINES              2         1
URGENT_BIZ                    2         1
FROM_NO_USER                  2         3
FREE_PRIORITY_MAIL            2         1
X_PMFLAGS_PRESENT             2         1
CASHCASHCASH                  2         17      1
X_ENC_PRESENT                 3         3
GREEN_EXCUSE_2                3         1
SECTION_301                   3         26
AUTO_EMAIL_REMOVAL            3         1
GREEN_EXCUSE_1                3         1
IMPOTENCE                     3         1
DONT_DELETE                   3         4
DIFFERENT_REPLY_TO            3         117     13
MYCASINOBUILDER               3         1
MAILMAN_CONFIRM             * 4         1
UNSUB_SCRIPT                  4         10
NIGERIAN_SCAM_2               4         1
WORK_AT_HOME                  4         30      18
PORN_9                        4         2
FROM_MALFORMED                4         2
UNDISC_RECIPS                 4         4
US_DOLLARS                    4         2
MIME_SUSPECT_NAME             5         12      2
NEW_DOMAIN_EXTENSIONS         5         2
THE_FOLLOWING_FORM            5         4       1
SERIOUS_ONLY                  5         2
BILLION_DOLLARS               5         5
PORN_12                       5         11      2
MASS_EMAIL                    5         5
INCREASE_SALES                5         2
WWW_REMOVEYOU_COM             5         3
US_DOLLARS_2                  5         30
CHARSET_FARAWAY_BODY        * 6         3
COPY_ACCURATELY               6         7
PENNIES_A_DAY                 6         2
TAKE_ACTION_NOW               6         3
EJACULATION                   6         2
POST_IN_RCVD                  6         2
NOT_INTENDED                  7         4
EXCUSE_15                     7         20      1
SUBJ_REMOVE                   7         120     4
NUMERIC_HTTP_ADDR             7         3
EXCUSE_13                     7         4
FOR_FREE                      7         56      20
X_PRECEDENCE_REF_PRESENT      7         11
LARGE_HEX                     7         11      4
PROFITS                       7         17
SUBJ_2_CREDIT                 8         3
MSGID_CHARS_WEIRD             9         7       1
RCVD_IN_VISI                * 10        10
YOU_HAVE_BEEN_SELECTED        10        4
KNOWN_BAD_DIALUPS             10        50      42
STOCK_PICK                    10        4
WANTS_CREDIT_CARD             10        7
PURE_PROFIT                   11        4
REPLY_REMOVE_SUBJECT          11        39
URI_IS_POUND                  11        20      7
CALL_FREE                     11        87      76
READ_TO_END                   11        3
S_1618                        11        5
HTTP_USERNAME_USED            11        58
MICRO_CAP_WARNING             11        3
MAILTO_LINK                   11        300     28
TO_MALFORMED                  12        39
EXCUSE_17                     13        5
LIMITED_TIME_ONLY             13        16
NO_CATCH                      13        4
NONEXISTENT_CHARSET           13        7
FRIEND_AT_PUBLIC              14        8
SUBJ_ENDS_IN_Q_MARK           14        55      349
FOR_JUST_SOME_AMT             14        19
CBYI                          15        6
TONER                         15        5
TO_NO_USER                    15        8
PLING_PLING                   16        49      7
JODY                          17        6
ONE_TIME_MAILING              18        7
SPAM_REDIRECTOR               20        10
PRINT_FORM_SIGNATURE          20        11
CASINO                        20        15      2
FAKED_IP_IN_RCVD              20        19
ASKS_BILLING_ADDRESS          21        8
FREE_MONEY                    21        21
ADVERT_CODE                   21        20
EMAIL_MARKETING               21        30
GREAT_OFFER                   21        16
INVESTOR_SPEC_SHEET           22        18
ADDRESSES_ON_CD               22        6
FOR_INSTANT_ACCESS            22        9
SEE_FOR_YOURSELF              25        10
RISK_FREE                     25        12
SUBJ_MISSING                  26        15      4
EXCUSE_1                      27        14      2
COPY_DVDS                     27        11      1
PORN_14                       27        82      13
EXCUSE_4                      27        32
PARA_A_2_C_OF_1618            27        13
KIFF                          28        9
X_EM_VER_PRESENT              29        35
SENT_IN_COMPLIANCE            29        17
SOCIAL_SEC_NUMBER             29        13
X_X_PRESENT                   30        30
EXCUSE_10                     30        54
BULK_EMAIL                    30        27
ALL_NATURAL                   30        32      2
VJESTIKA                      30        11
FROM_STARTS_WITH_NUMS         30        24
DEAR_FRIEND                   33        17      1
RCVD_IN_RELAYS_ORDB_ORG     * 34        17
YOUR_INCOME                   34        9
FULL_REFUND                   34        12
EXCUSE_12                     34        12
GENTLE_FEROCITY               34        14
MONEY_MAKING                  34        14
PORN_11                       35        48      3
MSGID_HAS_NO_AT               35        28
REALLY_UNSAFE_JAVASCRIPT      36        11
BUGGY_CGI                     36        9
CHARSET_FARAWAY               37        18
GAPPY_SUBJECT                 37        14
SUSPICIOUS_CC_RECIPS          37        34      19
PORN_4                        37        28      1
DIRECT_EMAIL                  38        17
FORGED_HOTMAIL_RCVD           39        76      1
AMAZING                       40        25
PRODUCED_AND_SENT_OUT         41        10
MSGID_SPAMSIGN_1              42        15
DOMAIN_BODY                   43        9
NO_COST                       43        49      7
THIS_AINT_SPAM                44        31      1
CLICK_TO_REMOVE_2             44        17
SMTPD_IN_RCVD                 45        37      1
FROM_NAME_EQ_FROM_ADDR        45        20      1
FORGED_GW05_RCVD              48        17
SUBJ_HAS_Q_MARK               49        82      34
PENIS_ENLARGE2                51        14
LINES_OF_YELLING_2            52        125     35
FREE_CONSULTATION             52        13
HOME_EMPLOYMENT               53        26
MONEY_BACK                    53        36
CHARSET_FARAWAY_HEADERS       54        30
INCREASE_TRAFFIC              54        21      1
CALL_NOW                      54        25
CHECK_OR_MONEY_ORDER          57        17
STRONG_BUY                    57        15
MAY_BE_FORGED                 59        67      23
EXCUSE_7                      59        45      2
NO_REAL_NAME                  59        666     572
FORM_W_MAILTO_ACTION          59        20
WE_HATE_SPAM                  60        20
JAVASCRIPT                    61        37      1
UNSUB_PAGE                    63        18
HTTP_ESCAPED_HOST             64        36      1
AS_SEEN_ON                    64        30
EARN_PER_WEEK                 65        14
FROM_BTAMAIL                  67        21
PORN_3                        69        167     52
PORN_13                       71        17
RCVD_IN_RFCI                * 72        158     14
MIME_MISSING_BOUNDARY         73        73
FORGED_EUDORAMAIL_RCVD        73        29
MISSING_HEADERS               77        99      16
LINES_OF_YELLING              77        259     88
INVALID_MSGID                 77        86      35
EXCUSE_14                     81        71
TO_EMPTY                      83        33
EXCUSE_16                     86        70      6
BILL_1618                     91        19
REMOVAL_INSTRUCTIONS          94        25
US_DOLLARS_3                  99        37
STOCK_ALERT                   99        27
WE_HONOR_ALL                  104       23
GUARANTEE                     109       59      1
INVALID_DATE                  110       76      3
NO_MX_FOR_FROM              * 120       166     99
BASE64_ENC_TEXT               122       85
VIAGRA                        123       53      1
MSGID_CHARS_SPAM              124       83
ONE_HUNDRED_PC_GUAR           127       29
PLING                         127       318     83
MORTGAGE_RATES                132       30
X_OSIRU_SPAMWARE_SITE       * 135       27
OPT_IN                        136       65
FROM_NAME_NO_SPACES           140       348     67
MIME_ODD_CASE                 141       141
MSG_ID_ADDED_BY_MTA_3         142       170     51
ONE_HUNDRED_PC_FREE           143       60
MAILTO_TO_SPAM_ADDR           154       122     1
REMOVE_IN_QUOTES              164       85
DATE_IN_FUTURE                169       73
MAILTO_WITH_SUBJ_REMOVE       177       95
UNDESIRED_LANGUAGE_BODY     * 178       90      1
BUGZILLA_BUG                  188       94
MAILTO_TO_REMOVE              191       143
FROM_HAS_MIXED_NUMS           203       105     1
HTTP_WITH_EMAIL_IN_URL        215       53
ROUND_THE_WORLD             * 216       72
NO_OBLIGATION                 232       91
SUBJ_FULL_OF_8BITS            241       77
FORGED_YAHOO_RCVD             243       125     3
SUBJ_ALL_CAPS                 260       148     13
MSG_ID_ADDED_BY_MTA_2         324       139     4
SUBJ_HAS_UNIQ_ID              325       161     1
RCVD_IN_ORBS                * 346       346
X_OSIRU_SPAM_SRC            * 360       122     2
REMOVE_SUBJ                   379       163     1
FRONTPAGE                     415       87
FAKED_UNDISC_RECIPS           415       121
UNIFIED_PATCH               * 440       88
FROM_ENDS_IN_NUMS             453       450     9
INVALID_DATE_TZ_ABSURD        484       228
CLICK_HERE_LINK               520       292     1
REMOVE_PAGE                   633       181
NORMAL_HTTP_TO_IP             676       227     3
EXCUSE_3                      802       294     2
SUBJ_HAS_SPACES               816       298
RCVD_IN_OSIRUSOFT_COM       * 832       466     50
CLICK_BELOW                   905       598     2
BIG_FONT                      984       496     24
CTYPE_JUST_HTML               1968      629     5
IN_REP_TO                     6828      1541
USER_IN_WHITELIST           * 75000     2       752
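For anyone who wants to reproduce the numbers, here is a minimal Python sketch of the goodness calculation described at the top of this message. It is not the Perl script itself. The function name rule_goodness, the (is_spam, rules_hit) input shape, and the fp_weight parameter are assumptions made for illustration, as is the choice to apply the weight only to positively-scored rules that hit non-spam. The score of 1.0 used for X_AUTH_WARNING is not stated anywhere above; it is simply the value consistent with that rule's table row (57 spams, 1274 non-spams, goodness -1217).

```python
from collections import defaultdict


def rule_goodness(messages, scores, fp_weight=1.0):
    """Return {rule: goodness} for a hand-classified corpus.

    messages:  iterable of (is_spam, rules_hit) pairs (hypothetical input shape)
    scores:    {rule_name: score}
    fp_weight: assumed multiplier for a positively-scored rule hitting non-spam
    """
    goodness = defaultdict(float)
    for is_spam, rules_hit in messages:
        for rule in rules_hit:
            score = scores[rule]
            if is_spam:
                # Spam hit: positive scores gain goodness, negative scores lose it.
                goodness[rule] += score
            elif score > 0:
                # Non-spam hit by a spam-detecting rule: a false-positive
                # contribution, optionally weighted more heavily.
                goodness[rule] -= score * fp_weight
            else:
                # Non-spam hit by a negatively-scored rule: goodness goes up.
                goodness[rule] -= score
    return dict(goodness)


# Rebuild one table row from its counts, assuming a score of 1.0.
corpus = [(True, ["X_AUTH_WARNING"])] * 57 + [(False, ["X_AUTH_WARNING"])] * 1274
print(rule_goodness(corpus, {"X_AUTH_WARNING": 1.0}))  # {'X_AUTH_WARNING': -1217.0}
```

With fp_weight left at 1.0 this reduces to score * (spams - non-spams) per rule, which is why a rule that hits mostly non-spam, like X_AUTH_WARNING, sinks to the top of the list.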