-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Per the FAQ, when someone has rules that could be added to the SA distribution,

> Next you should ... post the test rules to [the] list, and ask others
> to check against their corpora. List members may also suggest
> improvements to the tests.

I've begun to make some progress in comparing my personal rules to 2.60,
and am submitting what I consider the most promising for consideration
for inclusion in the distribution rule set.

I've gotten through my "to", "from", and "subject" rules so far. Others
will follow.

Please let me know if you have problems or suggestions for these rules.
If you are able to run these against your own corpus, please do and let
me know what results you get. (My scores are set for a 9.0 spam
threshold -- YMMV.)

After about a week of feedback, I'll submit those that still seem
worthwhile for formal consideration by the development team via
bugzilla, as instructed by the FAQ.

Many thanks.

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBP8cCv5ebK8E4qh1HEQJ4agCfRX7xZF2kblOdGgY0DbmLDnzE6rUAnipn
vywLD1E8k/WMCLH/OAy/VCjZ
=cQfJ
-----END PGP SIGNATURE-----

header RM_tz_Insurance ToCc =~ /Insurance/i
describe RM_tz_Insurance Addressed to "Insurance" department or person
score RM_tz_Insurance 1.18 # 29s/1h of 53752 corpus

header RM_fa_BaliHotels From =~ /\bbali.{0,30}\.com/i
describe RM_fa_BaliHotels From a probable spammer
score RM_fa_BaliHotels 7.00 # 19s/0h of 58857 corpus
# spamming travel agent uses multiple www.bali-hotel.com and related
# domain names in From and in URIs

header RM_fw_Amazing From =~ /amazing/i
describe RM_fw_Amazing Sender name includes the word: Amazing
score RM_fw_Amazing 1.34 # 34 spam, 0 ham, Sep 12 2003; 12s/0h of 58857 corpus

header RM_fw_Phrase1 From =~ /\w+_\w+_\w+_\w+_/i
describe RM_fw_Phrase1 Sender name appears to be phrase rather than name
score RM_fw_Phrase1 1.29 # 29s/0h of 58857 corpus

header RM_fw_LeadingPrep From =~ /^(?:A|About|All|An|And|Any|As|At|Be|Best|Bulk|Cash|Earn|Easy|Fast|Find|For|Free|From|Get|Hi|Home|In|Instant|Is|It|Its|Limited|Lose|Love|Make|Need|New|No|Save|Sex|She|Special|Stock|Stop|Take|Test|There|This|To|Try|Want|We|What|Where|Why|You|Your)[_ ]/i
describe RM_fw_LeadingPrep From begins with preposition or similar word, a, all, any, free, get
score RM_fw_LeadingPrep 2.00 # 637 spam, 1 ham, Aug 3 2003; 273s/2h of 58857 corpus

header RM_fw_Vword From =~ /Vword/i
# Grunged to get past list filter -- replace with original V-word before using.
describe RM_fw_Vword Sender name contains a known spam word
score RM_fw_Vword 1.11 # 11 spam, 0 ham, Oct 1 2003; 7s/0h of 58857 corpus

header RM_fl_ConsWord9 From =~ /\b[bcghjklmnpqrstvwxz]{9,20}\b/
describe RM_fl_ConsWord9 From contains word consisting of consecutive consonants
score RM_fl_ConsWord9 1.740 # 74 spam, 1 ham, Oct 25 2003; 41s/0h of 58857 corpus

header RM_ft_Noname From =~ /"" \</i
describe RM_ft_Noname Sender has blanked out name
score RM_ft_Noname 3.00 # 913s/3h of 58857 corpus

header CMO_RM_sp_AdultMovie Subject =~ /[EMAIL PROTECTED]|\xA3][\W_]?t ?(?:\/\\\/\\|\/V\\|rn|[m])[\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[vu][\W_]?[il1:\|\*\xCC-\xCF\xEC-\xEF][\W_]?[e3\*\xC8-\xCB\xE8-\xEB]/i
describe CMO_RM_sp_AdultMovie Subject mentions adult movie(s)
score CMO_RM_sp_AdultMovie 1.670 # 65s/0h of 58857 corpus
# simpler regex which hits the same 65/0: /[EMAIL PROTECTED] m[o0]vie/i

header CMO_RM_sp_BannedCD Subject =~ /[EMAIL PROTECTED]/i
describe CMO_RM_sp_BannedCD Subject mentions the supposedly banned CD
score CMO_RM_sp_BannedCD 1.540 # 54s/0h of 58857 corpus
# All 54 of this spam is caught by the much simpler /b\s?a\s?n\s?n\s?e\s?d\s?c\s?d/i

header RM_sp_CopyDVD Subject =~ /(?:c[o0]py dvd|dvd.{1,15}c[o0]py|dvd magic)/i
describe RM_sp_CopyDVD Subject mentions copying DVDs
score RM_sp_CopyDVD 2.340 # 134s/0h of 58857 corpus

header RM_sp_FindYour Subject =~ /find your/i
describe RM_sp_FindYour Subject suggests you find something
score RM_sp_FindYour 1.160 # 16s/0h of 58857 corpus

header RM_sp_FreePPV Subject =~ /free [EMAIL PROTECTED] -]?per[ -]?view/i
describe RM_sp_FreePPV Subject mentions free pay-per-view
score RM_sp_FreePPV 2.430 # 143s/0h of 58857 corpus

header CMO_RM_sp_GiftCard Subject =~ /[g6][\W_]?[il1:\|\*\xCC-\xCF\xEC-\xEF][\W_]?f[\W_]?t [EMAIL PROTECTED]/i
describe CMO_RM_sp_GiftCard Subject mentions a gift card
score CMO_RM_sp_GiftCard 1.690 # 69 spam, 0 ham, Aug 9 2003; 20s/0h of 58857 corpus

header RM_sp_LookingFor Subject =~ /(?:(?:We are|we're) looking for|looking for.{0,15}\W?(?:mortgage|love|customers|investment|employees|you|people|consultants|sex|someone|career|honest))\b/i
describe RM_sp_LookingFor Subject mentions looking for something
score RM_sp_LookingFor 1.380 # 38 spam, 0 ham, Aug 6 2003; 14s/0h of 58857 corpus

header RM_sp_RuinAnyone Subject =~ /ruin anyone anywhere/i
describe RM_sp_RuinAnyone Subject suggests you can ruin anyone anywhere
score RM_sp_RuinAnyone 1.240 # 24 spam, 0 ham, Aug 31 2003; 23s/0h of 58857 corpus

header RM_sp_TONER Subject =~ /\b(?:printer[-\s]*)?(?:[EMAIL PROTECTED])?(?:t[o0]ner|ink(?:[-\s]*jet)?|[EMAIL PROTECTED]|copier)[-\s]+(?:[EMAIL PROTECTED]|supply)/i
describe RM_sp_TONER Subject contains Toner or Ink Cartridge
score RM_sp_TONER 1.710 # 72 spam, 0 ham, Aug 20 2003; 61s/0h of 58857 corpus
# Many people have contributed to this regex over the months...

header RE_RM_sp_TooHigh Subject =~ /too high/i
describe RE_RM_sp_TooHigh Too high in subject
score RE_RM_sp_TooHigh 1.580 # 58s/0h of 58857 corpus

header RM_sp_WillAstonish Subject =~ /Will Astonish You/i
describe RM_sp_WillAstonish Subject says this will astonish you.
score RM_sp_WillAstonish 1.170 # 17 spam, 0 ham, Aug 19 2003; 6s/0h of 58857 corpus

header CMO_RM_spd_GetPaid Subject =~ /[g6][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?t [EMAIL PROTECTED]:\|\*\xCC-\xCF\xEC-\xEF][\W_]?[d\xD0]/i
describe CMO_RM_spd_GetPaid Subject mentions getting paid for something
score CMO_RM_spd_GetPaid 1.640 # 64s/0h of 58857 corpus

header RM_spd_Money Subject =~ /(?:(?:save|make)[ -].{0,15}money[ -](?:in|on)|(?:free|grant|saving|with our|(?:claim|keep) your) money|money machine)/i
describe RM_spd_Money Subject mentions money in phrase that implies spam
score RM_spd_Money 1.380 # 76s/1h of 58857 corpus

header RM_spd_StockMarket Subject =~ /STOCK MARKET/i
describe RM_spd_StockMarket Subject mentions a/the STOCK MARKET
score RM_spd_StockMarket 3.000 # 210s/0h of 58857 corpus

header RM_spd_WorthCash Subject =~ /\b(?:Worth|Win|take|extra|earn|dollars|Short|need|claim|free|get|opinions?|surveys?)\b.{0,15}(?:fast)?(?:[EMAIL PROTECTED]|M[0o]ney)\b/i
describe RM_spd_WorthCash Subject mentions something is worth cash
score RM_spd_WorthCash 3.000 # 209s/0h of 58857 corpus

header CMO_RM_spe_BiggerMember Subject =~ /[b8\xDF][\W_]?[il1:\|\*\xCC-\xCF\xEC-\xEF][\W_]?[g6][\W_]?[g6][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[r\xAE][\W_]?(?:\/\\\/\\|\/V\\|rn|[m])[\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?(?:\/\\\/\\|\/V\\|rn|[m])[\W_]?[b8\xDF][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[r\xAE]/i
describe CMO_RM_spe_BiggerMember Subject mentions bigger body part
score CMO_RM_spe_BiggerMember 1.140 # 14s/0h of 58857 corpus
# Unlike the Banned CD rule, where my simpler regex caught all the spam that Chris' CMO rule catches,
# this CMO rule catches twice as much spam as my original rule.
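
For anyone who wants to see the plain-versus-CMO difference on a few subjects before running a full corpus check, here is a rough Python sketch. It is only an illustration: the character classes are copied from the CMO_RM_spe_BiggerMember rule above, but the `CLASSES` table, the `tolerant` and `hits` helpers, and the sample subjects are all made up for this example, and this is not the generator used to produce the CMO rules. Python's `re` is close enough to Perl's for these particular patterns, though real testing should be done inside SpamAssassin.

```python
import re

# Character classes copied from the CMO_RM_spe_BiggerMember rule.
# The table and helpers below are hypothetical, for illustration only.
CLASSES = {
    "b": r"[b8\xDF]",
    "i": r"[il1:\|\*\xCC-\xCF\xEC-\xEF]",
    "g": r"[g6]",
    "e": r"[e3\*\xC8-\xCB\xE8-\xEB]",
    "r": r"[r\xAE]",
}
SEP = r"[\W_]?"  # at most one non-word character or underscore between letters


def tolerant(word):
    """Build an obfuscation-tolerant regex for `word` from CLASSES."""
    return re.compile(SEP.join(CLASSES[c] for c in word), re.I)


def hits(regex, spam, ham):
    """Count matches and format them like the 'Ns/Mh' rule comments."""
    s = sum(1 for subj in spam if regex.search(subj))
    h = sum(1 for subj in ham if regex.search(subj))
    return f"{s}s/{h}h"


# Made-up sample subjects, not taken from any real corpus.
spam = ["Get BIGGER today", "b1gg3r results", "B.i.g.g.e.r now", "Cheap toner"]
ham = ["Meeting moved to Friday", "Re: disk quota"]

plain = re.compile(r"bigger", re.I)
print("plain:   ", hits(plain, spam, ham))               # 1s/0h
print("tolerant:", hits(tolerant("bigger"), spam, ham))  # 3s/0h
```

On these six subjects the tolerant pattern picks up the two obfuscated spellings that the plain one misses. Whether that gain holds up, and what it costs in ham hits, can only be judged against a real corpus.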

header RE_RM_spm_ImproveYour Subject =~ /improve your/i
describe RE_RM_spm_ImproveYour Subject suggests you improve something
score RE_RM_spm_ImproveYour 1.570 # 57s/0h of 58857 corpus

header CMO_RM_sw_boost Subject =~ /[b8\xDF][\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[s5\$\xA7][\W_]?t/i
describe CMO_RM_sw_boost boost in Subject
score CMO_RM_sw_boost 2.980 # 198s/1h of 58857 corpus

header CMO_RM_sw_Forever Subject =~ /\bf[\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[r\xAE][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[vu][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[r\xAE]\b/i
describe CMO_RM_sw_Forever Forever in Subject
score CMO_RM_sw_Forever 1.760 # 76s/0h of 58857 corpus

header RM_sw_ForWomen Subject =~ /Women:/i
describe RM_sw_ForWomen Subject appears to be for women as a class, therefore possible spam
score RM_sw_ForWomen 1.33 # 33 spam, 0 ham, Aug 19 2003; 16s/0h of 58857 corpus

header RM_sw_MBA Subject =~ /\bMBA\b/i
describe RM_sw_MBA Subject mentions an MBA
score RM_sw_MBA 1.15 # 17s/0h of 58857 corpus
# Interestingly, this MBA rule, when run through the obfuscation system,
# matches no additional spam in my corpus, but does match two ham.

header RM_sw_Partnership Subject =~ /Partnership/i
describe RM_sw_Partnership Subject mentions Partnership
score RM_sw_Partnership 1.20 # 23s/0h of 58857 corpus

header CMO_RM_sw_Proven Subject =~ /\bp[\W_]?[r\xAE][\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[vu][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[n\xD1\xF1]\b/i
describe CMO_RM_sw_Proven Proven in Subject
score CMO_RM_sw_Proven 1.910 # 91s/0h of 58857 corpus

header RM_sw_SpecialBang Subject =~ /Special\!/i
describe RM_sw_SpecialBang Subject mentions a special!
score RM_sw_SpecialBang 1.185 # 37s/1h of 58857 corpus

header CMO_RM_sw_Timeshare Subject =~ /t[\W_]?[il1:\|\*\xCC-\xCF\xEC-\xEF][\W_]?(?:\/\\\/\\|\/V\\|rn|[m])[EMAIL PROTECTED]/i
describe CMO_RM_sw_Timeshare Subject mentions timeshare(s)
score CMO_RM_sw_Timeshare 1.200 # 20s/0h of 58857 corpus

header CMO_RM_swd_debt Subject =~ /[d\xD0][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[b8\xDF][\W_]?t/i
describe CMO_RM_swd_debt Subject mentions debt
score CMO_RM_swd_debt 3.000 # 541s/0h of 58857 corpus

header RM_swd_Foreclosure Subject =~ /Foreclosure/i
describe RM_swd_Foreclosure Subject mentions foreclosure(s)
score RM_swd_Foreclosure 1.19 # 20s/0h of 58857 corpus

header RM_swd_investors Subject =~ /investors/i
describe RM_swd_investors Subject mentions investors
score RM_swd_investors 2.116 # 116s/0h of 58857 corpus

header RM_swd_Paying Subject =~ /Paying/i
describe RM_swd_Paying Subject mentions Paying for something
score RM_swd_Paying 1.292 # 146s/4h of 58857 corpus

header CMO_RM_swd_MoneyBang Subject =~ /(?:\/\\\/\\|\/V\\|rn|[m])[\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[n\xD1\xF1][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[y\xA5\xDD\xFD]!/i
describe CMO_RM_swd_MoneyBang Subject mentions money with exclamation mark
score CMO_RM_swd_MoneyBang 1.200 # 20s/0h of 58857 corpus

header CMO_RM_swm_DrugsV Subject =~ /[vu][\W_]?[il1:\|[EMAIL PROTECTED]@\xC0-\xC5\xAA\xE0-\xE5]/i
describe CMO_RM_swm_DrugsV Subject mentions known spam subject
score CMO_RM_swm_DrugsV 20.00 # 1977s/0h of 58857 corpus

header CMO_RM_swm_Medication Subject =~ /(?:\/\\\/\\|\/V\\|rn|[m])[\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[d\xD0][\W_]?[il1:\|[EMAIL PROTECTED]:\|\*\xCC-\xCF\xEC-\xEF][\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[n\xD1\xF1]/i
describe CMO_RM_swm_Medication Subject mentions medication
score CMO_RM_swm_Medication 3.000 # 597s/1h of 58857 corpus

header RE_RM_swm_Meds Subject =~ /m(e|3)ds/i
describe RE_RM_swm_Meds Meds in subject
score RE_RM_swm_Meds 3.00 # 665s/0h of 58857 corpus
# This rule also hits some ham when extensively obfuscated, without greatly improving the spam hits.

header CMO_RM_swm_Younger Subject =~ /\b[y\xA5\xDD\xFD][\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[uv\*\xB5\xD9-\xDC\xF9-\xFC][\W_]?[n\xD1\xF1][\W_]?[g6][\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[r\xAE]\b/i
describe CMO_RM_swm_Younger Younger in Subject
score CMO_RM_swm_Younger 1.41 # 41s/0h of 58857 corpus

header RM_swp_porn1 Subject =~ /\bporn/i
describe RM_swp_porn1 Subject seems to be about porn
score RM_swp_porn1 1.460 # 46s/0h of 58857 corpus

header CMO_RM_swp_porn1 Subject =~ /(?!\bporn)\bp[\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[r\xAE][\W_]?[n\xD1\xF1]/i
describe CMO_RM_swp_porn1 Subject seems to be about porn
score CMO_RM_swp_porn1 1.280 # 28s/0h of 58857 corpus

header RM_swp_porn2 Subject =~ /\bfuck/i
describe RM_swp_porn2 Subject seems to be about porn
score RM_swp_porn2 0.900 # 18s/1h of 58857 corpus

header CMO_RM_swp_porn2 Subject =~ /(?!\bfuck)\bf[\W_]?[uv\*\xB5\xD9-\xDC\xF9-\xFC][\W_]?[c\xC7\xE7\xA2\xA9][\W_]?k/i
describe CMO_RM_swp_porn2 Subject seems to be about porn
score CMO_RM_swp_porn2 0.800 # 8s/0h of 58857 corpus

header RM_swt_ConsWord6 Subject =~ /\b[bcghjklmnpqrstvwxz]{6,20}\b/
describe RM_swt_ConsWord6 subject contains word consisting of consecutive consonants
score RM_swt_ConsWord6 3.000 # 550s/0h of 58857 corpus

header CMO_RM_swt_Masked02 Subject =~ /(?!\bLOSE\b)\b[l1I\|\xA3][\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[s5\$\xA7][\W_]?[e3\*\xC8-\xCB\xE8-\xEB]\b/i
describe CMO_RM_swt_Masked02 masked spam word(s) in subject
score CMO_RM_swt_Masked02 4.000 # 20s/0h of 58857 corpus

header CMO_RM_swt_Masked05 Subject =~ /(?!tion)t[\W_]?[il1:\|\*\xCC-\xCF\xEC-\xEF][\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[n\xD1\xF1]/i
describe CMO_RM_swt_Masked05 masked spam word(s) in subject
score CMO_RM_swt_Masked05 2.000 # 400s/3h of 58857 corpus -- ham: typo(2), word-word,

header CMO_RM_swt_Masked06 Subject =~ /(?!\bcheap(er)?)[EMAIL PROTECTED]([e3\*\xC8-\xCB\xE8-\xEB][\W_]?[r\xAE])?/i
describe CMO_RM_swt_Masked06 masked spam word(s) in subject
score CMO_RM_swt_Masked06 4.000 # 86s/0h of 58857 corpus

header CMO_RM_swt_Masked07 Subject =~ /(?!\bFor\b)\bf[\W_]?(?:\[\]|\(\)|[o0\*\xD2-\xD6\xF2-\xF6])[\W_]?[r\xAE]\b/i
describe CMO_RM_swt_Masked07 masked spam word(s) in subject
score CMO_RM_swt_Masked07 4.000 # 53s/0h of 58857 corpus

header CMO_RM_swt_Masked14 Subject =~ /(?!\bgirls?)\b[g6][\W_]?[il1:\|\*\xCC-\xCF\xEC-\xEF][\W_]?[r\xAE][\W_]?[l1I\|\xA3][\W_]?[s5\$\xA7]?/i
describe CMO_RM_swt_Masked14 masked spam word(s) in subject
score CMO_RM_swt_Masked14 4.000 # 27s/0h of 58857 corpus

header RM_swt_Masked19 Subject =~ /\bpenis\b/i
describe RM_swt_Masked19 masked spam word(s) in subject
score RM_swt_Masked19 3.000 # 344s/0h of 58857 corpus

header CMO_RM_swt_Masked19 Subject =~ /(?!\bpenis\b)\bp[\W_]?[e3\*\xC8-\xCB\xE8-\xEB][\W_]?[n\xD1\xF1][\W_]?[il1:\|\*\xCC-\xCF\xEC-\xEF][\W_]?[s5\$\xA7]\b/i
describe CMO_RM_swt_Masked19 masked spam word(s) in subject
score CMO_RM_swt_Masked19 4.000 # 111s/0h of 58857 corpus

header RM_sl_LettersNums Subject =~ /[a-z]{1,5}[0-9]{1,5}[a-z]{1,5}[0-9]{1,5}[a-z]{1,5}[0-9]{1,5}/
describe RM_sl_LettersNums Subject contains multiple mixed letters and numbers in one "word"
score RM_sl_LettersNums 2.930 # 135 spam, 2 ham (Dell), Sep 12 2003; 193s/0h of 58857 corpus

header RM_sl_RandomLetters2a Subject =~ /\b[cjnqstuvwxz][bgjqu]\b/i
describe RM_sl_RandomLetters2a Subject contains random-text spamsign
score RM_sl_RandomLetters2a 2.163 # 465s/3h of 58857 corpus

header RM_sl_RandomLetters2b Subject =~ /\be[bfjkopqv]\b/i
describe RM_sl_RandomLetters2b Subject contains random-text spamsign
score RM_sl_RandomLetters2b 1.71 # 71s/0h of 58857 corpus

header RM_sl_RandomLetters3a Subject =~ /\b[abehikmpqrsvwxyz]a[bjkquvz]\b/i
describe RM_sl_RandomLetters3a Subject contains random-text spamsign
score RM_sl_RandomLetters3a 1.375 # 75s/1h of 58857 corpus; ham: special anacronym

header RM_sl_RandomLetters3b Subject =~ /\bx[bfghjklnpqrstwz][bfghjklnpqrstwz]\b/i
describe RM_sl_RandomLetters3b Subject contains random-text spamsign
score RM_sl_RandomLetters3b 1.580 # 58s/0h of 58857 corpus

header RM_sl_RandomLetters3c Subject =~ / [fghjklnqrtz]{3} /i
describe RM_sl_RandomLetters3c Subject contains random-text spamsign
score RM_sl_RandomLetters3c 3.000 # 212s/0h of 58857 corpus
# avoid PGP,DSL,BBQ,DNS,TBS,WWW,FTP,WFB,FWD,SSH,ksh,SQL,SSL,LTD,DDT,FDR

header RM_sl_RandomLetters4a Subject =~ /\b[eiou][bfghjklnpqrtwz]{3}\b/i
describe RM_sl_RandomLetters4a Subject contains random-text spamsign
score RM_sl_RandomLetters4a 2.140 # 114s/0h of 58857 corpus
# lots of ham with leading A

header RM_sl_RandomLetters5a Subject =~ /\b[bcdfghjklnpqrvwz]{5}\b/i
describe RM_sl_RandomLetters5a Subject contains random-text spamsign
score RM_sl_RandomLetters5a 3.000 # 245s/0h of 58857 corpus

header RM_sl_RandomCons6a Subject =~ /\b[bcdghjklmnpqrstvwxz]{6}\b/i
describe RM_sl_RandomCons6a Subject contains random-text spamsign
score RM_sl_RandomCons6a 3.000 # 325s/0h of 58857 corpus

header RM_sl_RandomCons7a Subject =~ /\b[bcdfghjklmnpqrstvwxz]{7}\b/i
describe RM_sl_RandomCons7a Subject contains random-text spamsign
score RM_sl_RandomCons7a 2.097 # 329s/2h of 58857 corpus
# ham: JDBGMGR hoax and response

header RM_st_LongSubject Subject =~ /.{170,}/
describe RM_st_LongSubject Subject is excessively long -- more than 169 chars
score RM_st_LongSubject 9.100 # 139s/0h of 58857 corpus

header RM_st_RandomText Subject =~ /\%RANDOM_TEXT|\%RANDOM_WORD/i
describe RM_st_RandomText Subject contains random-text spamsign
score RM_st_RandomText 9.1 # 8 spam, 0 ham, Sep 5 2003; 3s/0h of 58857 corpus

header RM_st_USAscii Subject:raw =~ /us-ascii/i
describe RM_st_USAscii Subject specifies display in US-ascii, unnecessary unless spam hides subject
score RM_st_USAscii 0.900 # 27s/2h of 58857 corpus, ham = MS Passport.com

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
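
A quick way to sanity-check the "avoid" list on the RM_sl_RandomLetters3c rule is to run the listed acronyms through the pattern. The sketch below does that with Python's `re`, assuming it behaves like Perl for this simple pattern. The `false_hits` helper and the sample subjects are invented for the example, and a real check should still be run against an actual ham corpus.

```python
import re

# Python version of the posted pattern / [fghjklnqrtz]{3} /i.
# The leading and trailing spaces are literal and part of the match.
RANDOM3C = re.compile(r" [fghjklnqrtz]{3} ", re.I)

# Acronyms listed in the rule's "avoid" comment.
AVOID = "PGP,DSL,BBQ,DNS,TBS,WWW,FTP,WFB,FWD,SSH,ksh,SQL,SSL,LTD,DDT,FDR".split(",")


def false_hits(regex, words):
    """Return the words that trip `regex` when placed mid-subject (hypothetical helper)."""
    return [w for w in words if regex.search(f"Re: {w} question")]


print(false_hits(RANDOM3C, AVOID))                       # []
print(bool(RANDOM3C.search("great deals qzt inside")))   # True
```

Each listed acronym contains at least one letter outside the `[fghjklnqrtz]` set, which is why none of them match. Note that the pattern needs a space on both sides, so a three-letter token at the very start or end of a subject is not matched by it.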