This is a forwarded message
From: Robert Menschel <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Date: Saturday, January 24, 2004, 7:10:18 PM
Subject: [RulesEmporium] Longwords
===8<==============Original message text===============

I received an email this morning which reminded me about my longwords rules, which apparently got lost when I migrated my mass-check system from my mail server to my PC. This was my exploration of the random words spammers have been including at the bottom of their emails, or in their text portions, or in their invisible text, to confuse some anti-spam software. (I call these words Bayes Fodder, since over time it seems they are helping my Bayes identify spam better and better and better.)

Anyway, I rebuilt, reran, refined, and:

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL   SPAM    HAM    S/O   RANK  SCORE  NAME
  91714  74113  17601  0.808   0.00   0.00  (all messages)
   7431   7429      2  0.999   1.00   3.00  RM_bpt_longwords68a
   6596   6595      1  0.999   0.98   1.00  RM_bpt_longwords69a
   4163   4163      0  1.000   0.71   2.00  RM_bpt_longwords78a
   8761   8753      8  0.996   0.51   3.00  RM_bpt_longwords59a
   2950   2950      0  1.000   0.48   1.00  RM_bpt_longwords79a
   1162   1162      0  1.000   0.15   4.00  RM_bpt_longwords96a
   1025   1025      0  1.000   0.13   4.00  RM_bpt_longwords88a
    590    590      0  1.000   0.05   1.00  RM_bpt_longwords89a
    545    545      0  1.000   0.04   3.00  RM_bpt_longwords97
    442    442      0  1.000   0.02   1.00  RM_bpt_longwords98
    330    330      0  1.000   0.00   1.00  RM_bpt_longwords99

OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
   91714    74113    17601  0.808   0.00   0.00  (all messages)
 100.000  80.8088  19.1912  0.808   0.00   0.00  (all messages as %)
   8.102  10.0239   0.0114  0.999   1.00   3.00  RM_bpt_longwords68a
   7.192   8.8986   0.0057  0.999   0.98   1.00  RM_bpt_longwords69a
   4.539   5.6171   0.0000  1.000   0.71   2.00  RM_bpt_longwords78a
   9.553  11.8103   0.0455  0.996   0.51   3.00  RM_bpt_longwords59a
   3.217   3.9804   0.0000  1.000   0.48   1.00  RM_bpt_longwords79a
   1.267   1.5679   0.0000  1.000   0.15   4.00  RM_bpt_longwords96a
   1.118   1.3830   0.0000  1.000   0.13   4.00  RM_bpt_longwords88a
   0.643   0.7961   0.0000  1.000   0.05   1.00  RM_bpt_longwords89a
   0.594   0.7354   0.0000  1.000   0.04   3.00  RM_bpt_longwords97
   0.482   0.5964   0.0000  1.000   0.02   1.00  RM_bpt_longwords98
   0.360   0.4453   0.0000  1.000   0.00   1.00  RM_bpt_longwords99

The scores are, of course, tuned to my required-hits threshold of 9.0, so you'll probably want to lower them. Depending on your system, an initial score of 0.5 or 1.0 for each rule might be worthwhile, and then you can increase the scores slowly if this kind of spam continues to sneak past your system.

In my 19k corpus, one ham matches three of these rules, two of which I've scored at 3.0, and so that ham gets a score of 7.0 out of 9. I may reduce those rules to 2.5 or 2.0 instead of 3.0 once I complete my next global mass-check. So yes, caution is advised.

Bob Menschel

body     RM_bpt_longwords68a /\b(?:[a-z]{6,}\s+){8}/
describe RM_bpt_longwords68a Long string of long words
score    RM_bpt_longwords68a 3.000  # 7429s/2h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list,
# "improving compatibility between computer platforms demands certain levels "

body     RM_bpt_longwords69a /\b(?:[a-z]{6,}\s+){9}/
describe RM_bpt_longwords69a Long string of long words
score    RM_bpt_longwords69a 1.000  # type=max:1 (add to 59a,68a) - 6595s/1h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list

body     RM_bpt_longwords78a /\b(?:[a-z]{7,}\s+){8}/
describe RM_bpt_longwords78a Long string of long words
score    RM_bpt_longwords78a 2.000  # type=max:2 (add to 68a) - 4163s/0h of 91714 corpus (74113s/17601h) 01/23/04

body     RM_bpt_longwords59a /\b(?:[a-z]{5,}\s+){9}/
describe RM_bpt_longwords59a Long string of long words
score    RM_bpt_longwords59a 3.000  # 8753s/8h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list

body     RM_bpt_longwords79a /\b(?:[a-z]{7,}\s+){9}/
describe RM_bpt_longwords79a Long string of long words
score    RM_bpt_longwords79a 1.000  # type=max:1 (add to 78a) - 2950s/0h of 91714 corpus (74113s/17601h) 01/23/04

body     RM_bpt_longwords96a /\b(?:[a-z]{9,}\s+){6}/
describe RM_bpt_longwords96a Long string of long words
score    RM_bpt_longwords96a 4.000  # 1162s/0h of 91714 corpus (74113s/17601h) 01/23/04

body     RM_bpt_longwords88a /\b(?:[a-z]{8,}\s+){8}/
describe RM_bpt_longwords88a Long string of long words
score    RM_bpt_longwords88a 4.000  # 1025s/0h of 91714 corpus (74113s/17601h) 01/23/04

body     RM_bpt_longwords89a /\b(?:[a-z]{8,}\s+){9}/
describe RM_bpt_longwords89a Long string of long words
score    RM_bpt_longwords89a 1.000  # type=max:1 (add to 88a) - 590s/0h of 91714 corpus (74113s/17601h) 01/23/04

body     RM_bpt_longwords97 /\b(?:\w{9,}\s+){7}/
describe RM_bpt_longwords97 Long string of long words
score    RM_bpt_longwords97 3.000  # 545s/0h of 91714 corpus (74113s/17601h) 01/23/04

body     RM_bpt_longwords98 /\b(?:\w{9,}\s+){8}/
describe RM_bpt_longwords98 Long string of long words
score    RM_bpt_longwords98 1.000  # type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04

body     RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
describe RM_bpt_longwords99 Long string of long words
score    RM_bpt_longwords99 1.000  # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
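
Because several of the rules above overlap, one run of long words can trigger more than one of them, and their scores add up. The sketch below is only a rough way to try the patterns on sample text before adopting them. It assumes Python's `re` module as a stand-in for SpamAssassin's Perl matching, does not reproduce SpamAssassin's body rendering, and ignores the `type=max` notes in the score comments. The `RULES` table and the `evaluate` helper are hypothetical names introduced here; the patterns and scores are copied from the rules in the message.

```python
import re

# Patterns and scores copied from the rules in the message.
# RULES and evaluate() are illustrative names, not part of SpamAssassin.
RULES = {
    "RM_bpt_longwords59a": (r"\b(?:[a-z]{5,}\s+){9}", 3.0),
    "RM_bpt_longwords68a": (r"\b(?:[a-z]{6,}\s+){8}", 3.0),
    "RM_bpt_longwords69a": (r"\b(?:[a-z]{6,}\s+){9}", 1.0),
    "RM_bpt_longwords78a": (r"\b(?:[a-z]{7,}\s+){8}", 2.0),
    "RM_bpt_longwords79a": (r"\b(?:[a-z]{7,}\s+){9}", 1.0),
    "RM_bpt_longwords88a": (r"\b(?:[a-z]{8,}\s+){8}", 4.0),
    "RM_bpt_longwords89a": (r"\b(?:[a-z]{8,}\s+){9}", 1.0),
    "RM_bpt_longwords96a": (r"\b(?:[a-z]{9,}\s+){6}", 4.0),
    "RM_bpt_longwords97": (r"\b(?:\w{9,}\s+){7}", 3.0),
    "RM_bpt_longwords98": (r"\b(?:\w{9,}\s+){8}", 1.0),
    "RM_bpt_longwords99": (r"\b(?:\w{9,}\s+){9}", 1.0),
}


def evaluate(text):
    """Return (sorted names of rules that match, sum of their scores)."""
    hits = sorted(
        name for name, (pattern, _score) in RULES.items()
        if re.search(pattern, text)
    )
    total = sum(RULES[name][1] for name in hits)
    return hits, total


if __name__ == "__main__":
    # The ham phrase quoted in the 68a rule comment (trailing space kept,
    # since each word must be followed by whitespace to count).
    ham = ("improving compatibility between computer platforms "
           "demands certain levels ")
    print(evaluate(ham))

    # A made-up run of nine six-letter words, for illustration only.
    sample = "garden silver window button candle forest pocket orange rabbit "
    print(evaluate(sample))
```

With these inputs, the quoted ham phrase hits only RM_bpt_longwords68a (eight words of six or more letters) for a total of 3.0. The made-up nine-word sample hits 59a, 68a, and 69a for a total of 7.0, the same kind of stacking that puts one ham at 7.0 out of 9 in the message.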