Among the recommendations for detecting spam that contains Bayes fodder were:

rawbody   WORDWORD  /[a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} /
describe  WORDWORD  long string of random words
score     WORDWORD  2.0

and

rawbody   WORDWORD2 /\b(?:[a-z]{4,12}\s+){12}/
describe  WORDWORD2 long string of random words
score     WORDWORD2 2.0

Running these against my corpus, I find

WORDWORD  -- 4212s/14h of 87289 corpus (70035s/17254h)
WORDWORD2 -- 4205s/12h of 87289 corpus (70035s/17254h)

I'm working on a similar sort of idea, and built several rules and
worked them through mass-check. As of last night I have the following
frequencies:

OVERALL    SPAM    HAM     S/O   RANK  SCORE  NAME
  87289   70035  17254   0.802   0.00   0.00  (all messages)
   4518    4517      1   0.999   1.00   0.00  RM_bpt_longwords69m
   5154    5152      2   0.998   1.00   0.27  RM_bpt_longwords68m
   2635    2635      0   1.000   1.00   1.00  RM_bpt_longwords78m
   1899    1899      0   1.000   1.00   0.00  RM_bpt_longwords79m
    927     927      0   1.000   0.99   0.00  RM_bpt_longwords96m
    791     791      0   1.000   0.99   0.00  RM_bpt_longwords88m
   2720    2719      1   0.999   0.99   1.00  RM_bpt_longwords78l
    591     591      0   1.000   0.99   1.00  RM_bpt_longwords89m
    573     573      0   1.000   0.99   1.00  RM_bpt_longwords97
    528     528      0   1.000   0.99   1.00  RM_bpt_longwords97l
    507     507      0   1.000   0.99   0.50  RM_bpt_longwords98
    499     499      0   1.000   0.99   1.00  RM_bpt_longwords97m
    483     483      0   1.000   0.99   0.50  RM_bpt_longwords99
    471     471      0   1.000   0.99   0.50  RM_bpt_longwords98l
    448     448      0   1.000   0.99   0.50  RM_bpt_longwords99l
    441     441      0   1.000   0.99   0.50  RM_bpt_longwords98m
    421     421      0   1.000   0.99   0.50  RM_bpt_longwords99m
   4703    4699      4   0.997   0.99   0.00  RM_bpt_longwords69
   4657    4653      4   0.997   0.99   0.00  RM_bpt_longwords69l
   1942    1941      1   0.998   0.99   0.00  RM_bpt_longwords79l
   5797    5790      7   0.995   0.99   0.54  RM_bpt_longwords67m
   6185    6177      8   0.995   0.99   1.28  RM_bpt_longwords59m
   2764    2762      2   0.997   0.99   1.00  RM_bpt_longwords78
   1979    1977      2   0.996   0.98   0.00  RM_bpt_longwords79
   3607    3602      5   0.994   0.98   0.48  RM_bpt_longwords77m
    958     957      1   0.996   0.97   0.00  RM_bpt_longwords96l
    830     829      1   0.995   0.97   0.00  RM_bpt_longwords88l
   5347    5336     11   0.992   0.97   0.27  RM_bpt_longwords68l
   1293    1291      2   0.994   0.97   0.00  RM_bpt_longwords96
   1252    1250      2   0.994   0.97   0.00  RM_bpt_longwords87m
   5414    5401     13   0.990   0.97   0.27  RM_bpt_longwords68
    627     626      1   0.994   0.96   1.00  RM_bpt_longwords89l
   2271    2266      5   0.991   0.96   0.27  RM_bpt_longwords86m
   1341    1338      3   0.991   0.96   0.00  RM_bpt_longwords87
   1295    1292      3   0.991   0.95   0.00  RM_bpt_longwords87l
   2686    2679      7   0.990   0.95   0.27  RM_bpt_longwords86
    869     867      2   0.991   0.95   0.00  RM_bpt_longwords88
   3763    3752     11   0.988   0.95   0.48  RM_bpt_longwords77
   3708    3697     11   0.988   0.95   0.48  RM_bpt_longwords77l
   6522    6499     23   0.986   0.95   0.47  RM_bpt_longwords58m
   2339    2332      7   0.988   0.95   0.27  RM_bpt_longwords86l
    664     662      2   0.988   0.94   1.00  RM_bpt_longwords89
   1571    1566      5   0.987   0.94   0.00  RM_bpt_longwords95m
   6493    6462     31   0.981   0.93   1.28  RM_bpt_longwords59l
   6558    6526     32   0.980   0.93   1.28  RM_bpt_longwords59
   4972    4948     24   0.981   0.92   0.06  RM_bpt_longwords76m
   3667    3647     20   0.978   0.91   0.97  RM_bpt_longwords85m
   6119    6072     47   0.970   0.88   0.54  RM_bpt_longwords67l
   6214    6166     48   0.969   0.88   0.54  RM_bpt_longwords67
   1635    1623     12   0.971   0.87   0.00  RM_bpt_longwords95l
   6976    6917     59   0.967   0.87   0.00  RM_bpt_longwords66m
   4250    4215     35   0.967   0.87   0.97  RM_bpt_longwords85
   3882    3850     32   0.967   0.86   0.97  RM_bpt_longwords85l
   1989    1973     16   0.968   0.86   0.00  RM_bpt_longwords95
   5586    5523     63   0.956   0.82   0.06  RM_bpt_longwords76
   7231    7147     84   0.954   0.82   0.22  RM_bpt_longwords57m
   7042    6956     86   0.952   0.81   0.47  RM_bpt_longwords58l
   7142    7054     88   0.952   0.81   0.47  RM_bpt_longwords58
   5174    5111     63   0.952   0.81   0.06  RM_bpt_longwords76l
   6689    6566    123   0.929   0.73   0.20  RM_bpt_longwords75m
   8136    7942    194   0.910   0.66   0.00  RM_bpt_longwords66
   7668    7484    184   0.909   0.65   0.00  RM_bpt_longwords66l
   8097    7854    243   0.888   0.58   0.22  RM_bpt_longwords57l
   8245    7996    249   0.888   0.58   0.22  RM_bpt_longwords57
   8167    7907    260   0.882   0.56   0.20  RM_bpt_longwords75
   7806    7552    254   0.880   0.55   0.20  RM_bpt_longwords75l
   9127    8802    325   0.870   0.52   0.50  RM_bpt_longwords65m
   9105    8719    386   0.848   0.45   0.50  RM_bpt_longwords56m
  13367   12597    770   0.801   0.33   0.50  RM_bpt_longwords65l
  13640   12832    808   0.796   0.31   0.50  RM_bpt_longwords65
  11809   10989    820   0.768   0.23   0.50  RM_bpt_longwords56
  11126   10347    779   0.766   0.23   0.50  RM_bpt_longwords56l
  14103   12699   1404   0.690   0.06   0.50  RM_bpt_longwords55m
  21752   19304   2448   0.660   0.02   0.50  RM_bpt_longwords55l
  22457   19810   2647   0.648   0.00   0.50  RM_bpt_longwords55

Without too much delay, I hope to be able to suggest a ruleset of a
dozen or fewer rules that will help identify/flag these types of spam.

Bob Menschel

Thursday, January 8, 2004, 6:48:57 PM, Chris wrote:

>> Looks good. Just running this over a ham mail box with about 500 messages
>> and a spam mail box with the same, and not decoding base64 and such, I
>> see the following:

CP> What about something like:

CP> /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){12}/

CP> I'm trying to think of extremely common 4-letter words, so this is
CP> probably just a quick example.

>> I tend to like the idea of weighting the 10 sequence low, say 0.5,
>> and the 13 sequence would get an extra bump of 2.0 more (making a
>> total of 2.5).

CP> That makes sense. Though I'd probably go with 10 low, and 15 high (like
CP> 3 or more). But that's just me:

CP> rawbody   WORDWORD_10 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/
CP> describe  WORDWORD_10 string of 10+ random words
CP> score     WORDWORD_10 .5

CP> rawbody   WORDWORD_15 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){15}/
CP> describe  WORDWORD_15 string of 15+ random words
CP> score     WORDWORD_15 2.5

--
Best regards,
Robert mailto:[EMAIL PROTECTED]
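
A note for anyone checking the frequency table: the S/O figures for the individual rules appear to be computed from per-class hit rates (fraction of spam hit, divided by the sum of the fraction of spam hit and the fraction of ham hit), not from the raw counts. That reading is an assumption, but it reproduces the rows it was tried against. A minimal Python sketch, with `s_over_o` as a hypothetical helper name and the corpus sizes taken from the "(all messages)" row:

```python
# Corpus sizes taken from the "(all messages)" row of the table.
TOTAL_SPAM = 70035
TOTAL_HAM = 17254


def s_over_o(spam_hits, ham_hits, total_spam=TOTAL_SPAM, total_ham=TOTAL_HAM):
    """Spam share of a rule's hits after normalising by class size.

    Assumption: S/O = spam_rate / (spam_rate + ham_rate), where each rate
    is the fraction of that class the rule hit.
    """
    spam_rate = spam_hits / total_spam
    ham_rate = ham_hits / total_ham
    if spam_rate + ham_rate == 0:
        return 0.0  # the rule hit nothing at all
    return spam_rate / (spam_rate + ham_rate)


# Spam/ham hit counts copied from three rows of the table.
for name, spam_hits, ham_hits in [
    ("RM_bpt_longwords55", 19810, 2647),
    ("RM_bpt_longwords65l", 12597, 770),
    ("RM_bpt_longwords69m", 4517, 1),
]:
    print(f"{name:22s} S/O = {s_over_o(spam_hits, ham_hits):.3f}")
```

For these three rows the sketch gives 0.648, 0.801 and 0.999, matching the table. The "(all messages)" row is different: its 0.802 is simply 70035/87289, the raw spam share of the corpus.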
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
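
The word-run patterns discussed in this thread can be tried outside SpamAssassin. The sketch below is only an illustration under stated assumptions: it uses Python's `re` module rather than Perl, the sample strings are invented, and all variable names are hypothetical. It also shows why the common-word exclusion has to be written as `(?!...)`. If a `=` follows `(?!`, the lookahead tests for a literal `=`, which can never sit where a letter is about to match, so the exclusion list is silently ignored.

```python
import re

# Plain run of twelve lowercase words of 4-12 letters, as in WORDWORD2.
RUN_12 = re.compile(r"\b(?:[a-z]{4,12}\s+){12}")

# Same idea, with a negative lookahead that refuses a few common words.
COMMON = r"(?:from|even|more|were|with)"
RUN_12_EXCL = re.compile(r"(?:\b(?!" + COMMON + r"\b)[a-z]{4,12}\s+){12}")

# With a stray "=" the lookahead looks for a literal "=" and never fires.
RUN_12_STRAY = re.compile(r"(?:\b(?!=" + COMMON + r"\b)[a-z]{4,12}\s+){12}")

# Invented samples: twelve filler words, then the same run with "with" in it.
fodder = "apple table house green river stone cloud paper light music plant chair end"
prose = "apple table house green river stone with paper light music plant chair end"

for label, text in [("fodder", fodder), ("prose", prose)]:
    print(
        label,
        bool(RUN_12.search(text)),
        bool(RUN_12_EXCL.search(text)),
        bool(RUN_12_STRAY.search(text)),
    )
```

On these samples the plain pattern and the stray-`=` pattern match both strings, while the `(?!...)` version matches only the first. A real exclusion list would need far more than five words, and any variant should be run through mass-check against a corpus before it is scored.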