Re[2]: [SAtalk] detecting large collections of random words

Robert Menschel Thu, 08 Jan 2004 23:41:18 -0800

Among the recommendations for detecting spam that contains Bayes fodder
were:


rawbody WORDWORD        /[a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} [a-z]{4,12} /
describe WORDWORD       long string of random words
score WORDWORD          2.0

and
rawbody  WORDWORD2        /\b(?:[a-z]{4,12}\s+){12}/
describe WORDWORD2       long string of random words
score    WORDWORD2          2.0
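
One way to see how the two patterns differ is to try them outside
SpamAssassin. The sketch below uses Python's re module as a stand-in for
Perl's regex engine. The sample strings are made up for illustration, and the
sketch does not model how SpamAssassin actually feeds rawbody text to a rule.

```python
import re

# The two rule patterns from above, written for Python's re module.
WORDWORD = re.compile(" ".join(["[a-z]{4,12}"] * 12) + " ")
WORDWORD2 = re.compile(r"\b(?:[a-z]{4,12}\s+){12}")

# Made-up sample data: 12 lowercase words of 5 letters each.
words = ("apple brick cloud dance eagle flame "
         "grape house ivory joker knife lemon").split()

single_spaced = " ".join(words) + " "
double_spaced = "  ".join(words) + " "   # two spaces between words
mid_word = "X" + single_spaced           # first word glued to a capital letter

for name, text in [("single", single_spaced),
                   ("double", double_spaced),
                   ("mid-word", mid_word)]:
    print(name, bool(WORDWORD.search(text)), bool(WORDWORD2.search(text)))
```

This prints "single True True", "double False True" and "mid-word True
False": WORDWORD requires exactly one space between words but may start in
the middle of a longer token, while WORDWORD2 accepts any run of whitespace
but must start at a word boundary.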

Running these against my corpus, I find
WORDWORD  -- 4212s/14h of 87289 corpus (70035s/17254h)
WORDWORD2 -- 4205s/12h of 87289 corpus (70035s/17254h)

I'm working on a similar idea: I built several rules and ran them through
mass-check.  As of last night I have the following frequencies:

OVERALL     SPAM      HAM      S/O   RANK  SCORE  NAME
  87289    70035    17254    0.802   0.00   0.00  (all messages)
   4518     4517        1    0.999   1.00   0.00  RM_bpt_longwords69m
   5154     5152        2    0.998   1.00   0.27  RM_bpt_longwords68m
   2635     2635        0    1.000   1.00   1.00  RM_bpt_longwords78m
   1899     1899        0    1.000   1.00   0.00  RM_bpt_longwords79m
    927      927        0    1.000   0.99   0.00  RM_bpt_longwords96m
    791      791        0    1.000   0.99   0.00  RM_bpt_longwords88m
   2720     2719        1    0.999   0.99   1.00  RM_bpt_longwords78l
    591      591        0    1.000   0.99   1.00  RM_bpt_longwords89m
    573      573        0    1.000   0.99   1.00  RM_bpt_longwords97
    528      528        0    1.000   0.99   1.00  RM_bpt_longwords97l
    507      507        0    1.000   0.99   0.50  RM_bpt_longwords98
    499      499        0    1.000   0.99   1.00  RM_bpt_longwords97m
    483      483        0    1.000   0.99   0.50  RM_bpt_longwords99
    471      471        0    1.000   0.99   0.50  RM_bpt_longwords98l
    448      448        0    1.000   0.99   0.50  RM_bpt_longwords99l
    441      441        0    1.000   0.99   0.50  RM_bpt_longwords98m
    421      421        0    1.000   0.99   0.50  RM_bpt_longwords99m
   4703     4699        4    0.997   0.99   0.00  RM_bpt_longwords69
   4657     4653        4    0.997   0.99   0.00  RM_bpt_longwords69l
   1942     1941        1    0.998   0.99   0.00  RM_bpt_longwords79l
   5797     5790        7    0.995   0.99   0.54  RM_bpt_longwords67m
   6185     6177        8    0.995   0.99   1.28  RM_bpt_longwords59m
   2764     2762        2    0.997   0.99   1.00  RM_bpt_longwords78
   1979     1977        2    0.996   0.98   0.00  RM_bpt_longwords79
   3607     3602        5    0.994   0.98   0.48  RM_bpt_longwords77m
    958      957        1    0.996   0.97   0.00  RM_bpt_longwords96l
    830      829        1    0.995   0.97   0.00  RM_bpt_longwords88l
   5347     5336       11    0.992   0.97   0.27  RM_bpt_longwords68l
   1293     1291        2    0.994   0.97   0.00  RM_bpt_longwords96
   1252     1250        2    0.994   0.97   0.00  RM_bpt_longwords87m
   5414     5401       13    0.990   0.97   0.27  RM_bpt_longwords68
    627      626        1    0.994   0.96   1.00  RM_bpt_longwords89l
   2271     2266        5    0.991   0.96   0.27  RM_bpt_longwords86m
   1341     1338        3    0.991   0.96   0.00  RM_bpt_longwords87
   1295     1292        3    0.991   0.95   0.00  RM_bpt_longwords87l
   2686     2679        7    0.990   0.95   0.27  RM_bpt_longwords86
    869      867        2    0.991   0.95   0.00  RM_bpt_longwords88
   3763     3752       11    0.988   0.95   0.48  RM_bpt_longwords77
   3708     3697       11    0.988   0.95   0.48  RM_bpt_longwords77l
   6522     6499       23    0.986   0.95   0.47  RM_bpt_longwords58m
   2339     2332        7    0.988   0.95   0.27  RM_bpt_longwords86l
    664      662        2    0.988   0.94   1.00  RM_bpt_longwords89
   1571     1566        5    0.987   0.94   0.00  RM_bpt_longwords95m
   6493     6462       31    0.981   0.93   1.28  RM_bpt_longwords59l
   6558     6526       32    0.980   0.93   1.28  RM_bpt_longwords59
   4972     4948       24    0.981   0.92   0.06  RM_bpt_longwords76m
   3667     3647       20    0.978   0.91   0.97  RM_bpt_longwords85m
   6119     6072       47    0.970   0.88   0.54  RM_bpt_longwords67l
   6214     6166       48    0.969   0.88   0.54  RM_bpt_longwords67
   1635     1623       12    0.971   0.87   0.00  RM_bpt_longwords95l
   6976     6917       59    0.967   0.87   0.00  RM_bpt_longwords66m
   4250     4215       35    0.967   0.87   0.97  RM_bpt_longwords85
   3882     3850       32    0.967   0.86   0.97  RM_bpt_longwords85l
   1989     1973       16    0.968   0.86   0.00  RM_bpt_longwords95
   5586     5523       63    0.956   0.82   0.06  RM_bpt_longwords76
   7231     7147       84    0.954   0.82   0.22  RM_bpt_longwords57m
   7042     6956       86    0.952   0.81   0.47  RM_bpt_longwords58l
   7142     7054       88    0.952   0.81   0.47  RM_bpt_longwords58
   5174     5111       63    0.952   0.81   0.06  RM_bpt_longwords76l
   6689     6566      123    0.929   0.73   0.20  RM_bpt_longwords75m
   8136     7942      194    0.910   0.66   0.00  RM_bpt_longwords66
   7668     7484      184    0.909   0.65   0.00  RM_bpt_longwords66l
   8097     7854      243    0.888   0.58   0.22  RM_bpt_longwords57l
   8245     7996      249    0.888   0.58   0.22  RM_bpt_longwords57
   8167     7907      260    0.882   0.56   0.20  RM_bpt_longwords75
   7806     7552      254    0.880   0.55   0.20  RM_bpt_longwords75l
   9127     8802      325    0.870   0.52   0.50  RM_bpt_longwords65m
   9105     8719      386    0.848   0.45   0.50  RM_bpt_longwords56m
  13367    12597      770    0.801   0.33   0.50  RM_bpt_longwords65l
  13640    12832      808    0.796   0.31   0.50  RM_bpt_longwords65
  11809    10989      820    0.768   0.23   0.50  RM_bpt_longwords56
  11126    10347      779    0.766   0.23   0.50  RM_bpt_longwords56l
  14103    12699     1404    0.690   0.06   0.50  RM_bpt_longwords55m
  21752    19304     2448    0.660   0.02   0.50  RM_bpt_longwords55l
  22457    19810     2647    0.648   0.00   0.50  RM_bpt_longwords55
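
The S/O figures in the table appear to be the rule's spam hit rate divided by
the sum of its spam and ham hit rates, so they are normalised for the
different sizes of the two corpora. That formula is inferred from the numbers,
not stated in this post, and the function name s_o below is only
illustrative. A minimal sketch that reproduces a few rows:

```python
# Corpus sizes from the "(all messages)" row of the table.
SPAM_TOTAL = 70035
HAM_TOTAL = 17254

def s_o(spam_hits, ham_hits):
    """Spam share of a rule's hits, normalised by corpus size (assumed formula)."""
    spam_rate = spam_hits / SPAM_TOTAL   # fraction of spam messages the rule hit
    ham_rate = ham_hits / HAM_TOTAL      # fraction of ham messages the rule hit
    return spam_rate / (spam_rate + ham_rate)

# A few (spam, ham) hit counts copied from the table.
for name, spam, ham in [
    ("RM_bpt_longwords69m", 4517, 1),
    ("RM_bpt_longwords56m", 8719, 386),
    ("RM_bpt_longwords55", 19810, 2647),
]:
    print(f"{name}: {s_o(spam, ham):.3f}")
```

The three values come out as 0.999, 0.848 and 0.648, matching the table. The
"(all messages)" row does not follow this formula; its 0.802 is simply
70035/87289.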

I hope to be able to suggest, without too much delay, a ruleset of a dozen
or fewer rules that will help identify/flag these types of spam.

Bob Menschel




Thursday, January 8, 2004, 6:48:57 PM, Chris wrote:

>> Looks good. just running this over a ham mail box with about 500 messages
>> and a spam mail box with the same, and not decoding base64 and such, I
>> see the following:

CP> what about something like:

CP> /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){12}/

CP> I'm trying to think of extremely common 4-letter words, so this is
CP> probably just a quick example.

>> I tend to like the idea of weighting the 10 sequence low, say 0.5,
>> and the 13 sequence would get an extra bump of 2.0 more (making a
>> total of 2.5).

CP> That makes sense.  Though I'd probably go with 10 low, and 15 high (like
CP> 3 or more).  But that's just me:

CP> rawbody  WORDWORD_10       /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/
CP> describe WORDWORD_10       string of 10+ random words
CP> score    WORDWORD_10       0.5

CP> rawbody  WORDWORD_15       /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){15}/
CP> describe WORDWORD_15       string of 15+ random words
CP> score    WORDWORD_15       2.5
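
The tiered idea quoted above (a low score for a run of 10 words, more for a
run of 15, skipping very common words) can be tried out with a rough sketch
like the following. It uses Python's re module, not SpamAssassin, and writes
the exclusion as a plain negative lookahead, (?!...). The helper names
run_pattern and score are hypothetical, and the sample words are made up.

```python
import re

# Common words to exclude, taken from the quoted example.
COMMON = "from|even|more|were|with"

def run_pattern(n):
    # Each repetition must start at a word boundary, must not be one of the
    # COMMON words, and must be 4-12 lowercase letters followed by whitespace.
    return re.compile(r"(?:\b(?!(?:%s)\b)[a-z]{4,12}\s+){%d}" % (COMMON, n))

# (pattern, points) pairs mirroring the WORDWORD_10 / WORDWORD_15 scores.
TIERS = [(run_pattern(10), 0.5), (run_pattern(15), 2.5)]

def score(text):
    # Sum the points of every tier whose pattern is found in the text.
    return sum(points for pattern, points in TIERS if pattern.search(text))

words = ("apple brick cloud dance eagle flame grape house "
         "ivory joker knife lemon mango night ocean").split()

print(score(" ".join(words) + " "))        # 15 words in a row
print(score(" ".join(words[:10]) + " "))   # 10 words in a row
print(score(" ".join(words[:9] + ["with"] + words[:9]) + " "))  # run broken by "with"
```

The three calls print 3.0, 0.5 and 0: a 15-word run triggers both tiers, a
10-word run only the first, and a common word in the middle splits the run
into two 9-word pieces that reach neither threshold.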
                                                                                       
                  




-- 
Best regards,
 Robert                            mailto:[EMAIL PROTECTED]




