[SAtalk] Longwords

Robert Menschel Mon, 26 Jan 2004 21:27:46 -0800

This is a forwarded message
From: Robert Menschel <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Date: Saturday, January 24, 2004, 7:10:18 PM
Subject: [RulesEmporium] Longwords


===8<==============Original message text===============
I received an email this morning that reminded me of my longwords
rules, which apparently got lost when I migrated my mass-check system
from my mail server to my PC.

This was my exploration of the random words spammers have been including
at the bottom of their emails, or in their text portions, or in their
invisible text, to confuse some anti-spam software. (I call these words
Bayes Fodder, since over time it seems they are helping my Bayes identify
spam better and better and better.)

Anyway, I rebuilt, reran, refined, and:

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL     SPAM      HAM     S/O    RANK   SCORE  NAME
  91714    74113    17601    0.808   0.00    0.00  (all messages)
   7431     7429        2    0.999   1.00    3.00  RM_bpt_longwords68a
   6596     6595        1    0.999   0.98    1.00  RM_bpt_longwords69a
   4163     4163        0    1.000   0.71    2.00  RM_bpt_longwords78a
   8761     8753        8    0.996   0.51    3.00  RM_bpt_longwords59a
   2950     2950        0    1.000   0.48    1.00  RM_bpt_longwords79a
   1162     1162        0    1.000   0.15    4.00  RM_bpt_longwords96a
   1025     1025        0    1.000   0.13    4.00  RM_bpt_longwords88a
    590      590        0    1.000   0.05    1.00  RM_bpt_longwords89a
    545      545        0    1.000   0.04    3.00  RM_bpt_longwords97
    442      442        0    1.000   0.02    1.00  RM_bpt_longwords98
    330      330        0    1.000   0.00    1.00  RM_bpt_longwords99

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  91714    74113    17601    0.808   0.00    0.00  (all messages)
100.000  80.8088  19.1912    0.808   0.00    0.00  (all messages as %)
  8.102  10.0239   0.0114    0.999   1.00    3.00  RM_bpt_longwords68a
  7.192   8.8986   0.0057    0.999   0.98    1.00  RM_bpt_longwords69a
  4.539   5.6171   0.0000    1.000   0.71    2.00  RM_bpt_longwords78a
  9.553  11.8103   0.0455    0.996   0.51    3.00  RM_bpt_longwords59a
  3.217   3.9804   0.0000    1.000   0.48    1.00  RM_bpt_longwords79a
  1.267   1.5679   0.0000    1.000   0.15    4.00  RM_bpt_longwords96a
  1.118   1.3830   0.0000    1.000   0.13    4.00  RM_bpt_longwords88a
  0.643   0.7961   0.0000    1.000   0.05    1.00  RM_bpt_longwords89a
  0.594   0.7354   0.0000    1.000   0.04    3.00  RM_bpt_longwords97
  0.482   0.5964   0.0000    1.000   0.02    1.00  RM_bpt_longwords98
  0.360   0.4453   0.0000    1.000   0.00    1.00  RM_bpt_longwords99
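
For anyone wanting to check the percentage columns against the raw counts, here is a minimal Python sketch. It assumes SPAM% and HAM% are each rule's hits divided by that class's total, and that S/O is spam% / (spam% + ham%); that reading is inferred from the numbers in the table, not from the mass-check tools themselves. The function name `frequencies` is made up for illustration. The "(all messages)" row is a special case: its 0.808 is simply spam's share of the whole corpus.

```python
# Totals from the "(all messages)" row of the table.
TOTAL_SPAM = 74113
TOTAL_HAM = 17601


def frequencies(spam_hits, ham_hits):
    """Return (overall%, spam%, ham%, S/O) for one rule's hit counts."""
    overall_pct = 100.0 * (spam_hits + ham_hits) / (TOTAL_SPAM + TOTAL_HAM)
    spam_pct = 100.0 * spam_hits / TOTAL_SPAM
    ham_pct = 100.0 * ham_hits / TOTAL_HAM
    # Assumption: S/O is taken over the per-class percentages, not raw counts.
    s_o = spam_pct / (spam_pct + ham_pct)
    return overall_pct, spam_pct, ham_pct, s_o


# Hit counts copied from the numeric table for two of the rules.
for name, spam_hits, ham_hits in [
    ("RM_bpt_longwords68a", 7429, 2),
    ("RM_bpt_longwords59a", 8753, 8),
]:
    overall_pct, spam_pct, ham_pct, s_o = frequencies(spam_hits, ham_hits)
    print(f"{overall_pct:7.3f} {spam_pct:8.4f} {ham_pct:8.4f} {s_o:8.3f}  {name}")
```

Working from percentages matters because the corpus is about 81% spam: a raw-count ratio for 59a (8753 of 8761) would round to 0.999, while the percentage-based figure is the 0.996 shown in the table.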

Scores are, of course, set for my threshold of 9.0 required hits, so
you'll probably want to lower them. Depending on your system, an initial
score of 0.5 or 1.0 for each rule might be worthwhile; you can then raise
the scores slowly if this spam continues to sneak past your system.

In my 19k corpus, one ham matches three of these rules, two of which I've
scored at 3.0, so that ham scores 7.0 against the 9.0 required. I may
reduce those rules to 2.5 or 2.0 instead of 3.0 once I complete my next
global mass-check. So yes, caution is advised.
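
The arithmetic behind that 7.0 can be sketched in a few lines of Python. The message does not name the three rules; the sketch assumes they are the three with nonzero HAM counts in the table (59a, 68a, 69a), which fits "two scored at 3.0" plus one at 1.0. The helper `total_score` and its `cap` parameter are hypothetical names used only for this illustration.

```python
REQUIRED_HITS = 9.0  # the threshold mentioned in the message

# Assumption: the ham hit the three rules that show nonzero HAM counts.
scores = {
    "RM_bpt_longwords59a": 3.0,
    "RM_bpt_longwords68a": 3.0,
    "RM_bpt_longwords69a": 1.0,
}


def total_score(rule_scores, cap=None):
    """Sum rule scores, lowering any score above `cap` to `cap` if given."""
    return sum(
        min(score, cap) if cap is not None else score
        for score in rule_scores.values()
    )


# Current scores, then the two reductions under consideration.
for cap in (None, 2.5, 2.0):
    total = total_score(scores, cap)
    margin = REQUIRED_HITS - total
    print(f"3.0 rules set to {cap or 3.0}: total {total:.1f}, margin {margin:.1f}")
```

Under these assumptions, dropping the two 3.0 rules to 2.5 or 2.0 leaves that ham at 6.0 or 5.0, which widens the margin below a 9.0 threshold but would still be at or above a lower one, so the caution applies doubly on systems with a lower threshold.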

Bob Menschel

body     RM_bpt_longwords68a /\b(?:[a-z]{6,}\s+){8}/
describe RM_bpt_longwords68a Long string of long words
score    RM_bpt_longwords68a 3.000  # 7429s/2h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list,
                                    # "improving compatibility between computer platforms demands certain levels "
body     RM_bpt_longwords69a /\b(?:[a-z]{6,}\s+){9}/
describe RM_bpt_longwords69a Long string of long words
score    RM_bpt_longwords69a 1.000  # type=max:1 (add to 59a,68a) - 6595s/1h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list
body     RM_bpt_longwords78a /\b(?:[a-z]{7,}\s+){8}/
describe RM_bpt_longwords78a Long string of long words
score    RM_bpt_longwords78a 2.000  # type=max:2 (add to 68a) - 4163s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords59a /\b(?:[a-z]{5,}\s+){9}/
describe RM_bpt_longwords59a Long string of long words
score    RM_bpt_longwords59a 3.000  # 8753s/8h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list
body     RM_bpt_longwords79a /\b(?:[a-z]{7,}\s+){9}/
describe RM_bpt_longwords79a Long string of long words
score    RM_bpt_longwords79a 1.000  # type=max:1 (add to 78a) - 2950s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords96a /\b(?:[a-z]{9,}\s+){6}/
describe RM_bpt_longwords96a Long string of long words
score    RM_bpt_longwords96a 4.000  # 1162s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords88a /\b(?:[a-z]{8,}\s+){8}/
describe RM_bpt_longwords88a Long string of long words
score    RM_bpt_longwords88a 4.000  # 1025s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords89a /\b(?:[a-z]{8,}\s+){9}/
describe RM_bpt_longwords89a Long string of long words
score    RM_bpt_longwords89a 1.000  # type=max:1 (add to 88a) - 590s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords97 /\b(?:\w{9,}\s+){7}/
describe RM_bpt_longwords97 Long string of long words
score    RM_bpt_longwords97 3.000  # 545s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords98 /\b(?:\w{9,}\s+){8}/
describe RM_bpt_longwords98 Long string of long words
score    RM_bpt_longwords98 1.000  # type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
describe RM_bpt_longwords99 Long string of long words
score    RM_bpt_longwords99 1.000  # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04
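
To see what these patterns actually catch, the following Python sketch runs three of them against the ham phrase quoted in the 68a comment. SpamAssassin itself uses Perl regexes on the rendered message body; these particular patterns should behave the same under Python's `re` module, but treat that as an assumption, and note the phrase is joined onto one line with single spaces here. The `RULES` dictionary and `matching_rules` helper are illustrative names only.

```python
import re

# Patterns copied from the rule definitions above.
RULES = {
    "RM_bpt_longwords68a": re.compile(r"\b(?:[a-z]{6,}\s+){8}"),
    "RM_bpt_longwords78a": re.compile(r"\b(?:[a-z]{7,}\s+){8}"),
    "RM_bpt_longwords59a": re.compile(r"\b(?:[a-z]{5,}\s+){9}"),
}


def matching_rules(text):
    """Return the sorted names of the rules whose pattern occurs in text."""
    return sorted(name for name, rx in RULES.items() if rx.search(text))


# The ham phrase from the 68a comment: eight lowercase words of 6+ letters,
# each followed by whitespace (including the trailing space).
ham_phrase = (
    "improving compatibility between computer "
    "platforms demands certain levels "
)
print(matching_rules(ham_phrase))

# An ordinary short sentence has nowhere near eight long words in a row.
print(matching_rules("Please send me the report by noon on Friday."))
```

The phrase trips 68a because all eight words have six or more letters, but not 78a ("levels" has only six) and not 59a (there are only eight words, and 59a needs nine). That shows how ordinary technical prose can land right on the edge of these rules.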







_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
