I've noticed several spam mails with a lot of quoted text (quotes from Dave Barry, some of Moby Dick, that sort of thing) included within brackets or an HTML title. Usually all punctuation is stripped out, but not always. The text is likely there to counterweight the message against a Bayesian filter, since most of the words also appear in ham. I made two rules to catch this. They don't seem likely to bring up false positives (though raising the title length threshold past 80 might be safer), and they work quite well against my corpus, but are there any problems I'm overlooking with this approach?
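As a rough sanity check outside SpamAssassin, the two patterns from the rules below can be tried in Python. This is only a sketch under a few assumptions: Python's `re` module stands in for the Perl regex engine SpamAssassin uses, the whole sample string is searched at once rather than the way `rawbody` feeds text to a rule, and the helper `matched_rules` and the sample strings are made up for illustration.

```python
import re

# Patterns copied verbatim from the two rawbody rules.
# Note: inside the character class, '-. is a range (apostrophe through
# period), so it also admits ( ) * + , and the hyphen.
TEXT_PADDING = re.compile(r"<(title>)?[ '-.,?!\w]{50,}>")
VERY_LONG_TITLE = re.compile(r"<title>[ '-.,?!\w]{80,}<\/title>")


def matched_rules(raw_body):
    """Return the names of the rules whose pattern is found in raw_body."""
    hits = []
    if TEXT_PADDING.search(raw_body):
        hits.append("L_Text_Padding_In_Html")
    if VERY_LONG_TITLE.search(raw_body):
        hits.append("L_Very_Long_Title")
    return hits


# Hypothetical padding text: 60 characters of words and spaces.
words = "call me ishmael some years ago never mind how long precisely"

samples = {
    "bracket padding": f"<{words}>",
    "unclosed title": f"<title>{words}>",
    "long title": f"<title>{words} {words}</title>",
    "short title": "<title>Weekly newsletter</title>",
}

for label, body in samples.items():
    print(label, "->", matched_rules(body))
```

In this sketch a title that is properly closed with `</title>` trips only the second pattern: `<` is not in the character class, so the first pattern needs a bare `>` directly after the padding.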
rawbody L_Text_Padding_In_Html /<(title>)?[ '-.,?!\w]{50,}>/
describe L_Text_Padding_In_Html Text padding within brackets or HTML title to fool bayesian filter
score L_Text_Padding_In_Html 3.0

rawbody L_Very_Long_Title /<title>[ '-.,?!\w]{80,}<\/title>/
describe L_Very_Long_Title HTML title longer than 80 characters to fool bayesian filter
score L_Very_Long_Title 1.0

Thanks,
sckot Vokes

--
"I wish I had a 2 liter of Pepsi in my box of replacement staples, so if they needed to quench their thirst, then they could ride the snake." -Kefka P

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk