Hello Pierre,

Saturday, January 17, 2004, 9:30:47 AM, you wrote:

PT> I made a rule that catches many of these bogus HTML tags, based
PT> on the fact that there are only three valid standalone tags of 9
PT> characters or more (according to the list at
PT> http://devedge.netscape.com/library/xref/2001/html-element/ ):

PT> # check for invalid HTML tags of 9 characters or more
PT> rawbody PT_BOGUS_HTML /\<\/?(?!(?:blockquote|optiongroup|plaintext))[a-z]{9,15}\>/
PT> describe PT_BOGUS_HTML random long words disguised as HTML tags
PT> score PT_BOGUS_HTML 1.0

It appears to be a decent rule from my mass-check:

> PT_BOGUS_HTML -- 9385s/18h of 92209 corpus (74874s/17335h) 01/17/04

Ham matches:

Valid email bounce message, with:
> For further assistance, please send mail to <postmaster>

YahooGroups mailing list email HTML seems to frequently include lines like:
> <fontfamily><param>arial</param><smaller>ADVERTISEMENT
> </smaller></fontfamily>
> <underline><color><param>1999,1999,FFFF</param>Yahoo! Terms of
> Service</color></underline>.</bigger></fixed></excerpt>

They're not standard HTML, but if they appear regularly in ham, the rule
should probably allow for them.

Also, HTML tags are valid regardless of case, e.g. <BLOCKQUOTE> is a valid
HTML tag, but your rule only considers lowercase tags.

So I'd recommend enhancing your rule to something like:

> rawbody PT_BOGUS_HTML /\<\/?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))[a-z]{9,15}\>/i

Do you see any problem with this? I'll be kicking off another mass-check on
this version soon.

Bob Menschel

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
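
For anyone who wants to poke at the two versions of PT_BOGUS_HTML outside of SpamAssassin, here is a rough sketch in Python. It assumes Python's re module treats this particular lookahead pattern the same way Perl does (it should, for a pattern this simple), and it writes the rule's \< and \> as plain < and >. The names ORIGINAL, REVISED, SAMPLES and hits() are made up for this illustration, and the two "qwertyuiopas" tags are invented stand-ins for spam tags, not taken from any corpus.

```python
import re

# Original rule: lowercase only, three whitelisted tag names.
ORIGINAL = re.compile(
    r"</?(?!(?:blockquote|optiongroup|plaintext))[a-z]{9,15}>"
)

# Suggested revision: adds fontfamily/underline and the /i flag.
REVISED = re.compile(
    r"</?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))[a-z]{9,15}>",
    re.IGNORECASE,
)

# Lines from the ham hits quoted above, plus two invented spam-like tags.
SAMPLES = [
    "<blockquote>",
    "<BLOCKQUOTE>",
    "<fontfamily><param>arial</param><smaller>ADVERTISEMENT",
    "Service</color></underline>.</bigger></fixed></excerpt>",
    "For further assistance, please send mail to <postmaster>",
    "<qwertyuiopas>",  # hypothetical bogus tag
    "<QWERTYUIOPAS>",  # same tag in upper case
]


def hits(pattern, text):
    """Return every substring of text that the rule pattern would hit."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample)
        print("  original:", hits(ORIGINAL, sample))
        print("  revised: ", hits(REVISED, sample))
```

Running it shows the fontfamily and underline lines dropping out of the revised rule, and the upper-case bogus tag being caught only once /i is added. It also shows that the bounce message's <postmaster> still hits both versions, so that ham match would need its own exception if it matters.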