Hello Pierre,

Saturday, January 17, 2004, 9:30:47 AM, you wrote:

PT> I made a rule that catches many of these bogus HTML tags, based
PT> on the fact that there are only three valid standalone tags of 9
PT> characters or more (according to the list at
PT> http://devedge.netscape.com/library/xref/2001/html-element/ ):

PT> # check for invalid HTML tags of 9 characters or more

PT> rawbody PT_BOGUS_HTML /\<\/?(?!(?:blockquote|optiongroup|plaintext))[a-z]{9,15}\>/
PT> describe PT_BOGUS_HTML   random long words disguised as HTML tags
PT> score PT_BOGUS_HTML      1.0

It appears to be a decent rule from my mass-check:
> PT_BOGUS_HTML -- 9385s/18h of 92209 corpus (74874s/17335h) 01/17/04
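
To put those counts in proportion, here is a minimal Python sketch of the arithmetic. It assumes the usual mass-check reading of the line above (9385 spam hits and 18 ham hits, against 74874 spam and 17335 ham messages). The `hit_rate` helper and the "share of hits" figure are purely illustrative and are not SpamAssassin's own statistics.

```python
# Counts copied from the mass-check summary line above.
spam_hits, ham_hits = 9385, 18
spam_total, ham_total = 74874, 17335


def hit_rate(hits, total):
    """Percentage of messages in a corpus that the rule hit (illustrative helper)."""
    return 100.0 * hits / total


spam_rate = hit_rate(spam_hits, spam_total)
ham_rate = hit_rate(ham_hits, ham_total)

# Share of all rule hits that landed on spam. This is an illustrative
# metric only, not the S/O figure that SpamAssassin's own tools report.
spam_share = spam_hits / (spam_hits + ham_hits)

print(f"spam hit rate: {spam_rate:.2f}%")
print(f"ham hit rate:  {ham_rate:.2f}%")
print(f"share of hits that are spam: {spam_share:.4f}")
```

The rule fires on roughly one spam message in eight and on about one ham message in a thousand.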

Ham matches:

Valid email bounce message, with:
> For further assistance, please send mail to <postmaster>

HTML email from YahooGroups mailing lists frequently seems to include
lines like:
>   <fontfamily><param>arial</param><smaller>ADVERTISEMENT
> </smaller></fontfamily>
> <underline><color><param>1999,1999,FFFF</param>Yahoo! Terms of
> Service</color></underline>.</bigger></fixed></excerpt>

They're not standard HTML, but if they appear regularly in ham, the rule
should probably allow for them.

Also, HTML tags are valid regardless of case, e.g. <BLOCKQUOTE> is a
valid HTML tag, but your rule only handles lowercase tags.

So I'd recommend enhancing your rule to something like:
> rawbody PT_BOGUS_HTML /\<\/?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))[a-z]{9,15}\>/i

Do you see any problem with this?
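
As a rough sanity check outside SpamAssassin, the sketch below runs both patterns over the ham samples quoted above using Python's `re` module. It assumes Python's regex syntax is close enough to Perl's for this particular pattern; real `rawbody` matching in SpamAssassin may differ. The `hits` helper and the all-caps bogus tag are hypothetical, added only for illustration.

```python
import re

# Original rule pattern (lowercase only, three excluded tags).
ORIGINAL = re.compile(
    r"\<\/?(?!(?:blockquote|optiongroup|plaintext))[a-z]{9,15}\>"
)
# Revised pattern: two more excluded tags, plus case-insensitive matching (/i).
REVISED = re.compile(
    r"\<\/?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))[a-z]{9,15}\>",
    re.IGNORECASE,
)

SAMPLES = {
    "bounce": "For further assistance, please send mail to <postmaster>",
    "yahoo_1": "<fontfamily><param>arial</param><smaller>ADVERTISEMENT",
    "yahoo_2": "Service</color></underline>.</bigger></fixed></excerpt>",
    "upper_valid": "<BLOCKQUOTE>",
    "upper_bogus": "<QWERTYUIOPAS>",  # hypothetical made-up tag
}


def hits(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


for name, text in SAMPLES.items():
    print(f"{name:12} original={hits(ORIGINAL, text)} revised={hits(REVISED, text)}")
```

Under these assumptions the revised pattern no longer matches the two YahooGroups lines and starts catching upper-case bogus tags, but it still matches `<postmaster>` in the bounce message.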

I'll be kicking off another mass-check on this version soon.

Bob Menschel





_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
