Bob,

Thanks for the mass check.  I don't have a big corpus handy, just what trickles 
through the gateway.

There should be no problem with a few extra keywords; we could even squeeze
"postmaster" in there for good measure, though rules that wrap across lines
sometimes cause grief for text downloads.

>Also, the valid HTML tags are valid regardless of case, eg: <BLOCKQUOTE> is a valid 
>HTML tag, but excluded in your rule.

My original test looks only for lowercase strings, so there is no need to make
exceptions for valid uppercase tags.  So far I have seen only lowercase bogus tags in
spam.  How does an overall /i modifier affect the negative lookahead, anyhow?  Will
your version match <HAIRSPRAY> and not match <BLOCKQUOTE>?
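
One way to check this outside SpamAssassin is the small Python sketch below. It assumes that Python's re module, with re.IGNORECASE standing in for Perl's /i, treats this particular pattern the same way Perl does; that is worth re-checking under Perl itself. The names ORIGINAL, WITH_I and hits are made up for the illustration.

```python
import re

# Pierre's original pattern (case-sensitive).
ORIGINAL = re.compile(
    r"\<\/?(?!(?:blockquote|optiongroup|plaintext))[a-z]{9,15}\>"
)

# Bob's variant; re.IGNORECASE is used here as a stand-in for Perl's /i.
WITH_I = re.compile(
    r"\<\/?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))[a-z]{9,15}\>",
    re.IGNORECASE,
)

def hits(pattern, text):
    """Return True if the rule pattern fires anywhere in text."""
    return pattern.search(text) is not None

# Print how each version treats lowercase and uppercase tags.
for tag in ("<hairspray>", "<HAIRSPRAY>", "<blockquote>", "<BLOCKQUOTE>"):
    print(tag, "original:", hits(ORIGINAL, tag), "with /i:", hits(WITH_I, tag))
```

Under those assumptions the /i version fires on <HAIRSPRAY> and stays quiet on <BLOCKQUOTE>, because the flag applies to the alternatives inside the lookahead as well as to the [a-z] class.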

Pierre




-----Original Message-----
From: Robert Menschel [mailto:[EMAIL PROTECTED]]
Sent: Saturday, January 17, 2004 9:02 PM
To: Pierre Thomson
Cc: [EMAIL PROTECTED]
Subject: Re: [SAtalk] Re: Filter rule f. invalid HTML tags?


Hello Pierre,

Saturday, January 17, 2004, 9:30:47 AM, you wrote:

PT> I made a rule that catches many of these bogus HTML tags, based
PT> on the fact that there are only three valid standalone tags of 9
PT> characters or more (according to the list at
PT> http://devedge.netscape.com/library/xref/2001/html-element/ ):

PT> # check for invalid HTML tags of 9 characters or more

PT> rawbody PT_BOGUS_HTML /\<\/?(?!(?:blockquote|optiongroup|plaintext))[a-z]{9,15}\>/
PT> describe PT_BOGUS_HTML   random long words disguised as HTML tags
PT> score PT_BOGUS_HTML      1.0

It appears to be a decent rule from my mass-check:
> PT_BOGUS_HTML -- 9385s/18h of 92209 corpus (74874s/17335h) 01/17/04
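
To put the mass-check line in proportion, the arithmetic can be sketched as follows. The counts are copied from the line above; the S/O formula, spam% / (spam% + ham%), is an assumption about how that ratio is usually computed, and the variable names are made up for the illustration.

```python
# Figures copied from the mass-check line quoted above.
spam_hits, ham_hits = 9385, 18
spam_total, ham_total = 74874, 17335

# Share of each corpus that the rule hit, as a percentage.
spam_pct = 100.0 * spam_hits / spam_total
ham_pct = 100.0 * ham_hits / ham_total   # false positives

# Assumed definition of S/O: spam% / (spam% + ham%).
s_o = spam_pct / (spam_pct + ham_pct)

print(f"spam hit rate: {spam_pct:.2f}%")
print(f"ham hit rate:  {ham_pct:.3f}%")
print(f"S/O:           {s_o:.3f}")
```

The rule hits roughly 12.5% of the spam and about 0.1% of the ham in this corpus.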

Ham matches:

Valid email bounce message, with:
> For further assistance, please send mail to <postmaster>

YahooGroups mailing list email HTML seems to frequently include lines
like:
>   <fontfamily><param>arial</param><smaller>ADVERTISEMENT
> </smaller></fontfamily>
> <underline><color><param>1999,1999,FFFF</param>Yahoo! Terms of
> Service</color></underline>.</bigger></fixed></excerpt>

They're not standard HTML, but if they appear regularly in ham, the rule
should probably allow for them.

Also, the valid HTML tags are valid regardless of case, eg: <BLOCKQUOTE>
is a valid HTML tag, but excluded in your rule.

So I'd recommend enhancing your rule to something like:
> rawbody PT_BOGUS_HTML /\<\/?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))[a-z]{9,15}\>/i

Do you see any problem with this?
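
A rough way to try the enhanced pattern against the ham fragments quoted earlier in this message is sketched below in Python. It assumes re.IGNORECASE behaves like Perl's /i for this pattern, and it joins the wrapped sample lines into single strings; ENHANCED, HAM_SAMPLES and matched_tags are names invented for the example.

```python
import re

# The enhanced pattern; re.IGNORECASE stands in for Perl's /i (assumption).
ENHANCED = re.compile(
    r"\<\/?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))[a-z]{9,15}\>",
    re.IGNORECASE,
)

# Ham fragments from this message, with the mail line wraps joined.
HAM_SAMPLES = [
    "For further assistance, please send mail to <postmaster>",
    "<fontfamily><param>arial</param><smaller>ADVERTISEMENT"
    "</smaller></fontfamily>",
    "<underline><color><param>1999,1999,FFFF</param>Yahoo! Terms of "
    "Service</color></underline>.</bigger></fixed></excerpt>",
]

def matched_tags(text):
    """Return every substring of text that the rule pattern matches."""
    return [m.group(0) for m in ENHANCED.finditer(text)]

# Show which tags, if any, still trigger the rule in each sample.
for sample in HAM_SAMPLES:
    print(matched_tags(sample))
```

In this sketch only the <postmaster> bounce line still fires. The shorter tags such as <param> and <smaller> fall below the nine-character minimum, and adding "postmaster" to the alternation would cover the remaining hit.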

I'll be kicking off another mass-check on this version soon.

Bob Menschel





_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
