Hello Pierre,

Saturday, January 17, 2004, 6:28:37 PM, you wrote:

PT> Bob,

PT> Thanks for the mass check.  I don't have a big corpus handy,
PT> just what trickles through the gateway.

PT> There should be no problem with a few extra keywords; we could
PT> even squeeze "postmaster" in there for good measure, though rules
PT> which line-wrap sometimes cause grief for text downloads.

>>Also, valid HTML tags are valid regardless of case, e.g.:
>><BLOCKQUOTE> is a valid HTML tag, but is excluded in your rule.

PT> My original test only looks for lowercase strings, so there is
PT> no need to make exceptions for valid uppercase tags.  So far I have
PT> only seen lowercase bogus tags in spam.  How does an overall /i
PT> modifier affect inverse matches anyhow?  Will your version match
PT> <HAIRSPRAY> and not match <BLOCKQUOTE> ?

Good point.  I've removed the /i from my copy.  Results:

rawbody  PT_BOGUS_HTML   /\<\/?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))[a-z]{9,15}\>/
describe PT_BOGUS_HTML   random long words disguised as HTML tags
score    PT_BOGUS_HTML   4.000  # 9628s/2h of 92209 corpus (74874s/17335h) 01/17/04
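
On the question of how /i interacts with the negative lookahead, here is a rough sketch rather than anything run through SpamAssassin itself. It uses Python's re module as a stand-in for Perl, on the assumption that the two engines treat this particular pattern the same way. The sample tags and the helper name hits() are made up for illustration.

```python
import re

# Pattern body copied from the PT_BOGUS_HTML rule (without the / delimiters).
PATTERN = (
    r"\<\/?(?!(?:blockquote|optiongroup|plaintext|fontfamily|underline))"
    r"[a-z]{9,15}\>"
)

without_i = re.compile(PATTERN)                # rule as posted, no /i
with_i = re.compile(PATTERN, re.IGNORECASE)    # same rule with /i


def hits(regex, text):
    """Return True if the rule pattern matches anywhere in text."""
    return regex.search(text) is not None


# Hypothetical sample tags, not taken from any corpus.
samples = [
    "<hairspray>",      # lowercase bogus tag
    "<HAIRSPRAY>",      # uppercase bogus tag
    "<blockquote>",     # excluded lowercase tag
    "<BLOCKQUOTE>",     # excluded uppercase tag
    "</blockquote>",    # closing form of an excluded tag
    "<underlinexyz>",   # starts with an excluded word, so it is skipped too
]

for tag in samples:
    print(f"{tag:16} no /i: {hits(without_i, tag)!s:5}  /i: {hits(with_i, tag)}")
```

If that stand-in holds, then with /i the lookahead and the character class both become case-insensitive, so <HAIRSPRAY> would match and <BLOCKQUOTE> would not. Without /i, neither uppercase tag matches, because [a-z] only accepts lowercase letters. The lookahead is also only a prefix test, so any tag that merely begins with one of the listed words is skipped as well.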

Bob Menschel





_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
