Larry Gilson <[EMAIL PROTECTED]> wrote:

>   full  MY_FULL_OBFU_HTML  /[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+/

It seems to me that you'd want to catch the obfuscating pseudo-
comments with '!' as well.  Have you tried it with '[^>]' as 
the character class, so that you'll match regardless of what's 
in the angle brackets?

Also, why do you require whitespace or '>' before the first 
sequence of word characters?  What if there's a '-' or a '(' 
there instead?  Have you tried leaving it off completely?  In 
that case the '+' after the first '\w' is unnecessary (in fact, 
the '+' after the last '\w' isn't doing anything even now).  
Then the regex would look like this:

   /\w<[^>]{1,6}>\w/
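
To make the difference concrete, here is a rough Python sketch that 
runs both patterns over a few sample strings.  The samples are made 
up for illustration, and Python's re module stands in for Perl here 
(the two behave the same for these particular patterns, as far as I 
know).

```python
import re

# Rule body from the quoted message.
ORIGINAL = re.compile(r'[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+')
# Simplified pattern suggested above.
SIMPLIFIED = re.compile(r'\w<[^>]{1,6}>\w')

# Hypothetical sample strings, invented for this comparison.
SAMPLES = {
    "plain_tag": " fr<xyz>ee",             # bogus tag inside a word
    "comment": " fr<!--x-->ee",            # pseudo-comment with '!'
    "paren": "(fr<xyz>ee",                 # '(' instead of whitespace
    "legit_html": " some<b>bold</b>text",  # ordinary markup
}

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return pattern.search(text) is not None

# Map each sample name to (original_hit, simplified_hit).
results = {
    name: (matches(ORIGINAL, text), matches(SIMPLIFIED, text))
    for name, text in SAMPLES.items()
}

for name, (orig_hit, simple_hit) in results.items():
    print(f"{name:11} original={orig_hit!s:5} simplified={simple_hit}")
```

The original misses the "comment" and "paren" samples, while the 
simplified pattern catches all four.  Both hit "legit_html", which 
is ordinary markup rather than obfuscation.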

I still think you're going to get too many FPs, though.  This 
problem may be something better tackled during the HTML 
analysis.  There could be a counter for bad tags (perhaps 
separate ones for tags that are illegally formed and those that 
are simply unrecognized).  Then a series of eval tests could 
use the count.  Avoiding FPs for XML documents could be a 
problem though.
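
Roughly what I have in mind for the counters, as a Python sketch 
only.  This is not how SpamAssassin's HTML code works; the 
KNOWN_TAGS list, the count_bad_tags name, and the notion of 
"malformed" used here are all assumptions made for illustration.

```python
import re

# Deliberately tiny set of tag names treated as recognized (an assumption;
# a real list would be much longer).
KNOWN_TAGS = {"a", "b", "body", "br", "font", "html", "i", "p",
              "table", "td", "tr"}

# Anything between '<' and the next '>'.
TAG_RE = re.compile(r'<([^<>]*)>')
# Optional leading '/', a tag name, optional attributes, optional trailing '/'.
WELL_FORMED_RE = re.compile(r'/?([A-Za-z][A-Za-z0-9]*)(\s[^<>]*)?/?$')

def count_bad_tags(html):
    """Count illegally formed and unrecognized tags separately."""
    counts = {"malformed": 0, "unrecognized": 0}
    for body in TAG_RE.findall(html):
        if body.startswith(("!", "?")):
            continue  # comments, doctypes, processing instructions: not counted here
        m = WELL_FORMED_RE.match(body)
        if m is None:
            counts["malformed"] += 1
        elif m.group(1).lower() not in KNOWN_TAGS:
            counts["unrecognized"] += 1
    return counts

spam_counts = count_bad_tags("V<xyz>i<$1>a<b>gra</b>")
xml_counts = count_bad_tags("<invoice><item>1</item></invoice>")
print(spam_counts, xml_counts)
```

An eval test could then compare either count against a threshold.  
The XML sample shows the FP risk: every tag in it is well formed 
but counts as unrecognized.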

> To try to curb the FPs for tests within the {1,5} range, I will experiment
> with the following rule:
> 
>   full  MY_FULL_OBFU_HTML  /([\s>]\w+<[\w\s\/\$&;]{1,6}>\w+){2,}/

That will only match when one word is interrupted by more than 
one obfuscating pseudo-tag.

-- 
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
