Larry Gilson <[EMAIL PROTECTED]> wrote:
> full MY_FULL_OBFU_HTML /[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+/

It seems to me that you'd want to catch the obfuscating pseudo-comments with '!' as well. Have you tried it with '[^>]' as the character class, so that you'll match regardless of what's in the angle brackets?

Also, why do you require whitespace or '>' before the first sequence of word characters? What if there's a '-' or a '(' there instead? Have you tried leaving it off completely? In that case the '+' after the first '\w' is unnecessary (in fact, the '+' after the last '\w' isn't doing anything now). Then the regex would look like this:

  /\w<[^>]{1,6}>\w/

I still think you're going to get too many FPs, though. This problem may be something better tackled during the HTML analysis. There could be a counter for bad tags (perhaps separate ones for tags that are illegally formed and those that are simply unrecognized). Then a series of eval tests could use the count. Avoiding FPs for XML documents could be a problem, though.

> To try to curb the FPs for tests within the {1,5} range, I will experiment
> with the following rule:
>
> full MY_FULL_OBFU_HTML /([\s>]\w+<[\w\s\/\$&;]{1,6}>\w+){2,}/

That will only match when two or more obfuscated words follow one another directly, because each repetition of the group has to begin with whitespace or '>'. A single word interrupted by more than one obfuscating pseudo-tag still won't match.

-- 
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
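
For anyone who wants to try the patterns from this thread side by side, here is a minimal sketch in Python. It assumes Python's re module treats these particular patterns the same way Perl does (SpamAssassin itself compiles them as Perl regexes), and the sample strings are made up for illustration, not taken from real mail.

```python
import re

# Patterns copied from the thread. Python's re is used only for illustration;
# SpamAssassin would compile these as Perl regexes.
ORIGINAL = re.compile(r'[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+')
SIMPLIFIED = re.compile(r'\w<[^>]{1,6}>\w')
REPEATED = re.compile(r'([\s>]\w+<[\w\s\/\$&;]{1,6}>\w+){2,}')

# Hypothetical sample strings (assumptions, not real spam).
SAMPLES = [
    " Vi<b>agra",            # plain pseudo-tag inside a word
    " Vi<!x>agra",           # pseudo-comment containing '!'
    "(Vi<b>agra",            # word preceded by '(' instead of whitespace
    "some <b>bold</b>text",  # legitimate HTML, a likely false positive
    " Vi<b>ag<i>ra",         # one word, two pseudo-tags
    " Vi<b>agra ci<i>alis",  # two obfuscated words back to back
]


def hits(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


for sample in SAMPLES:
    print(f"{sample!r:26}"
          f" original={hits(ORIGINAL, sample)}"
          f" simplified={hits(SIMPLIFIED, sample)}"
          f" repeated={hits(REPEATED, sample)}")
```

The legitimate "some <b>bold</b>text" sample is hit by both the original and the simplified pattern, which is the kind of FP that a bad-tag counter in the HTML analysis would be better placed to avoid.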