> -----Original Message-----
> From: Keith C. Ivey

Thanks for the reply, Keith, and sorry for the long delay in my response.

> Larry Gilson <[EMAIL PROTECTED]> wrote:
>
> > full MY_FULL_OBFU_HTML /[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+/
>
> It seems to me that you'd want to catch the obfuscating
> pseudo-comments with '!' as well. Have you tried it with '[^>]' as
> the character class, so that you'll match regardless of what's
> in the angle brackets?

The pseudo-comments are captured in a different rule. The obfuscating
HTML tags produce a number of false positives in the {1,6} range that
the pseudo-comments do not, so the pseudo-comment rule will cover
{1,150} while the HTML rule covers {6,150}. I am experimenting with
ways to reduce the FPs in the {1,6} range.

> Also, why do you require whitespace or '>' before the first
> sequence of word characters? What if there's a '-' or a '('
> there instead? Have you tried leaving it off completely, in
> which case the '+' after the '\w' is unnecessary (in fact, the
> '+' after the last '\w' isn't doing anything now). Then the
> regex would look like this:
>
> /\w<[^>]{1,6}>\w/

I tried that and got more FPs than I wanted. Starting the match with
[\s>] just about eliminates the FPs without reducing the effectiveness.

> I still think you're going to get too many FPs, though. This
> problem may be something better tackled during the HTML
> analysis. There could be a counter for bad tags (perhaps
> separate ones for tags that are illegally formed and those that
> are simply unrecognized). Then a series of eval tests could
> use the count. Avoiding FPs for XML documents could be a
> problem though.

It sounds to me as though the implementation you are describing would
require meta rules that compare multiple pattern matches, rather than
trying to force one pattern to match. I believe you are correct. I am
glad you mentioned XML. I have not had a problem with XML yet, but I
would not doubt it is close at hand.
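
For anyone who wants to poke at the difference between the two patterns
discussed above, here is a rough sketch in Python rather than in
SpamAssassin itself. The sample strings are made up for illustration and
are not from any real corpus, and the names ANCHORED, LOOSE, SAMPLES and
compare are my own. It only shows which pattern fires on which string;
it says nothing about real-world FP rates.

```python
import re

# Larry's rule: whitespace or '>' must precede the first run of word
# characters, and only a limited character set may sit inside the brackets.
ANCHORED = re.compile(r'[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+')

# Keith's suggestion: no leading anchor, anything except '>' in the brackets.
LOOSE = re.compile(r'\w<[^>]{1,6}>\w')

# Hypothetical sample strings (assumptions, not real spam or ham).
SAMPLES = [
    " Vi<xyz>agra",   # obfuscated word after whitespace
    "(Vi<xyz>agra",   # same word, but preceded by '('
    " Vi<!x>agra",    # pseudo-comment style tag containing '!'
    " no tags here",  # plain text
]

def compare(samples):
    """Map each sample to (anchored_hit, loose_hit)."""
    return {s: (bool(ANCHORED.search(s)), bool(LOOSE.search(s)))
            for s in samples}

if __name__ == "__main__":
    for text, hits in compare(SAMPLES).items():
        print(repr(text), hits)
```

On these samples the looser pattern also fires on the '(' and '!' cases
that the anchored one skips, which is the trade-off between coverage and
FPs described above.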

> > To try to curb the FPs for tests within the {1,5} range, I will
> > experiment with the following rule:
> >
> > full MY_FULL_OBFU_HTML /([\s>]\w+<[\w\s\/\$&;]{1,6}>\w+){2,}/
>
> That will only match when one word is interrupted by more than
> one obfuscating pseudo-tag.

I guess I was hoping that I could match two obfuscated words listed
sequentially. While I may find a legitimate:

    Sincerely<br>George Banks<br>

I never find:

    Sincerely<br>George Banks<BR>President

However, I do realize that there are more situations than just what I
experience. You are probably on the right track in that an HTML
analysis is needed. I would imagine that we are headed down the road
of producing an eval test rather than just a series of rules or even
meta rules.

Thanks for your input Keith!

--Larry

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk