Hi Larry,

This seems to work and matches what I think you're trying to do, but it's otherwise untested (it matched 'w<p§R>ork'):
/[>\s]\w{1,7}<\/?\s?[\w\s]{1,20}\S{1}[\w\s]{1,20}\/?\s?>\w{1,7}\W/i

I'd test it out before I gave it much of a score. Maybe you can do
something with that. There is a rule in SpamAssassin that searches for
an obfu comment; you could also try editing that rule to do what you're
working on if this isn't working out.

I like the additive nature of the set, and eventually I'll figure out
how to get the set to be a little more far-reaching in the tags, but for
now I need a break :) This stuff is all very new to me and my brain says
no mas!

...yeah, Keith's explanation was nice!! Like music.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Larry Gilson
Sent: Sunday, October 12, 2003 12:18 AM
To: [EMAIL PROTECTED]
Subject: RE: [SAtalk] Popcorn, Backhair, and Weeds

Hi Keith,

Thanks for your time and knowledge! The explanation of your rule
contraction is excellent and really helped me understand the process
much better.

This exercise has been a very good learning experience for me thanks to
Jennifer Wheeler. She authored a set of rules
(http://spamhammers.nxtek.net) to catch word obfuscation by HTML tags,
script tags, and HTML encoding. I have been trying to reduce the number
of rules and find possible holes.

I think that all who have contributed to this thread are to credit for
the ground gained. Your insight has really helped!

Thanks Again,
Larry

> -----Original Message-----
> From: Keith C. Ivey
> Sent: Saturday, October 11, 2003 6:02 PM
> To: [EMAIL PROTECTED]
> Subject: RE: [SAtalk] Popcorn, Backhair, and Weeds
>
>
> Larry Gilson <[EMAIL PROTECTED]> wrote:
>
> > I had the following HTML tag OBFU rule (variant of yours):
> > /(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)/
>
> There's a lot of clutter in that that makes it harder to
> follow. Let's try paring it down.
> First, '<' and '>' are not
> special on their own in regexes, so there's no need to
> backslash them:
>
>   /(>|\s)\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?(\s|\W|<)/
>
> When you have an alternation -- something like '(a|b|c)' --
> where all the alternatives are single characters, it's better
> to write it as a character class -- something like '[abc]'.
> Also, '\s' and '<' are both included in '\W', so that last
> alternation is equivalent to just '\W':
>
>   /[>\s]\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?\W/
>
> Now, nongreedy matching serves no purpose when the thing
> following it can't be matched by the thing being repeated. In
> this case you have '\w{1,5}?' followed by '<', but '<' can't
> match '\w', so there's no difference between greedy and
> nongreedy matching there. The matching for the series of '\w'
> characters has to go all the way to the '<' -- it can't stop
> short. Similarly, the '\W' at the end can never match the '\w'
> preceding it, so that '?' is also pointless:
>
>   /[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W/
>
> That regex is equivalent to your original one, and may help you
> see better why it's not matching as you expect. It's looking
> for
>
>   a '>' or whitespace character (space, tab, carriage return,
>     line feed, form feed),
>   followed by 1 to 5 word characters (letters, numbers, and
>     underscores),
>   followed by '<',
>   followed by an optional '/',
>   followed by an optional single whitespace character,
>   followed by 6 to 150 word or whitespace characters,
>   followed by an optional '/',
>   followed by an optional single whitespace character,
>   followed by '>',
>   followed by 1 to 7 word characters,
>   followed by a nonword character (anything other than
>     letters, numbers, and underscore).
>
> I'm not clear on what you want to match, but that's probably
> not it.
>
> --
> Keith C. Ivey <[EMAIL PROTECTED]>
> Washington, DC

-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
SourceForge.net hosts over 70,000 Open Source Projects.
See the people who have HELPED US provide better services:
Click here: http://sourceforge.net/supporters.php
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
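
For anyone who wants to try the rules from this thread outside SpamAssassin, here is a rough sketch in Python rather than Perl. It spot-checks that the four forms of Larry's rule quoted above find the same match on a few strings, and it runs the revised rule from the top of this message on the same strings. The sample strings are made up for illustration and are not from any real spam corpus. 'w<p§R>ork' is padded with spaces here because both rules need a leading '>' or whitespace and a trailing nonword character. Python's re module handles these particular patterns the same way Perl does as far as I know, but treat that as an assumption and test in Perl before scoring anything.

```python
import re

# The four forms quoted in the thread, from Larry's original rule
# to Keith's final pared-down version.
STEPS = [
    r"(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)",
    r"(>|\s)\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?(\s|\W|<)",
    r"[>\s]\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?\W",
    r"[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W",
]

# The revised rule from the top of this message (the /i flag
# becomes re.IGNORECASE).
REVISED = re.compile(
    r"[>\s]\w{1,7}<\/?\s?[\w\s]{1,20}\S{1}[\w\s]{1,20}\/?\s?>\w{1,7}\W",
    re.IGNORECASE,
)

# Hypothetical test strings, invented for this sketch.
SAMPLES = [
    " vi<font color>agra ",      # long tag body splitting a word
    ">me<span class>ds!",        # starts with '>' instead of whitespace
    " vi<b>agra ",               # tag body shorter than 6 characters
    " w<p§R>ork ",               # the string mentioned above, padded
    "plain text with no tags",
]


def span(pattern, text):
    """Return the (start, end) of the first match, or None."""
    m = re.search(pattern, text)
    return m.span() if m else None


def steps_agree(text):
    """True if all four forms find the same match (or none) in text."""
    return len({span(p, text) for p in STEPS}) == 1


for text in SAMPLES:
    print(
        repr(text),
        "| forms agree:", steps_agree(text),
        "| pared-down rule hits:", span(STEPS[-1], text) is not None,
        "| revised rule hits:", REVISED.search(text) is not None,
    )
```

Agreement on a handful of strings is only a sanity check, not a proof of equivalence; Keith's reasoning above is the real argument. The short-tag sample shows why the {6,150} rule misses things like '<b>': the tag body has to be at least six word or whitespace characters. The revised rule needs at least three characters inside the tag, so it misses '<b>' as well.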