Hi Keith,

Au contraire. That is exactly it. That explanation was beautiful! (I long for your brain. :) ) Thank you for taking the time to make that so clear!

The rules actually work, but I suspected they were filled with garbage. Thanks for cleaning them up! I'll put your shorn version on the page.

http://spamhammers.nxtek.net

Maybe you could peek at them and get a better idea of what we're trying to do. There are examples of what they match (which is what you describe below, other than the range Larry changed): hidden tags littered through the body. And they add up in spam.

The problem Larry is working on is one I couldn't figure out when I "wrote" these abominations, which is this: I didn't know how to match '[\w\s]{,150}', the '<hidden junk tag obscuring the word>', and miss legitimate tags such as <b>, <li>, or any other tag up to 6 letters. I was afraid of FPs if I didn't set that high enough. As it is, it hits on <center>, but scores low enough that I settled on that hit just to catch more occurrences.

A second question I have is how to include all the characters they mix into the junk tag, such as '<g$b>', without breaking the rule. I tried \S in my ignorance, and realized it would run on into later hidden tags before it stopped matching. I only saw the rules hitting on spams (written that way), but they were, in my opinion, out of control, and I didn't wait to find out if they hit ham as well. I changed them back and settled for what I had.

Third and final: it seems to me that the two sets (popcorn and backhair) could be combined into one ruleset by someone who understands this better than I do, which is most likely any creature that has the ability to manipulate a keyboard. I tried to combine them, but decided to go ahead and post them since they do work as is. This doesn't matter to me really. I made the second set only because I couldn't figure out any other way to match both examples in that link.

The rules work great, but I would love it if you or someone else could tweak them to match smaller tags (like in the following) and miss real tags.

  No ne<k$h>ed t<k^t>o drea<k$t>m...you can now <b>ex<kv>pan<kv$d>d you<k-l>r john<kg>son up t<z*l>o 3 in<ib>ch<k$j>e<k$a>s

And if not, that is okay too. I'm satisfied with what they're giving me now. I only posted these at the urging of a friend and after seeing how much they were helping out with a sudden boatload of spam breezing through. The link above may shed more light if I didn't make this clear and you would like to see the set.

Thanks again for the great explanation!! wow

Jennifer

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Keith C. Ivey
Sent: Saturday, October 11, 2003 5:02 PM
To: [EMAIL PROTECTED]
Subject: RE: [SAtalk] Popcorn, Backhair, and Weeds

Larry Gilson <[EMAIL PROTECTED]> wrote:

> I had the following HTML tag OBFU rule (variant of yours):
> /(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)/

There's a lot of clutter in that that makes it harder to follow. Let's try paring it down. First, '<' and '>' are not special on their own in regexes, so there's no need to backslash them:

  /(>|\s)\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?(\s|\W|<)/

When you have an alternation -- something like '(a|b|c)' -- where all the alternatives are single characters, it's better to write it as a character class -- something like '[abc]'. Also, '\s' and '<' are both included in '\W', so that last alternation is equivalent to just '\W':

  /[>\s]\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?\W/

Now, nongreedy matching serves no purpose when the thing following it can't be matched by the thing being repeated. In this case you have '\w{1,5}?' followed by '<', but '<' can't match '\w', so there's no difference between greedy and nongreedy matching there. The matching for the series of '\w' characters has to go all the way to the '<' -- it can't stop short. Similarly, the '\W' at the end can never match the '\w' preceding it, so that '?'
is also pointless:

  /[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W/

That regex is equivalent to your original one, and may help you see better why it's not matching as you expect. It's looking for

  - a '>' or whitespace character (space, tab, carriage return, line feed, form feed),
  - followed by 1 to 5 word characters (letters, numbers, and underscores),
  - followed by '<',
  - followed by an optional '/',
  - followed by an optional single whitespace character,
  - followed by 6 to 150 word or whitespace characters,
  - followed by an optional '/',
  - followed by an optional single whitespace character,
  - followed by '>',
  - followed by 1 to 7 word characters,
  - followed by a nonword character (anything other than letters, numbers, and underscore).

I'm not clear on what you want to match, but that's probably not it.

--
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC

-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
SourceForge.net hosts over 70,000 Open Source Projects.
See the people who have HELPED US provide better services:
Click here: http://sourceforge.net/supporters.php
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
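
For anyone who wants to check the simplification steps in Keith's message rather than take them on faith, here is a minimal sketch. It uses Python's `re` module instead of Perl, on the assumption that with `re.ASCII` the classes `\w`, `\s` and `\W` behave closely enough to Perl's for these patterns. The first sample string comes from Jennifer's example; the other samples and the helper names `match_spans` and `all_forms_agree` are made up for illustration.

```python
import re

# The four forms of the rule, in the order Keith derives them.
STEPS = [
    # Larry's original rule
    r"(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)",
    # backslashes dropped from '<' and '>'
    r"(>|\s)\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?(\s|\W|<)",
    # single-character alternations as classes; last one reduced to \W
    r"[>\s]\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?\W",
    # pointless nongreedy '?' removed
    r"[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W",
]

SAMPLES = [
    "No ne<k$h>ed t<k^t>o drea<k$t>m...",  # from Jennifer's example
    " ne<kqzjwxv>ed ",                     # hypothetical long junk tag
    " see<center>text ",                   # hypothetical real 6-letter tag
    " ex<b>pand ",                         # hypothetical short real tag
    ">my<a bogus tag/>word<",              # hypothetical tag with spaces
]


def match_spans(text):
    """Return the (start, end) span of the first match for each form, or None."""
    spans = []
    for pattern in STEPS:
        m = re.search(pattern, text, flags=re.ASCII)
        spans.append(m.span() if m else None)
    return spans


def all_forms_agree(text):
    """True if all four forms match the same span (or all fail) on text."""
    return len(set(match_spans(text))) == 1


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), match_spans(sample)[-1], all_forms_agree(sample))
```

On these samples the four forms agree, which fits Keith's claim that they are equivalent, though a handful of strings is not a proof. The sketch also shows the behaviour Jennifer describes: the rule hits `<center>` but misses short junk tags such as `<k$h>`, because the tag body must be at least 6 characters long and may contain only word or whitespace characters.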