Hi Keith,

Au contraire.  That is exactly it.  That explanation was beautiful!  (I
long for your brain.  :) )  Thank you for taking the time to make that so
clear!

The rules actually work, but I suspected they were filled with garbage. 
Thanks for cleaning them up!  I'll put your shorn version on the page. 
http://spamhammers.nxtek.net  Maybe you could peek at them and get a
better idea of what we're trying to do.  There are examples of what they
match (which is what you describe below, other than the range Larry
changed): hidden tags littered through the body.  And they add up in spam.
The problem Larry is working on is one I couldn't figure out when I
"wrote" these abominations.  It is this...

I didn't know how to match '[\w\s]{,150}', the '<hidden junk tag obscuring
the word>', and miss legitimate tags such as <b>, <li> or any other tag up
to 6 letters.  I was afraid of FPs if I didn't set that high enough.  As
it is, it hits on <center> but scores low enough that I settled on that
hit just to catch more occurrences.

A second question I have is how to include all the characters they mix
into the junk tag such as '<g$b>', without breaking the rule.  I tried \S
in my ignorance, and realized it would hit on later hidden tags before it
stopped matching.  I only saw the rules hitting on spams (written that
way) but they were, in my opinion, out of control and I didn't wait to
find out if they hit ham as well. I changed them back and settled for what
I had.
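
A minimal sketch of why '\S' ran on into later tags, using Python's re
module as a stand-in for the Perl regexes in the rules.  The sample text is
taken from the spam excerpt in this message; the bounded class '[^<>\s]' is
an assumption for illustration, not something the posted rules use.  '\S'
also matches '<' and '>', so a greedy run can pass the closing '>' of one
junk tag and only stop at a later one:

```python
import re

text = "ex<kv>pan<kv$d>d"

# \S matches any non-whitespace character, including '<' and '>',
# so the run keeps going until the last '>' it can reach.
overrun = re.search(r"<\S{1,150}>", text).group(0)

# Excluding '<', '>' and whitespace keeps the run inside one tag,
# while still allowing punctuation such as '$', '^', '-' or '*'.
bounded = re.search(r"<[^<>\s]{1,150}>", text).group(0)

print(overrun)  # <kv>pan<kv$d>
print(bounded)  # <kv>
```

A class like that may be one way to allow the mixed-in characters without
the match spilling over, though it has not been checked against ham.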

Third and final... it seems to me that the two sets (popcorn and backhair)
could be combined into one ruleset by someone who understands this better
than I do, which is most likely any creature that has the ability to
manipulate a keyboard.  I tried to combine them, but decided to go ahead
and post them since they do work as is.  This doesn't matter to me really.
I made the second set only because I couldn't figure out any other way to
match both examples in that link.

The rules work great, but I would love it if you or someone else could
tweak them to match smaller tags (like in the following) and miss real
tags.

No ne<k$h>ed t<k^t>o drea<k$t>m...you can now <b>ex<kv>pan<kv$d>d
you<k-l>r john<kg>son up t<z*l>o 3 in<ib>ch<k$j>e<k$a>s
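
One possible approach, sketched in Python's re module rather than as a
tested SpamAssassin rule: look for a short tag wedged directly between two
word characters, and skip it when its name is on a list of real tags.  The
names REAL_TAGS and JUNK_TAG and the (deliberately incomplete) tag list are
hypothetical, and the 1-to-6 length limit is only a guess at "smaller tags":

```python
import re

# Hypothetical, incomplete list of real tag names to leave alone.
REAL_TAGS = ("a", "b", "i", "u", "p", "br", "hr", "em", "li", "ul", "ol",
             "td", "tr", "th", "font", "span", "center", "strong")

JUNK_TAG = re.compile(
    r"(?<=\w)"                      # a word character right before the tag
    r"<(?!/?(?:%s)[\s/>])"          # '<' not starting a listed real tag
    r"[^<>\s]{1,6}>"                # 1 to 6 non-space, non-bracket characters
    r"(?=\w)" % "|".join(REAL_TAGS),  # a word character right after the tag
    re.IGNORECASE,
)

spam = ("No ne<k$h>ed t<k^t>o drea<k$t>m...you can now <b>ex<kv>pan<kv$d>d "
        "you<k-l>r john<kg>son up t<z*l>o 3 in<ib>ch<k$j>e<k$a>s")
ham = "Some <b>bold</b> text, a <center>heading</center> and x<i>y</i>z."

print(len(JUNK_TAG.findall(spam)))  # 11
print(JUNK_TAG.findall(ham))        # []
```

A real tag that is missing from the list and sits inside a word would
still be hit, so the list would need care before anything like this went
into a scored rule.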

And if not, that is okay too.  I'm satisfied with what they're giving me
now.  I only posted these at the urging of a friend and after seeing how
much they were helping out with a sudden boatload of spam breezing
through.

The link above may shed more light if I didn't make this clear and you
would like to see the set.

Thanks again for the great explanation!!  wow
Jennifer

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Keith C. Ivey
Sent: Saturday, October 11, 2003 5:02 PM
To: [EMAIL PROTECTED]
Subject: RE: [SAtalk] Popcorn, Backhair, and Weeds

Larry Gilson <[EMAIL PROTECTED]> wrote:

> I had the following HTML tag OBFU rule (variant of yours):
>   /(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)/

There's a lot of clutter in that that makes it harder to
follow.  Let's try paring it down.  First, '<' and '>' are not
special on their own in regexes, so there's no need to
backslash them:

/(>|\s)\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?(\s|\W|<)/
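
As a quick check of that point, here is a small sketch in Python's re
module, which treats these escapes the same way Perl does.  The sample
strings and variable names are made up for illustration.  Note that in a
Perl /.../ pattern the '/' still needs its backslash because it is the
delimiter; in Python it does not matter either way:

```python
import re

escaped = re.compile(r"\<\/?b\>")   # every bracket backslashed
plain = re.compile(r"</?b>")        # same pattern without the clutter

samples = ["<b>", "</b>", "b>", "<b", "<i>"]

# Record, for each sample, whether each pattern matches the whole string.
results = [(bool(escaped.fullmatch(s)), bool(plain.fullmatch(s)))
           for s in samples]

print(all(a == b for a, b in results))  # True
```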

When you have an alternation -- something like '(a|b|c)' --
where all the alternatives are single characters, it's better
to write it as a character class -- something like '[abc]'.
Also, '\s' and '<' are both included in '\W', so that last
alternation is equivalent to just '\W':

/[>\s]\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?\W/
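
The equivalence can be checked character by character.  A small sketch in
Python's re module (used here in place of Perl; the variable names are
illustrative) that compares the alternation with the bare '\W' over every
printable ASCII character:

```python
import re
import string

alternation = re.compile(r"(\s|\W|<)")
nonword = re.compile(r"\W")

# Characters on which the two patterns disagree; expected to be none,
# since whitespace and '<' are themselves nonword characters.
differences = [c for c in string.printable
               if bool(alternation.fullmatch(c)) != bool(nonword.fullmatch(c))]

# The same comparison for the leading '(>|\s)' versus '[>\s]'.
lead_differences = [c for c in string.printable
                    if bool(re.fullmatch(r"(>|\s)", c))
                    != bool(re.fullmatch(r"[>\s]", c))]

print(differences, lead_differences)  # [] []
```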

Now, nongreedy matching serves no purpose when the thing
following it can't be matched by the thing being repeated.  In
this case you have '\w{1,5}?' followed by '<', but '<' can't
match '\w', so there's no difference between greedy and
nongreedy matching there.  The matching for the series of '\w'
characters has to go all the way to the '<' -- it can't stop
short.  Similarly, the '\W' at the end can never match the '\w'
preceding it, so that '?' is also pointless:

/[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W/
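
A small sketch of that claim in Python's re module (standing in for Perl;
the sample strings are invented for the test).  It runs the nongreedy and
greedy versions over the same inputs and compares where each match starts
and ends:

```python
import re

lazy = re.compile(r"[>\s]\w{1,5}?</?\s?[\w\s]{6,150}/?\s?>\w{1,7}?\W")
greedy = re.compile(r"[>\s]\w{1,5}</?\s?[\w\s]{6,150}/?\s?>\w{1,7}\W")

samples = [
    " vi<font size>agra now",    # 9-character tag between word pieces
    ">ab<center>cd<",            # 6-letter tag, '>' before and '<' after
    " toolongword<center>x ",    # more than 5 word characters before '<'
    "no tags here",
]

def span_of(pattern, text):
    """Return the (start, end) of the first match, or None."""
    m = pattern.search(text)
    return m.span() if m else None

spans = [(span_of(lazy, s), span_of(greedy, s)) for s in samples]

print(all(a == b for a, b in spans))  # True
```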

That regex is equivalent to your original one, and may help you
see better why it's not matching as you expect.  It's looking
for

   a '>' or whitespace character (space, tab, carriage return,
      line feed, form feed),
   followed by 1 to 5 word characters (letters, digits, and
      underscores),
   followed by '<',
   followed by an optional '/',
   followed by an optional single whitespace character,
   followed by 6 to 150 word or whitespace characters,
   followed by an optional '/',
   followed by an optional single whitespace character,
   followed by '>',
   followed by 1 to 7 word characters,
   followed by a nonword character (anything other than
      letters, digits, and underscore).

I'm not clear on what you want to match, but that's probably
not it.
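
The breakdown above can also be written as a commented regex.  This is a
sketch in Python's re module with the VERBOSE flag (Perl's /x flag works
along the same lines); the name TAG_OBFU and the test strings are
illustrative only, with the junk-tag sample borrowed from the spam excerpt
quoted in this thread:

```python
import re

TAG_OBFU = re.compile(r"""
    [>\s]           # a '>' or one whitespace character
    \w{1,5}         # 1 to 5 word characters
    <               # literal '<'
    /?              # optional '/'
    \s?             # optional single whitespace character
    [\w\s]{6,150}   # 6 to 150 word or whitespace characters
    /?              # optional '/'
    \s?             # optional single whitespace character
    >               # literal '>'
    \w{1,7}         # 1 to 7 word characters
    \W              # one nonword character
""", re.VERBOSE)

hits_center = TAG_OBFU.search(" te<center>xt ") is not None
hits_junk = TAG_OBFU.search("No ne<k$h>ed t<k^t>o drea<k$t>m") is not None

print(hits_center)  # True
print(hits_junk)    # False
```

Short tags containing punctuation, such as '<k$h>', fail both the
6-character minimum and the '[\w\s]' class, while a real 6-letter tag such
as '<center>' between word characters is matched.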

-- 
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC



-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
SourceForge.net hosts over 70,000 Open Source Projects.
See the people who have HELPED US provide better services:
Click here: http://sourceforge.net/supporters.php
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk



