RE: [SAtalk] [RD] Popcorn, Backhair, and Weeds

Larry Gilson Tue, 28 Oct 2003 13:42:43 -0800

> -----Original Message-----
> From: Keith C. Ivey

Thanks for the reply, Keith, and sorry for the long delay in my response.

> Larry Gilson <[EMAIL PROTECTED]> wrote:
> 
> >   full  MY_FULL_OBFU_HTML  /[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+/
> 
> It seems to me that you'd want to catch the obfuscating 
> pseudo-comments with '!' as well.  Have you tried it with '[^>]' as 
> the character class, so that you'll match regardless of what's 
> in the angle brackets?

The pseudo-comments are captured in a different rule.  The obfuscating HTML
tags produce a number of false positives in the {1,6} range that the
pseudo-comments do not, so the pseudo-comment rule will cover {1,150} while
the HTML rule covers {6,150}.  I am experimenting with ways to reduce the FPs
in the {1,6} range.
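
As a rough illustration of the character-class point, here is a minimal
sketch in Python.  It assumes Python's re module treats these particular
patterns the same way Perl does.  ANY_TAG_RULE is only an illustrative
variant of the posted rule with '[^>]' swapped in, and the sample strings
are made up:

```python
import re

# The rule as posted, with its restricted character class.
HTML_RULE = re.compile(r"[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+")
# Illustrative variant: same rule with the suggested '[^>]' class.
ANY_TAG_RULE = re.compile(r"[\s>]\w+<[^>]{1,6}>\w+")

# Made-up samples: one bogus tag, one '!' pseudo-comment.
samples = {
    "tag": " vi<xyz>agra",
    "comment": " vi<!x>agra",
}

def hits(rule):
    """Return the names of the samples that the given rule matches."""
    return sorted(name for name, text in samples.items() if rule.search(text))

print(hits(HTML_RULE))
print(hits(ANY_TAG_RULE))
```

The restricted class skips the '!' sample, which is why the pseudo-comments
need their own rule, while the '[^>]' variant catches both.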

> Also, why do you require whitespace or '>' before the first 
> sequence of word characters?  What if there's a '-' or a '(' 
> there instead.  Have you tried leaving it off completely, in 
> which case the '+' after the '\w' is unnecessary (in fact, the 
> '+' after the last '\w' isn't doing anything now).  Then the 
> regex would look like this:
> 
>    /\w<[^>]{1,6}>\w/

I tried that and got more FPs than I wanted.  Starting the match with [\s>]
just about eliminates the FPs without reducing the effectiveness.
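
The trade-off can be seen with a small Python sketch.  This is only an
illustration on made-up strings, not corpus results, and it assumes Python's
re module handles these patterns the same way Perl does.  Note that the
unanchored pattern also uses the wider '[^>]' class, as in the suggestion
quoted above:

```python
import re

# Posted rule: requires whitespace or '>' before the first word.
ANCHORED = re.compile(r"[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+")
# Suggested alternative: no leading anchor, any non-'>' tag content.
UNANCHORED = re.compile(r"\w<[^>]{1,6}>\w")

def compare(text):
    """Return (anchored_matches, unanchored_matches) for one string."""
    return (bool(ANCHORED.search(text)), bool(UNANCHORED.search(text)))

for text in ["buy vi<xyz>agra now", "(vi<xyz>agra", "-vi<xyz>agra"]:
    print(repr(text), compare(text))
```

The anchored rule only fires when the broken word follows whitespace or '>',
so it misses the '(' and '-' cases but has fewer places where it can match.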

> I still think you're going to get too many FPs, though.  This 
> problem may be something better tackled during the HTML 
> analysis.  There could be a counter for bad tags (perhaps 
> separate ones for tags that are illegally formed and those that 
> are simply unrecognized).  Then a series of eval tests could 
> use the count.  Avoiding FPs for XML documents could be a 
> problem though.

It sounds to me as though the implementation you are describing would
require meta rules that combine multiple pattern matches rather than trying
to force one pattern to match everything.  I believe you are correct.  I am
glad you mentioned XML.  I have not had a problem with XML yet, but I would
not doubt that one is close at hand.
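
To make the counter idea concrete, here is a minimal sketch in Python using
the standard html.parser module.  It is not SpamAssassin code.  KNOWN_TAGS
is a hypothetical and deliberately tiny list, and the sketch only counts
unrecognized start tags, not illegally formed ones:

```python
from html.parser import HTMLParser

# Hypothetical whitelist for illustration; a real one would be far longer.
KNOWN_TAGS = {"a", "b", "body", "br", "div", "font", "html", "i", "p",
              "span", "table", "td", "tr"}

class UnknownTagCounter(HTMLParser):
    """Counts start tags whose names are not in KNOWN_TAGS."""

    def __init__(self):
        super().__init__()
        self.unknown = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases the tag name before calling this hook.
        if tag not in KNOWN_TAGS:
            self.unknown += 1

def count_unknown_tags(html):
    parser = UnknownTagCounter()
    parser.feed(html)
    parser.close()
    return parser.unknown

print(count_unknown_tags("vi<xyz>agra ci<qq>alis <b>ok</b>"))
print(count_unknown_tags("<invoice><item>1</item></invoice>"))
```

A test could then fire once the count passes some threshold.  The second
call shows the XML concern: every element of an ordinary XML fragment counts
as an unknown tag.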

> > To try to curb the FPs for tests within the {1,5} range, I will 
> > experiment with the following rule:
> > 
> >   full  MY_FULL_OBFU_HTML  /([\s>]\w+<[\w\s\/\$&;]{1,6}>\w+){2,}/
> 
> That will only match when one word is interrupted by more than 
> one obfuscating pseudo-tag.

I guess I was hoping that I could match two obfuscated words listed
sequentially.  While I may find a legitimate:

  Sincerely<br>George Banks<br>

I never find:

  Sincerely<br>George Banks<BR>President

However, I do realize that there are more situations than just what I
experience.  You are probably on the right track in that an HTML analysis is
needed.  I would imagine that we are headed down the road of producing an
eval test rather than just a series of rules or even meta rules.
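
For reference, the two signature examples above can be run through both
versions of the rule with a short Python sketch.  It assumes Python's re
module handles these patterns the same way Perl does, and it keeps the
two-space indent so that the first word is preceded by whitespace:

```python
import re

SINGLE = re.compile(r"[\s>]\w+<[\w\s\/\$&;]{1,6}>\w+")
# Experimental rule: two or more back-to-back repetitions of the pattern.
REPEATED = re.compile(r"([\s>]\w+<[\w\s\/\$&;]{1,6}>\w+){2,}")

def check(text):
    """Return (single_matches, repeated_matches) for one string."""
    return (bool(SINGLE.search(text)), bool(REPEATED.search(text)))

legit = "  Sincerely<br>George Banks<br>"
unusual = "  Sincerely<br>George Banks<BR>President"
two_words = " vi<x>agra ci<y>alis"  # made-up obfuscated pair

for text in (legit, unusual, two_words):
    print(repr(text), check(text))
```

The single-pattern rule fires on both signatures, while the {2,} version
skips the first one because nothing follows the trailing <br>.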

Thanks for your input Keith!

--Larry

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
