On 15 Dec 2003 11:57:57 +0200, [EMAIL PROTECTED] writes:

> On Fri, 12 Dec 2003 09:35:18 -0800 (PST), Regis Wilson
> <[EMAIL PROTECTED]> posted to spamassassin-talk:
>  > I read some articles that urge against the {} notation.  Can
>  > someone enlighten me?
> 
> It's a Perl efficiency thing. If you are confident that your rule
> works OK, you may want to change it slightly to avoid the braces. Look
> in the archives for recent postings from Scott A Crosby.
> <http://search.gmane.org/search.php?group=gmane.mail.spam.spamassassin.general&query=crosby>
> 

I must apologize. The truth is messy, but lies are simple and can be
mostly correct.

{} aren't necessarily bad, but they can make it easier to make bad
patterns. There's no difference between [xy]{4,} and
[xy][xy][xy][xy][xy]*. (There is between x{4,} and xxxxx* because in
the second case, it can use the literal match optimization on 'xxxx')

As given, the regexp:

    /(\n.{74}=){3,}/

is about as good as you can get for using the RE hammer to hit this nail.

Since '\n' is disjoint from '.', and it is a literal, Perl's literal
optimization would scan for a \n, then immediately start the RE
engine, which wouldl try to gobble up 74 non-newlines (no
non-determinism here!), followed by an '=', then repeat gobbling up
74-char lines until it can't. If it finds finds it matched less than
three, it'll give up. If Keith's suggestion of change it to:

    /(\n.{74}=){3}/

It would give up after exactly 3 lines matched, which would not change
the worst case but would be some improvement.

Since there is nothing after the {3,}, there's no issue of backing off
and having it try to match fewer lines. But, had the pattern been:

    /(\n.{74}=){3,}Foo/

Then perl would gobble up as many 74-char lines as possible, try to
find 'Foo'. If that failed, it would remember where each line ended
and backoff to the prior line and try to match 'Foo', giving up when
it had backed off to less than 3 lines.

Anyways, 

    /(\n.{74}=){3}/

will take linear time to match because there's no nondeterminism in
the pattern at all. Well, thats actually a lie. Perl has to try to
match this RE at all points. Its really trying to see if this pattern
matches the entire input:

    /(.|\n)*(\n.{74}=){3}/

Fortunately for this pattern, the literal optimization will scan for a
'\n', then enter the RE engine at only those points to see if the
entire [origional] pattern matches. Although it doesn't look it, and
I thought on first glance it wasn't, and it probably wasn't
deliberately designed this way, the pattern is surprisingly robust. In
the Perl matching engine, the worst case would require the engine to
scan each character three times.

This is one of the reasons I have mostly given up on analyzing regexp
patterns for denial of service attacks. The literal optimization
combined with how people use regexps almost always makes robust
patterns.

Scott


-------------------------------------------------------
This SF.net email is sponsored by: IBM Linux Tutorials.
Become an expert in LINUX or just sharpen your skills.  Sign up for IBM's
Free Linux Tutorials.  Learn everything from the bash shell to sys admin.
Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk

Reply via email to