On 15 Dec 2003 11:57:57 +0200, [EMAIL PROTECTED] writes: > On Fri, 12 Dec 2003 09:35:18 -0800 (PST), Regis Wilson > <[EMAIL PROTECTED]> posted to spamassassin-talk: > > I read some articles that urge against the {} notation. Can > > someone enlighten me? > > It's a Perl efficiency thing. If you are confident that your rule > works OK, you may want to change it slightly to avoid the braces. Look > in the archives for recent postings from Scott A Crosby. > <http://search.gmane.org/search.php?group=gmane.mail.spam.spamassassin.general&query=crosby> >
I must apologize. The truth is messy, but lies are simple and can be mostly correct. {} aren't necessarily bad, but they can make it easier to make bad patterns. There's no difference between [xy]{4,} and [xy][xy][xy][xy][xy]*. (There is between x{4,} and xxxxx* because in the second case, it can use the literal match optimization on 'xxxx') As given, the regexp: /(\n.{74}=){3,}/ is about as good as you can get for using the RE hammer to hit this nail. Since '\n' is disjoint from '.', and it is a literal, Perl's literal optimization would scan for a \n, then immediately start the RE engine, which wouldl try to gobble up 74 non-newlines (no non-determinism here!), followed by an '=', then repeat gobbling up 74-char lines until it can't. If it finds finds it matched less than three, it'll give up. If Keith's suggestion of change it to: /(\n.{74}=){3}/ It would give up after exactly 3 lines matched, which would not change the worst case but would be some improvement. Since there is nothing after the {3,}, there's no issue of backing off and having it try to match fewer lines. But, had the pattern been: /(\n.{74}=){3,}Foo/ Then perl would gobble up as many 74-char lines as possible, try to find 'Foo'. If that failed, it would remember where each line ended and backoff to the prior line and try to match 'Foo', giving up when it had backed off to less than 3 lines. Anyways, /(\n.{74}=){3}/ will take linear time to match because there's no nondeterminism in the pattern at all. Well, thats actually a lie. Perl has to try to match this RE at all points. Its really trying to see if this pattern matches the entire input: /(.|\n)*(\n.{74}=){3}/ Fortunately for this pattern, the literal optimization will scan for a '\n', then enter the RE engine at only those points to see if the entire [origional] pattern matches. Although it doesn't look it, and I thought on first glance it wasn't, and it probably wasn't deliberately designed this way, the pattern is surprisingly robust. In the Perl matching engine, the worst case would require the engine to scan each character three times. This is one of the reasons I have mostly given up on analyzing regexp patterns for denial of service attacks. The literal optimization combined with how people use regexps almost always makes robust patterns. Scott ------------------------------------------------------- This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin. Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click _______________________________________________ Spamassassin-talk mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/spamassassin-talk