Re: exegesis 5 question: matching negative, multi-byte strings

esp5 Tue, 01 Oct 2002 19:49:09 -0700

On Tue, Oct 01, 2002 at 05:24:43PM -0400, Peter Behroozi wrote:
> On Tue, 2002-10-01 at 16:44, [EMAIL PROTECTED] wrote:
> > doesn't work (just tried it out, not sure why it doesn't) but even if it did,
> > it would be awful slow. It would try one character, look at the next for the 
> > string union, come back for the next character, look for the string union,
> > etc. etc. etc.
> > 
> > whereas
> > 
> > ([^u]+|u[^n]....)
> > 
> > doesn't do any backtracking at all..
> > 
> > Ed
> 
> perl -e ' $a = "this has the string union in it"; 
> my ($b) = ($a =~ m"((?:(?!union).)*)"); print $b;'
> 
> prints the desired result for me at least.  It also should be comparably


whoops. Must have mistyped. Works for me now.

> efficient to the alternation since the match for the string 'union'
> should fail if the first character is not 'u', etc.  The alternation
> also matches a character at a time except in special cases, where I am
> reasonably sure that the extra overhead from alternation compensates for
> multi-character matching.  This method also does no backtracking for the
> provided example; I am not sure what made you think that it did.
> 
> Peter
> 

well, when I said backtracking, I meant it didn't flip between the current 
character and the next. I couldn't check real numbers doing benchmarking 
because the ?! construct core dumps on both perl-5.6.1 and perl-5.8 on large 
strings.

But when benchmarked on small (30 line strings) using:

my $regex1 = qr{(?:(?!union).)*}sx;
my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;

timethese
(100000,
    {   
        'questionbang' => sub { my ($b) = ($line =~ m"($regex1)"); },
        'alternation'   => sub { my ($b) = ($line =~ m"($regex2)"); }
    }
);

I get:

Benchmark: timing 100000 iterations of alternation, questionbang...
alternation: 11 wallclock secs (10.66 usr +  0.00 sys = 10.66 CPU) @ 9380.86/s 
(n=100000)
questionbang: 18 wallclock secs (18.81 usr +  0.00 sys = 18.81 CPU) @ 5316.32/s 
(n=100000)

so ?! is a bit slower. It could probably be made faster though.

However, I'm still skeptical as it being a good replacement for the alternation.
Look at my posted message (about making the regex be able to handle nested 
parens, etc) and see if you can come up with an easy way handle the case I 
mentioned there..

Ed

Re: exegesis 5 question: matching negative, multi-byte strings

Reply via email to