Re: exegesis 5 question: matching negative, multi-byte strings

esp5 Wed, 02 Oct 2002 11:06:32 -0700

On Wed, Oct 02, 2002 at 10:39:17AM +0300, Markus Laire wrote:
> On 1 Oct 2002 at 18:47, [EMAIL PROTECTED] wrote:
> 
> > > > all text up to, but not including the string "union".
> > >
> > > rule getstuffbeforeunion { (.*?) union | (.*) }
> > > 
> > > "a union" => "a "
> > > "b" => "b"
> > 
> > hmm... well, it works, but its not very efficient. It basically 
> > scans the whole string to the end to see if there is a "union" string, and 
> > then backtracks to take the alternative. And hence, its not very scalable. 
> > It also doesn't 'complexify' very well.
> 
> What about
> 
> Perl 5:   /(.*?)(?:union|$)/
> Perl 6:   /(.*?) [union | $$]/
> 
> or if you want to exlude 'union' from match
> 
> Perl 5:   /(.*?)(?=union|$)/
> Perl 6:   /(.*?) [<after: union> | $$]/
>


that's exceedingly slow, at least by my benchmark. So far, I've got 4 
possibilities:

        my $regex1 = qr{(?:(?!union).)*}sx;
        my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx;
        my $regex3 = qr{(?:[^u]+|(?!union).)*}sx;
        my $regex4 = qr{(.*?)(?=union|$)}sx;

        timethese
        (
                100000,
        {
                'questionbang'  => sub { ($line =~ m"($regex1)"); },
                'questionbang2' => sub { ($line =~ m"($regex3)"); },
                'alternation'   => sub { ($line =~ m"($regex2)"); }
                'nongreedy'     => sub { ($line =~ m"($regex4)"); },
                }
        );


which come out:

alternation:  8 wallclock secs ( 7.71 usr +  0.00 sys =  7.71 CPU) @ 12970.17/s 
(n=100000)
questionbang: 17 wallclock secs (16.05 usr +  0.00 sys = 16.05 CPU) @ 6230.53/s 
(n=100000)
questionbang2:  8 wallclock secs ( 7.74 usr +  0.00 sys =  7.74 CPU) @ 12919.90/s 
(n=100000)
nongreedy: 41 wallclock secs (41.74 usr +  0.00 sys = 41.74 CPU) @ 2395.78/s (n=100000)


So yes, a form can be constructed out of ?! which is of approximately equal 
speed to the alternation.

However, in straight C, the corresponding time is:

2.31u 0.02s 0:02.37 98.3%

which tells me that a lot of optimisation could be made with a generic 
mechanism for (non)matching multi-byte character classes. The problem has 
to be dealt with anyways when considering unicode... And which form would people
rather type:

        (<-[^u]>+|(?!union).)*

or
        <-[^'union']>*

I'd say the second scores over the first in intuition, if nothing else...

Ed

Re: exegesis 5 question: matching negative, multi-byte strings

Reply via email to