On Wed, Oct 02, 2002 at 10:39:17AM +0300, Markus Laire wrote: > On 1 Oct 2002 at 18:47, [EMAIL PROTECTED] wrote: > > > > > all text up to, but not including the string "union". > > > > > > rule getstuffbeforeunion { (.*?) union | (.*) } > > > > > > "a union" => "a " > > > "b" => "b" > > > > hmm... well, it works, but its not very efficient. It basically > > scans the whole string to the end to see if there is a "union" string, and > > then backtracks to take the alternative. And hence, its not very scalable. > > It also doesn't 'complexify' very well. > > What about > > Perl 5: /(.*?)(?:union|$)/ > Perl 6: /(.*?) [union | $$]/ > > or if you want to exlude 'union' from match > > Perl 5: /(.*?)(?=union|$)/ > Perl 6: /(.*?) [<after: union> | $$]/ >
that's exceedingly slow, at least by my benchmark. So far, I've got 4 possibilities: my $regex1 = qr{(?:(?!union).)*}sx; my $regex2 = qr{(?:[^u]+|u[^n]|un[^i]|uni[^o]|unio[^n])*}sx; my $regex3 = qr{(?:[^u]+|(?!union).)*}sx; my $regex4 = qr{(.*?)(?=union|$)}sx; timethese ( 100000, { 'questionbang' => sub { ($line =~ m"($regex1)"); }, 'questionbang2' => sub { ($line =~ m"($regex3)"); }, 'alternation' => sub { ($line =~ m"($regex2)"); } 'nongreedy' => sub { ($line =~ m"($regex4)"); }, } ); which come out: alternation: 8 wallclock secs ( 7.71 usr + 0.00 sys = 7.71 CPU) @ 12970.17/s (n=100000) questionbang: 17 wallclock secs (16.05 usr + 0.00 sys = 16.05 CPU) @ 6230.53/s (n=100000) questionbang2: 8 wallclock secs ( 7.74 usr + 0.00 sys = 7.74 CPU) @ 12919.90/s (n=100000) nongreedy: 41 wallclock secs (41.74 usr + 0.00 sys = 41.74 CPU) @ 2395.78/s (n=100000) So yes, a form can be constructed out of ?! which is of approximately equal speed to the alternation. However, in straight C, the corresponding time is: 2.31u 0.02s 0:02.37 98.3% which tells me that a lot of optimisation could be made with a generic mechanism for (non)matching multi-byte character classes. The problem has to be dealt with anyways when considering unicode... And which form would people rather type: (<-[^u]>+|(?!union).)* or <-[^'union']>* I'd say the second scores over the first in intuition, if nothing else... Ed