Re: Advanced regex question - backtracking vs. negative lookaheads

Jeremy Fairbrass Tue, 25 Apr 2006 06:35:06 -0700

Thanks guys for the clarifications! My understanding of how regex worked was
the same as Bowie's, i.e.:
-----
> My understanding is that with [^"]+ the engine will scan from left to
> right until it finds a quote.  Then, in the context of the previous
> regex, it will start backtracking to find a match for "color:blue".
-----

I use the free Regex Coach tool from http://www.weitz.de/regex-coach/ to
test my regexes, and it works the way Bowie described above, i.e. using
backtracking. In other words, using:

    /style="[^>]+color:blue/

...the [^>]+ causes the engine to scan all the way to the closing > character,
and then backtrack until it finds the "color:blue" part. This also agrees with
what is explained at www.regular-expressions.info, which I believe is a
reliable guide to Perl regex.
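
A minimal sketch of the greedy behaviour described above, using Python's re
module as a stand-in for Perl (an assumption; both are backtracking engines
and treat these constructs the same way). The sample tag is hypothetical. It
contains "color:blue" twice, so the end of the match shows that the engine
scanned to the > first and then backed up:

```python
import re

# Hypothetical sample tag with two "color:blue" occurrences.
tag = '<span style="font-size:small; color:blue; background-color:blue; border:0px">'

# Greedy pattern from the post.
greedy = re.compile(r'style="[^>]+color:blue')

# [^>]+ first consumes everything up to the closing ">", then gives back
# one character at a time until "color:blue" can match. The first position
# where that succeeds is the LAST occurrence in the tag.
greedy_match = greedy.search(tag).group(0)
print(greedy_match)
# style="font-size:small; color:blue; background-color:blue
```
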

Also, Bowie suggested using laziness instead:

    /style="[^"]+?color:blue/

But I believe laziness also uses backtracking, so I'm not sure there is
*much* of an advantage of this over the greedy regex shown above. Probably
the main advantage of the lazy version would be when there is little or no
text between the first quote mark and the "color:blue" part, and/or lots of
text between "color:blue" and the last quote mark, e.g.:

    <span style="color:blue; font-size:small; border:0px">

...The regex would hit this much quicker using the lazy version than the
greedy version. (Strictly speaking, because + needs at least one character,
"color:blue" directly after the quote as shown here would only match with *
in place of +.) But I'm not sure there really is a difference overall,
especially since I want to hit SPAN tags that might have more text before
the "color:blue" OR more text afterwards. Probably it's six of one and half
a dozen of the other, right?! Why did David describe the lazy version as
"slightly less good" than the greedy version?
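
To compare the two versions, here is a small sketch in Python's re module
(used as a stand-in for Perl, which is an assumption; the sample tags are
made up). It only checks what each pattern matches, not how fast. It also
checks the tag shown above, where "color:blue" comes directly after the
quote and the + quantifier has no character to consume:

```python
import re

# The two patterns from the post.
greedy = re.compile(r'style="[^>]+color:blue')
lazy = re.compile(r'style="[^"]+?color:blue')

# Hypothetical sample tags: target at the end, in the middle, and absent.
tags = [
    '<span style="font-size:small; color:blue">',
    '<span style="border:0px; color:blue; font-size:small">',
    '<span style="font-size:small; color:red">',
]

# Both versions give the same yes/no answer for each tag.
results = [(bool(greedy.search(t)), bool(lazy.search(t))) for t in tags]
print(results)  # [(True, True), (True, True), (False, False)]

# "color:blue" directly after the quote: + needs at least one character
# before it, so neither pattern matches.
direct = '<span style="color:blue; font-size:small; border:0px">'
direct_results = (bool(greedy.search(direct)), bool(lazy.search(direct)))
print(direct_results)  # (False, False)

# With * instead of +, the lazy pattern matches immediately.
star = re.compile(r'style="[^"]*?color:blue')
star_hit = bool(star.search(direct))
print(star_hit)  # True
```
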

Incidentally the reason I used [^>]+ rather than [^"]+ was to prevent it
from using lots of memory if there was no closing quote - as an alternative
to using {1,20}.
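
One side effect of [^>]+ worth checking is that it can run past the closing
quote of the style attribute. The sketch below (Python's re module as a
stand-in for Perl, with hypothetical sample tags) compares it with [^"]+ and
with a bounded {1,20} version. It says nothing about memory use, only about
what each pattern accepts:

```python
import re

to_bracket = re.compile(r'style="[^>]+color:blue')      # stops at ">"
to_quote = re.compile(r'style="[^"]+color:blue')        # stops at the closing quote
bounded = re.compile(r'style="[^"]{1,20}color:blue')    # at most 20 chars before the target

# Hypothetical tag: "color:blue" appears only in a different attribute.
other_attr = '<span style="font-size:small" title="color:blue">'
crosses_quote = bool(to_bracket.search(other_attr))
stays_in_attr = bool(to_quote.search(other_attr))
print(crosses_quote, stays_in_attr)  # True False

# The bounded version only matches when the text before "color:blue" is short.
short_ok = bool(bounded.search('<span style="font-size:small; color:blue">'))
long_ok = bool(bounded.search(
    '<span style="font-family:Arial, Helvetica, sans-serif; color:blue">'))
print(short_ok, long_ok)  # True False
```

So [^>]+ may hit on tags where "color:blue" is outside the style value,
while {1,20} may miss tags with long style values.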

In any case, both Bowie and David agree that my first solution using
(.(?!color))+ is a really bad idea, and that was the main thing I wanted to
know! :)

Thanks,
Jeremy

"Bowie Bailey" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> David Landgren wrote:
>> Bowie Bailey wrote:
>>
>> [...]
>>
>> > > An alternative solution would be this:
>> > >
>> > > /style="[^>]+color:blue/
>> >
>> > This looks better.  It is probably less resource-intensive than
>> > your previous attempt and is definitely easier to read.  But why
>> > are you looking for > when you anchor the beginning with a quote?
>> >
>> > How about this:
>> >
>> >     /style="[^"]+?color:blue/
>> >
>> > This is also non-greedy, so it will start looking for the
>> > "color:blue" match at the beginning of the string instead of
>> > having the + slurp up everything up to the quote and then
>> > backtracking to find the match.
>>
>> The regexp engine doesn't slurp. It just scans from left to right,
>> noting "I might have to come back here" along the way.
>
> Ok, so "slurp" was a bit of a simplification.  :)
>
> My understanding is that with [^"]+ the engine will scan from left to
> right until it finds a quote.  Then, in the context of the previous
> regex, it will start backtracking to find a match for "color:blue".
>
> In any case, with the non-greedy quantifier, it will stop looking when
> it finds the first "color:blue" string instead of continuing to the
> end of the string.
>
> -- 
> Bowie
>
