On Fri, 15 Dec 2000, Tom Christiansen wrote:

> >As for special-case rules, I believe that my proposed modification would
> >REMOVE a special-case semantic rule, at the cost of added complexity at the
> >implementation level.  
> 
> What is this alleged "special-case rule" you are talking about?
> There is no such thing.  None.  When you write /pat/, it means to
> find the first such pattern.  There is no special case here.

The special case is "as long as it has the earliest starting position".

There may be many, many possible matches for a regexp in a given string,
especially with an expression as inclusive as ".*".  Especially since many
of those matches overlap, it would be pretty worthless to get back EVERY
single possible match within a string -- too much duplication.

So, you have to apply some disambiguating rules to identify which matches
are "interesting" enough to be worth paying attention to.  The first rule
that was applied was to prefer the longest match, the "greediness" rule.
Then, to prefer the earliest match, to disambiguate between competing,
independent possible matches, and to find an answer sooner when searching.
No doubt, these rules were probably in the very first usable regexp system.

Perl 5 changed the rules a bit.  Rather than always disambiguating in favor
of the longest match, the "non-greedy" option can now reverse that, into a
preference for the shortest match.  Unfortunately, this creates a conflict
with the earliest-match rule that never existed with greedy matching.  This
was resolved in Perl 5 by insisting that leftmost always trumps non-greedy,
which is simple, if a little arbitrary.

I'm suggesting that BOTH rule preferences should be applied to the greatest
extent possible, and only when the conflict is unavoidable should leftmost
take precedence to resolve the conflict.  It's somewhat of a "minimax"
algorithm, rather than a strict priority.

You can't explain why "bbbbccccd" matches without making reference to the
absolute priority of the leftmost rule.  "bccccd" would still make sense
(locally) without reference to that rule.

If we want the first interesting match, and we're preferring early matches
and short matches, I believe that "bccccd" is more interesting.

Deven

Reply via email to