> What exactly is matched by \g and \G is controlled by two new special
> variables, @^g and @^G, which are arrays of strings.
These sorts of global variables have been a problem in the past.
Since they change the meaning of the \g and \G escapes, I think they
should be pragmas or some other declaration that has a lexical scope.
This puzzle actually pops up in your RFC:
    It is a run-time error to compile a regular expression that
    contains \g or \G while the @^g and @^G arrays do not contain
    the same number of elements.
If it is a run-time error to compile a regex, that means that the
regex compilation is occurring at run time. That is a recipe for very
slow regexes. Regex compilation needs to happen at compile time
except in special cases.
If the declarations have lexical scope, then Perl will be able to
optimize regexes that contain \g and \G. With @^G and @^g global
variables, every regex that uses \g and \G will have to explicitly
examine the global variables every time it wants to match \g or \G,
because the values of @^g and @^G will not be known until run time and
might vary. But if you have a lexically scoped declaration instead of
a global variable, then Perl will be able to compile \g as if you had
said [()] or whatever, and \G similarly. This will make the regex
engine run faster.
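Here is a rough Perl 5 sketch of the difference. The package variable
$OPENERS and the sub starts_with_opener are hypothetical stand-ins for
the proposed @^g, not anything from the RFC. The point is only that a
pattern depending on a global can change meaning from a distance, while
a pattern written out literally is fixed when the code is compiled.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed @^g: a package global holding a
# character class of opening delimiters.
our $OPENERS = '[(\[]';

sub starts_with_opener {
    my ($text) = @_;
    # The pattern interpolates a global, so Perl has to consult the
    # variable's current value each time this line runs.
    return $text =~ /^$OPENERS/ ? 1 : 0;
}

my $before = starts_with_opener("[x]");    # "[" is currently an opener

my $during;
{
    # Some distant piece of code changes the global for a while...
    local $OPENERS = '[(]';
    # ...and the same sub now gives a different answer.
    $during = starts_with_opener("[x]");
}

# A literal pattern has no such dependency: it is compiled once and
# nothing outside this scope can change what it matches.
my $fixed = qr/^[(\[]/;
my $always = "[x]" =~ $fixed ? 1 : 0;

print "before=$before during=$during fixed=$always\n";
```

A lexically scoped declaration would let \g behave like the $fixed
case instead of the $OPENERS case.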
(As a side note, there is no such variable as $^g, so you will have to
think of something else to call it. Perhaps ${^Group_Open} and
${^Group_Close}?)
(Also, the \G escape already has a meaning in Perl 5, so it would
probably be better to think of some other name.)
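For reference, a small sketch of what \G does today in Perl 5: it
anchors the match at pos(), the point where the previous /g match on
the same string stopped. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $text = "aaabaa";
my @positions;

# A /g match in scalar context consumes one "a" and records where it
# stopped in pos($text).
$text =~ /a/g;
push @positions, pos($text);

# \G forces each following match to begin exactly at pos(), so the loop
# stops at the "b" instead of skipping ahead to the later "a"s.
while ($text =~ /\Ga/g) {
    push @positions, pos($text);
}

print "@positions\n";    # prints "1 2 3"
```

Any pattern that already relies on this would change meaning if \G
were reused for closing delimiters.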
> =head1 PROBLEMS
>
> How should a \G without a prior \g be interpreted in a regular expression?
I don't think that's a big problem. One reasonable option is to make
it a compile-time error:
    \G without preceding \g in pattern at ...
So presumably Larry will be able to think of other reasonable
behaviors also.
The big problem I see that you didn't address is that you didn't say
what would happen when the target string contains mismatched
parentheses.
Your example was:
    $string = "([b - (a + 1)] * 7)";
    $string =~ /\g.*?\G/;
Now here \g matches the "(" and sets up \G so that \G will only match
the corresponding ")". Then .*? matches "[b - (a + 1)] * 7" and \G
matches the ")".
Now suppose the string were
    $string = "(b - a + 1] * 7)";
    $string =~ /\g.*?\G/;
Now what happens here? \g matches "(" and sets up \G so that \G will
only match the corresponding ")". Then what? I'm not sure from your
proposal.
Your later example (in the 'implementation' section) suggests that '['
and ']' are ignored once \g matches a '('. If that is true, then in
the example above, the .*? would match "b - a + 1] * 7". I think
this won't be what people will want from \g...\G. We are still going
to get a lot of questions from people asking how to tell if the
delimiters in a string are balanced.
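For comparison, this is roughly what people have to write today to
answer that question. It is only a sketch: delimiters_balanced is a
hypothetical helper, and the set of delimiters is my own assumption,
not something taken from the RFC.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every opening delimiter in $text is
# closed by its own partner, in the right order, and 0 otherwise.
sub delimiters_balanced {
    my ($text) = @_;
    my %closer_for = ('(' => ')', '[' => ']', '{' => '}');
    my %is_closer  = map { $_ => 1 } values %closer_for;
    my @expected;    # stack of closers we are still waiting for

    for my $char (split //, $text) {
        if (exists $closer_for{$char}) {
            push @expected, $closer_for{$char};
        }
        elsif ($is_closer{$char}) {
            # A closer must match the most recently opened delimiter.
            return 0 unless @expected && pop(@expected) eq $char;
        }
    }
    # Anything left on the stack was opened but never closed.
    return @expected ? 0 : 1;
}

for my $text ("([b - (a + 1)] * 7)", "(b - a + 1] * 7)") {
    my $verdict = delimiters_balanced($text) ? "balanced" : "mismatched";
    print "$text: $verdict\n";
}
```

The first string from your RFC comes out balanced and the second one
mismatched, which is the distinction I would expect \g...\G to help
with.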
(Side note: I'm not sure why you used .*? here instead of .*, since as
I understand your proposal, .* would have done the same thing. I
suggest that you change .*? to .* or else add a remark about why this
would be different.)
Another ambiguity in your proposal: You want
    [\g]
to match any single open delimiter character. But then later on you have
an example where @^g contains the string "/*". What would [\g] do in
this case?
> As it continues scanning, it encounters the "]" between the "f" and the
> ")". The \G does not match this "]" character, because the \g must match
> a ")".
You mean \G here instead of \g, don't you?
> sub parse
> {
> my $string = shift;
> while ($string =~ /([^\g])*(\g)(.*?)(\G)([^\g\G]*)/g)
Don't you mean ([^\g]*) instead of ([^\g])* here?
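The difference is easy to see in plain Perl 5 if a literal "(" stands
in for \g (which does not exist today). This is just an illustration
with made-up variable names:

```perl
use strict;
use warnings;

# "(" stands in for the proposed \g.
my $text = "abc(";

# Group repeated from outside: each iteration overwrites the capture,
# so only the last character before "(" survives.
my ($last_only) = $text =~ /^([^(])*\(/;

# Repetition inside the group: the whole run is captured.
my ($whole_run) = $text =~ /^([^(]*)\(/;

print "([^(])* captured '$last_only'\n";    # captured 'c'
print "([^(]*) captured '$whole_run'\n";    # captured 'abc'
```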