RFC 145 (v1) Brace-matching for Perl Regular Expressions

Perl6 RFC Librarian Thu, 24 Aug 2000 08:38:10 -0700
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Brace-matching for Perl Regular Expressions

=head1 VERSION

  Maintainer: Eric J. Roode <[EMAIL PROTECTED]>
  Date: 24 Aug 2000
  Version: 1
  Mailing List: [EMAIL PROTECTED]
  Number: 145

=head1 ABSTRACT

It is quite difficult to match paired characters in Perl 5 regular
expressions. A solution is proposed, using new \g (match opening grouping
character) and \G (match closing grouping character) metacharacters.
Two new special variables, @^g and @^G, control which strings are
considered grouping characters and what their complements are.

=head1 DESCRIPTION

A new regular expression metacharacter, \g, would match any one of the
following opening characters: ([{"'<. A later \G metacharacter would
match the corresponding closing character, one of )]}"'>, I<at the same
nesting level> within the string being searched.

For example,

    $string = "([b - (a + 1)] * 7)";
    $string =~ /\g.*?\G/;

The \g would match the first open parenthesis, the .*? would match the
substring "[b - (a + 1)] * 7", and the \G would match the second close
parenthesis.
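
For comparison, the same result can be approximated in Perl 5.10 or later with a recursive pattern. This is only a sketch of the intended behavior, not the proposed mechanism: it is hard-wired to parentheses, and the names $paren_group and first_paren_group are made up for illustration.

```perl
use strict;
use warnings;

# Perl 5.10+ recursive pattern, limited to parentheses for brevity.
# Brackets, braces, and quotes are treated as ordinary text, which mirrors
# the "like characters only" counting described in this RFC.
my $paren_group = qr{
    (?<paren>               # named group, so it can call itself
        \(                  # opening parenthesis
        (?: [^()]++         # a run of non-parenthesis characters
          | (?&paren)       # or a complete nested group
        )*
        \)                  # closer at the same nesting level
    )
}x;

# Returns the first balanced group and its contents, or an empty list.
sub first_paren_group {
    my ($string) = @_;
    return unless $string =~ $paren_group;
    my $whole = $+{paren};
    return ($whole, substr($whole, 1, -1));
}

my ($whole, $inner) = first_paren_group("([b - (a + 1)] * 7)");
print "$whole\n$inner\n";
```

Here $inner corresponds to what .*? would consume between \g and \G in the example above.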

Used within a character class (square brackets in a regular expression),
\g would match I<any> opening grouping character, and \G would match I<any>
closing grouping character. Thus, [^\g\G]* would match a span of
non-grouping characters.
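
In current Perl 5, roughly the same character class has to be spelled out by hand. The sketch below builds it from two ordinary arrays that stand in for the proposed special variables; @open, @close, $class, and leading_plain_span are illustrative names only.

```perl
use strict;
use warnings;

# Stand-ins for the proposed @^g and @^G defaults (ordinary arrays here).
my @open  = ('(', '{', '[', '"', "'", '<');
my @close = (')', '}', ']', '"', "'", '>');

# Escape each character and join them into the body of a character class.
my $class = join '', map { quotemeta($_) } @open, @close;

# Returns the leading run of characters that are not grouping characters,
# i.e. what [^\g\G]* would match at the start of the string.
sub leading_plain_span {
    my ($string) = @_;
    my ($span) = $string =~ /^([^${class}]*)/;
    return $span;
}

print leading_plain_span("a + 1) * 7"), "\n";
```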

What exactly is matched by \g and \G is controlled by two new special
variables, @^g and @^G, which are arrays of strings. The default values
of these variables are:

    @^g = ('(', '{', '[', '"', "'", '<');
    @^G = (')', '}', ']', '"', "'", '>');

One could restrict the arrays for various parsing situations:

    {
        local @^g = ( '(' );
        local @^G = ( ')' );
        ...
    }

Or one could override the arrays completely:

    {
        local @^g = ( '/*', '"' );
        local @^G = ( '*/', '"' );
        $c_code_snippet =~ /\g(.*?)\G/;  # Find string or comment
    }
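
As a rough Perl 5 equivalent of that last snippet, one alternation can be generated per opener/closer pair. This sketch assumes Perl 5.10 or later (for branch reset) and does no nesting-level tracking; the arrays and first_string_or_comment are illustrative names, not part of the proposal.

```perl
use strict;
use warnings;

# Ordinary arrays standing in for the proposed @^g and @^G.
my @open  = ('/*', '"');
my @close = ('*/', '"');

# One alternative per pair: opener, lazily captured body, closer.
my $alternatives = join '|',
    map { quotemeta($open[$_]) . '(.*?)' . quotemeta($close[$_]) } 0 .. $#open;

# (?|...) is branch reset (Perl 5.10+): every alternative captures into $1.
my $string_or_comment = qr/(?|$alternatives)/s;

# Returns the body of the first string or comment, or undef if none is found.
sub first_string_or_comment {
    my ($code) = @_;
    my ($body) = $code =~ $string_or_comment;
    return $body;
}

my $c_code_snippet = 'x = 1; /* set x */ puts("hi");';
print "[", first_string_or_comment($c_code_snippet), "]\n";
```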

It is a run-time error to compile a regular expression that contains \g
or \G while the @^g and @^G arrays do not contain the same number of
elements.
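
The length check itself is simple. A minimal sketch follows, written as an ordinary subroutine over two array references; check_grouping_arrays and its error message are hypothetical, and the RFC does not specify the actual wording of the error.

```perl
use strict;
use warnings;

# Dies if the opener and closer lists cannot be paired one-to-one.
# The two array references stand in for the proposed @^g and @^G.
sub check_grouping_arrays {
    my ($open, $close) = @_;
    die sprintf("Grouping arrays differ in length (%d vs %d)\n",
                scalar @$open, scalar @$close)
        unless @$open == @$close;
    return 1;
}

check_grouping_arrays(['(', '['], [')', ']']);          # passes
my $ok = eval { check_grouping_arrays(['(', '['], [')']); 1 };
print "mismatch detected: $@" unless $ok;
```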



=head1 EXAMPLES

    # Recursive processing of nested groupings
    sub parse
    {
        my $string = shift;
        # Each match is: text before a group, opener, contents, closer, trailing text
        while ($string =~ /([^\g]*)(\g)(.*?)(\G)([^\g\G]*)/g)
        {
            my ($pre, $quote, $mid, $endquote, $post) = ($1,$2,$3,$4,$5);
            process ($pre);
            parse ($mid);   # Note recursion
            process ($post);
        }
    }

=head1 IMPLEMENTATION

Disclaimer: I know little about Perl internals, particularly the RE engine.

When the RE engine encounters a \g, it should match if it finds an open
grouping character. From that point forward, it should maintain an internal
count of "like" open and close grouping characters. When it encounters a
\G metacharacter, it should match if it finds a closing grouping character
of the same sort (i.e., the complement to the specific string that was matched
by the \g), I<and> if the nesting level of that pair is zero.

So, in parsing the string "(abc[def](ghi)jkl)" with the RE /\g(.*?)\G/:

    First, \g matches "(". The engine remembers that it is looking for "(" and
    its complement ")".

    Next, as it processes .*?, scanning the string, it encounters the "["
    between the "c" and the "d". It ignores it (it has no effect on the 
    nesting-level count, since it is not "(" or ")").

    As it continues scanning, it encounters the "]" between the "f" and the
    "(". The \G does not match this "]" character, because the \G must match
    a ")".

    Next, as it continues processing the .*?, it encounters the "(" between
    the "]" and the "g". This does match the current grouping set, so the
    engine increments the nesting level to 1.

    Next, it encounters the ")" between the "i" and the "j". The \G does not
    match this ")", because the nesting level is not zero. Having encountered
    the ")", however, the engine decrements the nesting level to 0.

    Finally, it encounters the ")" after the "l". This one does match the \G
    because the nesting level is 0.
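
The walkthrough above can be sketched as a plain scanning loop, outside any regex engine. The sketch assumes single-character delimiters, and it checks for the closer before the opener so that symmetric delimiters such as quotes close at their next occurrence (the RFC does not say how those should nest). %closer_for and find_closer are illustrative names.

```perl
use strict;
use warnings;

# Default pairs from this RFC, keyed by opening character.
my %closer_for = (
    '(' => ')', '{' => '}', '[' => ']',
    '"' => '"', "'" => "'", '<' => '>',
);

# Given the offset of an opening character, return the offset of the
# closer that \G would accept, or -1 if there is none.
sub find_closer {
    my ($string, $open_pos) = @_;
    my $open  = substr($string, $open_pos, 1);
    my $close = $closer_for{$open};
    return -1 unless defined $close;

    my $level = 0;    # nesting level of "like" characters only
    for my $pos ($open_pos + 1 .. length($string) - 1) {
        my $char = substr($string, $pos, 1);
        if ($char eq $close) {
            return $pos if $level == 0;   # \G matches here
            $level--;                     # closes a nested group
        }
        elsif ($char eq $open) {
            $level++;                     # a nested group opens
        }
        # Any other character, including unlike grouping characters,
        # leaves the count unchanged.
    }
    return -1;
}

print find_closer("(abc[def](ghi)jkl)", 0), "\n";
```

With the string from the walkthrough, the opener at offset 0 pairs with the final ")" rather than the one after "ghi".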

Nested \g\G pairs present a problem in that the engine must remember which
pair of grouping characters it is looking for. Example:

Parse the string "(abc[def](ghi)jkl)" with the RE /\g(.*?)\g(.*?)\G(.*?)\G/:

    \g matches "(". Engine remembers that this \g corresponds to "(", ")".

    .*? matches "abc".

    The second \g matches the "[". The engine remembers that I<this> \g  
    corresponds to "[", "]".

    The second .*? matches "def". 

    The \G matches "]", since the second \g was for square brackets, and the
    nesting level for square brackets is zero.

    The third .*? matches "(ghi)jkl". The ")" between the "i" and the "j" 
    does not match the \G pattern because, as in the first example, the
    nesting level is not zero.

    The second \G matches ")".
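
A minimal procedural sketch of this two-level case follows. Each \g remembers its own pair, and the inner group must close before the outer one. It assumes single-character delimiters and takes the first workable opener at each level, which is only an approximation of how a backtracking engine would treat the lazy .*? pieces. All names in it are illustrative.

```perl
use strict;
use warnings;

my %closer_for = (
    '(' => ')', '{' => '}', '[' => ']',
    '"' => '"', "'" => "'", '<' => '>',
);

# Offset of the closer at nesting level zero for the opener at $open_pos,
# counting only "like" characters; -1 if there is none.
sub find_closer {
    my ($string, $open_pos) = @_;
    my $open  = substr($string, $open_pos, 1);
    my $close = $closer_for{$open};
    return -1 unless defined $close;
    my $level = 0;
    for my $pos ($open_pos + 1 .. length($string) - 1) {
        my $char = substr($string, $pos, 1);
        if    ($char eq $close) { return $pos if $level == 0; $level--; }
        elsif ($char eq $open)  { $level++; }
    }
    return -1;
}

# Emulates /\g(.*?)\g(.*?)\G(.*?)\G/ and returns the three captures,
# or an empty list if no such nesting is found.
sub match_nested {
    my ($string) = @_;
    for my $outer (0 .. length($string) - 1) {
        next unless exists $closer_for{ substr($string, $outer, 1) };
        my $outer_end = find_closer($string, $outer);   # pair of first \g
        next if $outer_end < 0;
        for my $inner ($outer + 1 .. $outer_end - 1) {
            next unless exists $closer_for{ substr($string, $inner, 1) };
            my $inner_end = find_closer($string, $inner);   # pair of second \g
            next if $inner_end < 0 || $inner_end >= $outer_end;
            return (
                substr($string, $outer + 1,     $inner - $outer - 1),
                substr($string, $inner + 1,     $inner_end - $inner - 1),
                substr($string, $inner_end + 1, $outer_end - $inner_end - 1),
            );
        }
    }
    return;
}

print join(" | ", match_nested("(abc[def](ghi)jkl)")), "\n";
```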

=head1 PROBLEMS

How should a \G without a prior \g be interpreted in a regular expression?
Probably the same as [\G] (i.e., match any closing character), but perhaps
someone has a different idea.

How should an expression like /\g?.*?\G/ be interpreted? Specifically, what
should the meaning of the \G be if the optional \g does not match?

=head1 REFERENCES

perlre perldoc page for general discussion of existing regexps

Mastering Regular Expressions book

Blue Camel, chapter 2.