Author: larry
Date: Thu Apr 20 02:07:51 2006
New Revision: 8883

Modified:
   doc/trunk/design/syn/S05.pod

Log:
Various clarifications.
Documented that null first alternative is ignored.
Removed colon separator after last modifier, now just use space.
Deleted the :once modifier.  (A state variable suffices.)
A match object in boolean context isn't always forced to be eager.
Added :ratchet and :panic modifiers to limit backtracking in the parser.
Clarified when rules are allowed vs enforced in variable usage.
Added <%a|%b|%c> form for simple longest-token scoping.
Clarified that hash matches skip over key before value is matched.
Documented behavior of $<KEY>.
Added *+ ++ ?+ and :+ to force greed on specific atom.
Added token and parse rule variants for grammar productions.
Added <<<...>>> syntax.


Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod        (original)
+++ doc/trunk/design/syn/S05.pod        Thu Apr 20 02:07:51 2006
@@ -11,11 +11,11 @@
 
 =head1 VERSION
 
-   Maintainer: Patrick Michaud <[EMAIL PROTECTED]>
+   Maintainer: Patrick Michaud <[EMAIL PROTECTED]> (& TimToady)
    Date: 24 Jun 2002
-   Last Modified: 6 Apr 2006
+   Last Modified: 20 Apr 2006
    Number: 5
-   Version: 15
+   Version: 16
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax.  We now try to call them "rules" because they haven't been
@@ -30,8 +30,8 @@
 it doesn't look like it.  The individual capture variables (such as C<$0>,
 C<$1>, etc.) are just elements of C<$/>.
 
-By the way, the numbered capture variables now start at C<$0>, C<$1>,
-C<$2>, etc. See below.
+By the way, the numbered capture variables now start at C<$0> rather than
+C<$1>.  See below.
 
 =head1 Unchanged syntactic features
 
@@ -68,6 +68,8 @@
 =item *
 
 The extended syntax (C</x>) is no longer required...it's the default.
+(In fact, it's pretty much mandatory--the only way to get back to
+the old syntax is with the C<:Perl5>/C<:P5> modifier.)
 
 =item *
 
@@ -78,7 +80,11 @@
 
 There is no C</e> evaluation modifier on substitutions; instead use:
 
-     s/pattern/{ code() }/
+     s/pattern/{ doit() }/
+
+Instead of C</ee> say:
+
+     s/pattern/{ eval doit() }/
 
 =item *
 
@@ -87,8 +93,9 @@
      m:g:i/\s* (\w*) \s* ,?/;
 
 Every modifier must start with its own colon.  The delimiter must be
-separated from the final modifier by a colon or whitespace if it would
-be taken as an argument to the preceding modifier.
+separated from the final modifier by whitespace if it would be taken
+as an argument to the preceding modifier (which is true for any
+bracketing character).
 
 =item *
 
@@ -127,19 +134,13 @@
 
 is roughly equivalent to
 
-     m:p/.*? pattern/
-
-=item *
-
-The new C<:once> modifier replaces the Perl 5 C<?...?> syntax:
+     m:p/.*? <( pattern )> /
 
-     m:once/ pattern /    # only matches first time
+Also note that any rule called as a subrule is implicitly anchored to the
+current position anyway.
 
 =item *
 
-[Note: We're still not sure if :w is ultimately going to work exactly 
-as described below.  But this is how it works for now.]
-
 The new C<:w> (C<:words>) modifier causes whitespace sequences to be
 replaced by C<\s*> or C<\s+> subpattern as defined by the C<< <?ws> >> rule.
 
@@ -164,6 +165,9 @@
 C<< <?ws> >> can't decide what to do until it sees the data.  It still does
 the right thing.  If not, define your own C<< <?ws> >> and C<:w> will use that.
 
+In general you don't need to use C<:w> within grammars because
+the parse rules automatically handle whitespace policy for you.
+
 =item *
 
 New modifiers specify Unicode level:
@@ -177,9 +181,9 @@
 
 =item *
 
-The new C<:perl5> modifier allows Perl 5 regex syntax to be used instead:
+The new C<:Perl5> modifier allows Perl 5 regex syntax to be used instead:
 
-     m:perl5/(?mi)^[a-z]{1,2}(?=\s)/
+     m:Perl5/(?mi)^[a-z]{1,2}(?=\s)/
 
 (It does not go so far as to allow you to put your modifiers at
 the end.)
@@ -194,16 +198,16 @@
 If followed by an C<x>, it means repetition.  Use C<:x(4)> for the
 general form.  So
 
-     s:4x { (<?ident>) = (\N+) $$}{$0 => $1};
+     s:4x [ (<?ident>) = (\N+) $$] [$0 => $1];
 
 is the same as:
 
-     s:x(4) { (<?ident>) = (\N+) $$}{$0 => $1};
+     s:x(4) [ (<?ident>) = (\N+) $$] [$0 => $1];
 
 which is almost the same as:
 
      $_.pos = 0;
-     s:c{ (<?ident>) = (\N+) $$}{$0 => $1} for 1..4;
+     s:c [ (<?ident>) = (\N+) $$] [$0 => $1] for 1..4;
 
 except that the string is unchanged unless all four matches are found.
 However, ranges are allowed, so you can say C<:x(1..4)> to change anywhere
@@ -250,10 +254,15 @@
      $str = "abracadabra";
 
      if $str ~~ m:exhaustive/ a (.*) a / {
-         @substrings = $/.matches();    # br brac bracad bracadabr
-                                        # c cad cadabr d dabr br
+         say "@()";    # br brac bracad bracadabr c cad cadabr d dabr br
      }
 
+Note that the C<~~> above can return as soon as the first match is found,
+and the rest of the matches may be performed lazily by C<@()>.
+
+[Conjecture: the C<:exhaustive> modifier should have an optional argument
+specifying how many seconds to run before giving up, since it's trivially
+easy to ask for the heat death of the universe to happen first.]
 
 =item *
 
@@ -275,7 +284,24 @@
 
 =item *
 
-The C<:i>, C<:w>, C<:perl5>, and Unicode-level modifiers can be
+The new C<:ratchet> modifier causes this rule to not backtrack by default.
+(Generally you do not use this modifier directly, since it's implied by
+C<token> and C<parse> declarations.)  The effect of this modifier is
+to imply a C<:> after every construct that could backtrack, including
+bare C<*>, C<+>, and C<?> quantifiers, as well as alternations.
+
+=item *
+
+The new C<:panic> modifier causes this rule and all invoked subrules
+to try to backtrack on any rules that would otherwise default to
+not backtracking because they have C<:ratchet> set.  Never panic
+unless you're desperate and want the pattern matcher to do a lot of
+unnecessary work.  If you have an error in your grammar, it's almost
+certainly a bad idea to fix it by backtracking.
+
+=item *
+
+The C<:i>, C<:w>, C<:Perl5>, and Unicode-level modifiers can be
 placed inside the rule (and are lexically scoped):
 
      m/:w alignment = [:i left|right|cent[er|re]] /
@@ -297,7 +323,6 @@
 To use parens or brackets for your delimiters you have to separate:
 
          m:fuzzy (pattern);
-         m:fuzzy:(pattern);
 
 or you'll end up with:
 
@@ -346,7 +371,10 @@
 
 =item *
 
-An unescaped C<#> now always introduces a comment.
+An unescaped C<#> now always introduces a comment.  If followed
+by an opening bracket character (and if not in the first column),
+it introduces an embedded comment that terminates with the closing
+bracket.  Otherwise the comment terminates at the newline.
 
 =item *
 
@@ -438,7 +466,7 @@
 so that the closure is never actually run in that case.  But it's
 a closure that must be run in the general case, so you can use
 it to generate a range on the fly based on the earlier matching.
-(Of course, bear in mind the closure is run I<before> attempting to
+(Of course, bear in mind the closure must be run I<before> attempting to
 match whatever it quantifies.)
 
 =item *
@@ -473,7 +501,9 @@
 
      / \Q$var\E /
 
-(To get rule interpolation use an assertion - see below)
+However, if C<$var> contains a rule object, rather than attempting to
+convert it to a string, it is called as if you said C<< <$var> >>.
+See assertions below.
 
 =item *
 
@@ -486,7 +516,8 @@
      / [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
 
 
-As with a scalar variable, each element is matched as a literal.
+As with a scalar variable, each element is matched as a literal unless
+it happens to be a rule object, in which case it is matched as a subrule.
 
 =item *
 
@@ -503,15 +534,23 @@
 
 =item *
 
-If it is a string or rule object, it is executed as a subrule.
+If it is a string, it is matched literally, starting after where the
+key left off matching.
 
 =item *
 
-If it has the value 1, nothing special happens beyond the match.
+If it is a rule object, it is executed as a subrule, with an initial
+position after the matched key.
 
 =item *
 
-Any other value causes the match to fail.
+If it has the value 1, nothing special happens except that the key match
+succeeds.
+
+=item *
+
+Any other value causes the match to fail.  In particular, shorter keys
+are not tried if a longer one matches and fails.
 
 =back
 
@@ -547,6 +586,11 @@
 tree and looking for things in the opposite order going to the left.
 It is illegal to do lookbehind on a pattern that cannot be reversed.
 
+Note: the effect of a forward-scanning lookbehind at the top level
+can be achieved with:
+
+    / .*? prestuff <( mainpat )> /
+
 =item *
 
 A leading C<?> causes the assertion not to capture what it matches (see
@@ -556,28 +600,66 @@
      / <?ident> <ws>  /      # only $/<ws> captured
      / <?ident> <?ws> /      # nothing captured
 
+The non-capturing behavior may be overridden with a C<:keepall>.
+
 =item *
 
 A leading C<$> indicates an indirect rule.  The variable must contain
-either a hard reference to a rule, or a string containing the rule.
+either a rule object, or a string to be compiled as the rule.  The
+string is never matched literally.
 
 =item *
 
 A leading C<::> indicates a symbolic indirect rule:
 
-     / <::($somename)>
+     / <::($somename)> /
 
-The variable must contain the name of a rule.
+The variable must contain the name of a rule.  By the rules of single method
+dispatch this is first searched for in the current grammar and its ancestors.
+If this search fails an attempt is made to dispatch via MMD, in which case
+it can find rules defined as multis rather than methods.
 
 =item *
 
 A leading C<@> matches like a bare array except that each element
-is treated as a rule (string or hard ref) rather than as a literal.
+is treated as a rule (string or rule object) rather than as a literal.
+That is, a string is forced to be compiled as a rule rather than matched
+literally.  (There is no difference for a rule object.)
 
 =item *
 
-A leading C<%> matches like a bare hash except that each key
-is treated as a rule (string or hard ref) rather than as a literal.
+A leading C<%> matches like a bare hash except that each value is always
+treated as a rule, even if it is a string that must be compiled to a rule
+at match time.
+
+With both bare hash and hash in angles, the key is always skipped
+over before calling any rule in the value.  That rule may, however,
+magically access the key anyway as if the rule had started before the
+key and matched with C<< <KEY> >> assertion.  That is, C<< $<KEY> >>
+will contain the keyword or token that this rule was looked up under,
+and that value will be returned by the current match object even if
+you do nothing special with it within the match.  (This also works
+for the name of a macro as seen from an C<is parsed> rule, since
+internally that turns into a hash lookup.)
+
+As with bare hash, the longest key matches according to the longest token
+rule, but in addition, you may combine multiple hashes under the same
+longest-token consideration like this:
+
+    <%statement|%prefix|%term>
+
+This means that, despite being in a later hash, C<< %term<food> >>
+will be selected in preference to C<< %prefix<foo> >> because it's
+the longer token.  However, if there is a tie, the earlier hash wins,
+so C<< %statement<if> >> hides any C<< %prefix<if> >> or C<< %term<if> >>.
+
+In contrast, if you say
+
+    [ <%prefix> | <%term> ]
+
+a C<< %prefix<foo> >> would be selected in preference to a C<< %term<food> >>.
+(Which is not what you usually want if your language is to do longest-token
+consistently.)
 
 =item *
 
@@ -592,7 +674,7 @@
 rule closure binds the I<result object> for this match, ignores the
 rest of the current rule, and reports success:
 
-       / (\d) <{ return $0.sqrt }> NotReached /;
+        / (\d) <{ return $0.sqrt }> NotReached /;
 
 This has the effect of capturing the square root of the numified string,
 instead of the string.  The C<NotReached> part is not reached.
@@ -654,14 +736,16 @@
     / <after foo> \d+ <before bar> /
 
 except that the scan for "foo" can be done in the forward direction,
-while a lookbehind assertion would presumably scan for \d+ and then
-match "foo" backwards.  The use of C<< <(...)> >> affects only the
+while a lookbehind assertion would presumably scan for C<\d+> and then
+match "C<foo>" backwards.  The use of C<< <(...)> >> affects only the
 meaning of the "result object" and the positions of the beginning and
 ending of the match.  That is, after the match above, C<$()> contains
 only the digits matched, and C<.pos> is pointing to after the digits.
 Other captures (named or numbered) are unaffected and may be accessed
 through C<$/>.
 
+It is a syntax error to use an unbalanced C<< <( >> or C<< )> >>.
+
 =item *
 
 A leading C<[> or C<+> indicates an enumerated character class.  Ranges
@@ -717,6 +801,17 @@
 
      / <!before _ > /    # We aren't before an _
 
+Note that C<< <!alpha> >> is different from C<< <-alpha> >> because the
+latter matches C</./> when it is not an alpha.
+
+=item *
+
+Conjecture: Multiple opening angles are matched by a corresponding
+number of closing angles, and otherwise function as single angles.
+This can be used to visually isolate unmatched angles inside:
+
+    <<<Ccode: a >> 1>>>
+
 =back
 
 =head1 Backslash reform
@@ -904,6 +999,49 @@
 causes it to produce a C<Code> or C<Rule> reference, which the switch
 statement then selects upon.
 
+=item *
+
+Just as C<rx> has variants, so does the C<rule> declarator.
+In particular, there are two special variants for use in grammars:
+C<token> and C<parse>.
+
+A token declaration:
+
+    token ident { [ <alpha> | _ ] \w+ }
+
+never backtracks by default.  That is, it likes to commit to whatever
+it has scanned so far.  The above is equivalent to
+
+    rule ident { [ <alpha>: | _ ]: \w+: }
+
+but rather easier to read.  The bare C<*>, C<+> and C<?> quantifiers
+never backtrack in a C<token> unless some outer rule has specified a
+C<:panic> option that applies.  If you want to prevent even that, use
+C<*:>, C<+:> or C<?:> to prevent any backtracking into the quantifier.
+If you want to explicitly backtrack, append either a C<?> or a C<+>
+to the quantifier.   The C<?> forces minimal matching as usual,
+while the C<+> forces greedy matching.  The C<token> declarator is
+really just short for
+
+    rule :ratchet { ... }
+
+The other is the C<parse> declarator, for declaring non-terminal
+productions in a grammar.  It also does not backtrack unless a
+C<:panic> is in effect or you explicitly specify a backtracking
+quantifier.  In addition, a C<parse> rule also assumes C<:words>.
+A C<parse> is really short for:
+
+    rule :ratchet :words { ... }
+
+=item *
+
+The Perl 5 C<?...?> syntax ("match once") was rarely used and can now
+be emulated more cleanly with a state variable:
+
+    (state $x) ||= / pattern /;    # only matches first time
+
+To reset the pattern, simply set C<$x = 0>.
+
 =back
 
 =head1 Backtracking control
@@ -912,14 +1050,40 @@
 
 =item *
 
+By default, backtracking is greedy in C<rx>, C<m>, C<s>, and the
+like.  It's also greedy in ordinary rules.  In C<parse> and C<token>
+declarations, backtracking must be explicit.
+
+=item *
+
+To force the preceding atom to do eager backtracking,
+append a C<:?> or C<?> to the atom.  If the preceding token is
+a quantifier, the C<:> may be omitted, so C<*?> works just as
+in Perl 5.
+
+=item *
+
+To force the preceding atom to do greedy backtracking,
+append a C<:+> or C<+> to the atom.  If the preceding token
+is a quantifier, the C<:> may be omitted.  (Perl 5 has no
+corresponding construct because backtracking always defaults
+to greedy in Perl 5.)
+
+=item *
+
+To force the preceding atom to do no backtracking, use a single C<:>
+without a subsequent C<?> or C<+>.
 Backtracking over a single colon causes the rule engine not to retry
 the preceding atom:
 
-     m:w/ \( <expr> [ , <expr> ]* : \) /
+     m:w/ \( <expr> [ , <expr> ]*: \) /
 
 (i.e. there's no point trying fewer C<< <expr> >> matches, if there's
 no closing parenthesis on the horizon)
 
+To force all the atoms in an expression not to backtrack by default,
+use C<:ratchet> or C<parse> or C<token>.
+
 =item *
 
 Backtracking over a double colon causes the surrounding group of
@@ -931,8 +1095,12 @@
           ]
      /
 
-(i.e. there's no point trying to match a different keyword if one
-was already found but failed).
+(i.e. there's no point trying to match a different keyword if one was
+already found but failed).  Note that you can still back into such an
+alternation, so you may also need to put C<:> after it if you also
+want to disable that.  If an explicit or implicit C<:ratchet> has
+disabled backtracking, you need to put C<:+> after the alternation
+to enable backing into another alternative if the first pick fails.
 
 =item *
 
@@ -993,9 +1161,10 @@
 
 =item *
 
-...so too you can have anonymous rules and I<named> rules:
+...so too you can have anonymous rules and I<named> rules (and tokens,
+and parses):
 
-     rule ident { [<alpha>|_] \w* }
+     token ident { [<alpha>|_] \w* }
 
      # and later...
 
@@ -1007,11 +1176,11 @@
 such as:
 
      rule serial_number { <[A..Z]> \d**{8} }
-     rule type { alpha | beta | production | deprecated | legacy }
+     token type { alpha | beta | production | deprecated | legacy }
 
 in other rules as named assertions:
 
-     rule identification { [soft|hard]ware <type> <serial_number> }
+     parse identification { [soft|hard]ware <type> <serial_number> }
 
 =back
 
@@ -1049,6 +1218,10 @@
 
 This makes it easier to catch errors like this:
 
+    /a|b|c|/
+
+As a special case, however, the first null alternative in a match like
+
      m:w/ [
           | if :: <expr> <block>
           | for :: <list> <block>
@@ -1056,6 +1229,19 @@
           ]
      /
 
+is simply ignored.  Only the first alternative is special that way.
+If you write:
+
+     m:w/ [
+              if :: <expr> <block>              |
+              for :: <list> <block>             |
+              loop :: <loop_controls>? <block>  |
+          ]
+     /
+
+
+it's still an error.
+
 =item *
 
 However, it's okay for a non-null syntactic construct to have a degenerate
@@ -1099,6 +1285,10 @@
      # or:
      /pattern/; if $/ {...}
 
+With C<:global> or C<:overlap> or C<:exhaustive> the boolean is
+allowed to return true on the first match.  The C<Match> object can
+produce the rest of the results lazily if evaluated in list context.
+
 =item *
 
 In string context it evaluates to the stringified value of its
@@ -1121,7 +1311,7 @@
 
 =item *
 
-When used as a scalar, a Match object evaluates to its underlying
+When used as a scalar, a C<Match> object evaluates to its underlying
 result object.  Usually this is just the entire match string, but
 you can override that by calling C<return> inside a rule:
 
@@ -1146,7 +1336,7 @@
 Additionally, the C<Match> object delegates its C<coerce> calls
 (such as C<+$match> and C<~$match>) to its underlying result object.
 The only exception is that C<Match> handles boolean coercion itself,
-which returns whether the match had succeeded.
+which returns whether the match had succeeded at least once.
 
 This means that these two work the same:
 
@@ -1155,7 +1345,7 @@
 
 =item *
 
-When used as an array, a Match object pretends to be an array of all
+When used as an array, a C<Match> object pretends to be an array of all
 its positional captures.  Hence
 
      ($key, $val) = m:w/ (\S+) => (\S+)/;
@@ -1179,11 +1369,13 @@
 
 Note that, as a scalar variable, C<$/> doesn't automatically flatten
 in list context.  Use C<@()> as a shorthand for C<@($/)> to flatten
-the positional captures under list context.
+the positional captures under list context.  Note that a C<Match> object
+is allowed to evaluate its match lazily in list context.  Use C<**@()>
+to force an eager match.
 
 =item *
 
-When used as a hash, a Match object pretends to be a hash of all its named
+When used as a hash, a C<Match> object pretends to be a hash of all its named
 captures.  The keys do not include any sigils, so if you capture to
 variable C<< @<foo> >> its real name is C<$/{'foo'}> or C<< $/<foo> >>.
 However, you may still refer to it as C<< @<foo> >> anywhere C<$/>
@@ -1192,7 +1384,8 @@
 
 Note that, as a scalar variable, C<$/> doesn't automatically flatten
 in list context.  Use C<%()> as a shorthand for C<%($/)> to flatten as a
-hash, or bind it to a variable of the appropriate type.
+hash, or bind it to a variable of the appropriate type.  As with C<@()>,
+it's possible for C<%()> to produce its pairs lazily in list context.
 
 =item *
 
@@ -1240,7 +1433,7 @@
 incomplete C<Match> object (which can be modified via the internal C<$/>).
 For example:
 
-    $str ~~ / foo                # Match 'foo'
+    $str ~~ / foo                 # Match 'foo'
                { $/ = 'bar' }     # But pretend we matched 'bar'
              /;
     say $/;                       # says 'bar'
@@ -1556,7 +1749,9 @@
 
 =item *
 
-Any call to a named C<< <rule> >> within a pattern is known as a I<subrule>.
+Any call to a named C<< <rule> >> within a pattern is known as a
+I<subrule>, whether that rule is actually defined as a C<rule> or
+C<token> or C<parse> or even an ordinary C<method> or C<multi>.
 
 =item *
 
@@ -1599,9 +1794,9 @@
 =item *
 
 The hash entries of a C<Match> object can be referred to using any of the
-standard hash access notations (C<$/{'foo'}>, C<< $/<bar> >>, C<$/�baz�>,
+standard hash access notations (C<$/{'foo'}>, C<< $/<bar> >>, C<$/«baz»>,
 etc.), or else via corresponding lexically scoped aliases (C<< $<foo> >>,
-C<$�bar�>, C<< $<baz> >>, etc.)  So the previous example also implies:
+C<$«bar»>, C<< $<baz> >>, etc.)  So the previous example also implies:
 
       #    $<ident>             $0<ident>
       #     __^__                 __^__
@@ -2334,10 +2529,10 @@
 so too a grammar can collect a set of named rules together:
 
      grammar Identity {
-         rule name :w { Name = (\N+) }
-         rule age  :w { Age  = (\d+) }
-         rule addr :w { Addr = (\N+) }
-         rule desc {
+         parse name { Name = (\N+) }
+         parse age  { Age  = (\d+) }
+         parse addr { Addr = (\N+) }
+         parse desc {
              <name> \n
              <age>  \n
              <addr> \n
@@ -2351,22 +2546,22 @@
 Like classes, grammars can inherit:
 
      grammar Letter {
-         rule text     { <greet> <body> <close> }
+         parse text     { <greet> <body> <close> }
 
-         rule greet :w { [Hi|Hey|Yo] $<to>:=(\S+?) , $$}
+         parse greet { [Hi|Hey|Yo] $<to>:=(\S+?) , $$}
 
-         rule body     { <line>+ }
+         parse body     { <line>+? }
 
-         rule close :w { Later dude, $<from>:=(.+) }
+         parse close { Later dude, $<from>:=(.+) }
 
          # etc.
      }
 
      grammar FormalLetter is Letter {
 
-         rule greet :w { Dear $<to>:=(\S+?) , $$}
+         parse greet { Dear $<to>:=(\S+?) , $$}
 
-         rule close :w { Yours sincerely, $<from>:=(.+) }
+         parse close { Yours sincerely, $<from>:=(.+) }
 
      }
 
@@ -2382,14 +2577,15 @@
 
      grammar Perl {    # Perl's own grammar
 
-         rule prog { <statement>* }
+         parse prog { <statement>* }
 
-         rule statement { <decl>
+         parse statement {
+                   | <decl>
                    | <loop>
                    | <label> [<cond>|<sideff>|;]
          }
 
-         rule decl { <sub> | <class> | <use> }
+         parse decl { <sub> | <class> | <use> }
 
          # etc. etc. etc.
      }
@@ -2439,7 +2635,7 @@
 
      $str.trans( %mapping.pairs.sort );
 
-Use the .= form to do a translation in place:
+Use the C<.=> form to do a translation in place:
 
      $str.=trans( %mapping.pairs.sort );
 
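Editorial sketch (not part of the commit): the longest-token behaviour described in the C<< <%statement|%prefix|%term> >> hunk can be modelled outside Perl 6 as a small lookup. The Python below is only an illustration under stated assumptions. The function names `longest_token` and `first_alternative` and the table contents are hypothetical, and the sketch ignores what happens when the value rule subsequently fails (the text says shorter keys are then not tried).

```python
# Hypothetical model of the longest-token rule for <%a|%b|%c>.
# Table names and contents are illustrative, not taken from S05.

def longest_token(text, pos, *tables):
    """Return (table_index, key, value) for the longest key matching at pos.

    Ties on key length go to the earliest table, as the text describes.
    Returns None if no key in any table matches.
    """
    best = None
    for index, table in enumerate(tables):
        for key, value in table.items():
            if text.startswith(key, pos):
                # Strictly longer wins; equal length keeps the earlier table.
                if best is None or len(key) > len(best[1]):
                    best = (index, key, value)
    return best


def first_alternative(text, pos, *tables):
    """Model [ <%a> | <%b> ]: each table does its own longest-token match,
    and the first table with any match wins, regardless of key length."""
    for index, table in enumerate(tables):
        hit = longest_token(text, pos, table)
        if hit is not None:
            return (index, hit[1], hit[2])
    return None


statement = {"if": "statement-if"}
prefix = {"foo": "prefix-foo", "if": "prefix-if"}
term = {"food": "term-food", "if": "term-if"}

# Combined hashes: the longer key in the later table is preferred.
print(longest_token("food bar", 0, statement, prefix, term))
# Separate alternatives: the earlier table wins with its shorter key.
print(first_alternative("food bar", 0, prefix, term))
# Equal-length keys: the earliest table hides the others.
print(longest_token("if x", 0, statement, prefix, term))
```

The first call picks the `term` entry for "food", the second picks the `prefix` entry for "foo", and the third picks the `statement` entry for "if", mirroring the three cases the hunk spells out.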
