Author: larry
Date: Wed Mar 19 09:39:02 2008
New Revision: 14525

Modified:
   doc/trunk/design/syn/S05.pod

Log:
Add <*abc> form for sequential optional characters


Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod        (original)
+++ doc/trunk/design/syn/S05.pod        Wed Mar 19 09:39:02 2008
@@ -14,9 +14,9 @@
    Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
                Larry Wall <[EMAIL PROTECTED]>
    Date: 24 Jun 2002
-   Last Modified: 17 Mar 2008
+   Last Modified: 19 Mar 2008
    Number: 5
-   Version: 74
+   Version: 75
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax.  We now try to call them I<regex> rather than "regular
@@ -1145,32 +1145,6 @@
 
 =item *
 
-The special named assertions include:
-
-     / <?before pattern> /    # lookahead
-     / <?after pattern> /     # lookbehind
-
-     / <?same> /              # true between two identical characters
-
-     / <.ws> /                # match "whitespace":
-                              #   \s+ if it's between two \w characters,
-                              #   \s* otherwise
-
-     / <?at($pos)> /          # match only at a particular StrPos
-                              # short for <?{ .pos === $pos }>
-                              # (considered declarative until $pos changes)
-
-The C<after> assertion implements lookbehind by reversing the syntax
-tree and looking for things in the opposite order going to the left.
-It is illegal to do lookbehind on a pattern that cannot be reversed.
-
-Note: the effect of a forward-scanning lookbehind at the top level
-can be achieved with:
-
-    / .*? prestuff <( mainpat )> /
-
-=item *
-
 A leading C<.> causes a named assertion not to capture what it matches (see
 L<Subrule captures>. For example:
 
@@ -1225,7 +1199,8 @@
 This assertion is not automatically captured.
 
 As with bare hash, the longest key matches according to the venerable
-I<longest-token rule>.
+I<longest-token rule>.  [Conjecture: <%foo> may not be supported in 6.0, or
+may be retargeted to matching an abbreviation table.]
 
 =item *
 
@@ -1366,6 +1341,90 @@
     <.alpha>    # match a letter, don't capture
     <?alpha>    # match null before a letter, don't capture
 
+The special named assertions include:
+
+     / <?before pattern> /    # lookahead
+     / <?after pattern> /     # lookbehind
+
+     / <?same> /              # true between two identical characters
+
+     / <.ws> /                # match "whitespace":
+                              #   \s+ if it's between two \w characters,
+                              #   \s* otherwise
+
+     / <?at($pos)> /          # match only at a particular StrPos
+                              # short for <?{ .pos === $pos }>
+                              # (considered declarative until $pos changes)
+
+The C<after> assertion implements lookbehind by reversing the syntax
+tree and looking for things in the opposite order going to the left.
+It is illegal to do lookbehind on a pattern that cannot be reversed.
+
+Note: the effect of a forward-scanning lookbehind at the top level
+can be achieved with:
+
+    / .*? prestuff <( mainpat )> /
+
+=item *
+
+A leading C<*> indicates that the following pattern allows a
+partial match.  It always succeeds after matching as many characters
+as possible.  (It is not zero-width unless 0 characters match.)
+For instance, to match a number of abbreviations, you might write
+any of:
+
+    s/ ^ G<*n|enesis>     $ /gen/  or
+    s/ ^ Ex<*odus>        $ /ex/   or
+    s/ ^ L<*v|eviticus>   $ /lev/  or
+    s/ ^ N<*m|umbers>     $ /num/  or
+    s/ ^ D<*t|euteronomy> $ /deut/ or
+    ...
+
+    / (<* <foo bar baz> >) /
+
+    / <[EMAIL PROTECTED]> / and return %long{$<short>} || $<short>;
+
+The pattern is restricted to declarative forms that can be rewritten
+as nested optional character matches.  Sequence information
+may not be discarded while making all following characters optional.
+That is, it is not sufficient to rewrite:
+
+    <*xyz>
+
+as:
+
+    x? y? z?            # bad, would allow xz
+
+Instead, it must be implemented as:
+
+    [x [y z?]?]?        # allow only x, xy, xyz (and '')
+
+Explicit quantifiers are allowed on single characters, so this:
+
+    <* a b+ c | ax*>
+
+is rewritten as something like:
+
+    [a [b+ c?]?]? | [a x*]?
+
+In the latter example we're assuming the DFA token matcher is going to
+give us the longest match regardless.  It's also possible that quantified
+multichar sequences can be recursively remapped:
+
+    <* 'ab'+>     # match a, ab, ababa, etc. (but not aab!)
+    ==> [ 'ab'* <*ab> ]
+    ==> [ 'ab'* [a b?]? ]
+
+[Conjecture: depending on how fancy we get, we might (or might not)
+be able to autodetect ambiguities in C<< <[EMAIL PROTECTED]> >> and refuse to
+generate ambiguous abbreviations (although exact match of a shorter
+abbrev should always be allowed even if it's the prefix of a longer
+abbreviation).  If it is not possible, then the user will have to
+check for ambiguities after the match. Note also that the array
+form is assuming the array doesn't change often.  If it does, the
+longest-token matcher has to be recalculated, which could get
+expensive.]
+
 =item *
 
 A leading C<~~> indicates a recursive call back into some or all of

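The nested-optional rewrite that the new text requires for <*xyz> can be
sketched outside Perl 6.  The following Python is only an illustration under
stated assumptions: the helper names partial_match_pattern and matches_prefix
are hypothetical, it handles plain literal strings only (no alternation or
quantifiers), and it is not how any Perl 6 implementation builds its matcher.

```python
import re


def partial_match_pattern(literal: str) -> str:
    """Return a regex source that matches any leading prefix of `literal`.

    Hypothetical helper: 'xyz' becomes '(?:x(?:y(?:z)?)?)?', the Python
    spelling of the nested form [x [y z?]?]? shown in the diff.
    """
    pattern = ""
    # Wrap from the last character outward, so a character is reachable
    # only when every character before it has already matched.
    for ch in reversed(literal):
        pattern = "(?:" + re.escape(ch) + pattern + ")?"
    return pattern


def matches_prefix(literal: str, text: str) -> bool:
    """True if `text` is '' or a leading prefix of `literal`."""
    return re.fullmatch(partial_match_pattern(literal), text) is not None


if __name__ == "__main__":
    for candidate in ["", "x", "xy", "xyz", "xz", "yz"]:
        print(repr(candidate), matches_prefix("xyz", candidate))
    # The flat rewrite discards sequence information and accepts "xz":
    print(re.fullmatch("x?y?z?", "xz") is not None)
```

The flat form x? y? z? accepts "xz" because each character is optional
independently of the others.  The nested form makes each character conditional
on the one before it, which is the sequence information the text says must not
be discarded.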