Author: larry
Date: Mon Mar 26 17:58:55 2007
New Revision: 14354
Modified:
doc/trunk/design/syn/S05.pod
Log:
Suggestions from TheDamian++ and Juerd++ and others.
All punctuation is now treated as potentially meta.
<'foo'> form is gone; just use 'foo'.
Conjectural syntax for positive and negative submatches of isolated substring.
Conjectural syntax for recursive calls to anonymous substructures.
< a b c > is now just a list of strings.
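
A rough illustration of the classification rule this revision adds to S05 (identifier glyphs are literal; everything else, including whitespace, is potentially metasyntactic). This is only a Python sketch, not anything from the Perl 6 implementation. The names is_literal_glyph and classify are made up for the example, and it assumes the "base character" of a glyph can be approximated by the first code point of its NFD decomposition.

```python
import unicodedata


def is_literal_glyph(glyph: str) -> bool:
    """True if the glyph would be self-matching under the new S05 rule.

    Assumption: the base character is taken to be the first code point
    of the NFD-decomposed glyph. The glyph must be non-empty.
    """
    base = unicodedata.normalize("NFD", glyph)[0]
    # Underscore, or a Unicode general category starting with L or N.
    return base == "_" or unicodedata.category(base)[0] in ("L", "N")


def classify(glyph: str) -> str:
    """Label a glyph as 'literal' or 'metasyntactic' (hypothetical helper)."""
    return "literal" if is_literal_glyph(glyph) else "metasyntactic"


if __name__ == "__main__":
    for g in ["a", "1", "_", "*", "$", ".", " ", "-", "!"]:
        print(repr(g), classify(g))
```

Under this sketch the alphanumerics and underscore come out as literal, while punctuation and whitespace come out as metasyntactic, which is the split the new "Simplified lexical parsing" section describes.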
Modified: doc/trunk/design/syn/S05.pod
==
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Mon Mar 26 17:58:55 2007
@@ -14,9 +14,9 @@
Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
Larry Wall <[EMAIL PROTECTED]>
Date: 24 Jun 2002
- Last Modified: 9 Feb 2007
+ Last Modified: 26 Mar 2007
Number: 5
- Version: 54
+ Version: 55
This document summarizes Apocalypse 5, which is about the new regex
syntax. We now try to call them I<regex> rather than "regular
@@ -77,6 +77,57 @@
of declarative and procedural matching so that we can have the
best of both. See the section below on "Longest-token matching".
+=head1 Simplified lexical parsing
+
+Unlike traditional regular expressions, Perl 6 does not require
+you to memorize an arbitrary list of metacharacters. Instead it
+classifies characters by a simple rule. All glyphs (graphemes)
+whose base characters are either the underscore (C<_>) or have
+a Unicode classification beginning with 'L' (i.e. letters) or 'N'
+(i.e. numbers) are always literal (i.e. self-matching) in regexes. They
+must be escaped with a C<\> to make them metasyntactic (in which
+case that single alphanumeric character is itself metasyntactic,
+but any immediately following alphanumeric character is not).
+
+All other glyphs--including whitespace--are exactly the opposite:
+they are always considered metasyntactic (i.e. non-self-matching) and
+must be escaped or quoted to make them literal. As is traditional,
+they may be individually escaped with C<\>, but in Perl 6 they may
+also be quoted as follows.
+
+Sequences of one or more glyphs of either type (i.e. any glyphs at all)
+may be made literal by placing them inside single quotes. (Double
+quotes are also allowed, with the usual interpolative semantics.)
+Quotes create a quantifiable atom, so while
+
+moose*
+
+quantifies only the 'e' and matches "mooseee", saying
+
+'moose'*
+
+quantifies the whole string and would match "moosemoose".
+
+Here is a table that summarizes the distinctions:
+
+                    Alphanumerics    Non-alphanumerics          Mixed
+
+ Literal glyphs      a    1    _     \*   \$   \.   \\   \'     K\-9\!
+ Metasyntax         \a   \1   \_      *    $    .    \    '     \K-\9!
+ Quoted glyphs      'a'  '1'  '_'    '*'  '$'  '.'  '\\' '\''   'K-9!'
+
+In other words, identifier glyphs are literal (or metasyntactic when
+escaped), non-identifier glyphs are metasyntactic (or literal when
+escaped), and single quotes make everything inside them literal.
+
+Note, however, that not all non-identifier glyphs are currently
+meaningful as metasyntax in Perl 6 regexes (e.g. C<\1> C<\_> C<->
+C<!>). It is more accurate to say that all unescaped non-identifier
+glyphs are I<potential> metasyntax, and reserved for future use.
+If you use such a sequence, a helpful compile-time error is issued
+indicating that you either need to quote the sequence or define a new
+operator to recognize it.
+
=head1 Modifiers
=over
@@ -240,23 +291,27 @@
The C<:s> modifier is considered sufficiently important that
match variants are defined for them:
- ms/match some words/                        # same as m:sigspace
+ mm/match some words/                        # same as m:sigspace
  ss/match some words/replace those words/    # same as s:sigspace
-Conjecture: This might become sufficiently idiomatic that C<ms//> would
-be better as a "stuttered" C<mm//> instead, much as C<qq//> became idiomatic.
-It would also match C<ss///> that way.
-
=item *
New modifiers specify Unicode level:
- m:bytes / .**{2} / # match two bytes
- m:codes / .**{2} / # match two codepoints
- m:graphs/ .**{2} / # match two graphemes
- m:langs / .**{2} / # match two language dependent chars
-
-There are corresponding pragmas to default to these levels.
+ m:bytes / .**{2} / # match two bytes
+ m:codes / .**{2} / # match two codepoints
+ m:graphs / .**{2} / # match two language-independent graphemes
+ m:chars / .**{2} / # match two characters at current max level
+
+There are corresponding pragmas to default to these levels. Note that
+the C<:chars> modifier is always redundant because dot always matches
+characters at the highest level allowed in scope. This highest level
+may be identical to one of the other three levels, or it may be more
+specific than C<:graphs> when a particular language's character rules
+are