On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
> I have looked through the latest 
> revisions of Apo05 and Syn05 (from Dec 2004) and come up with the 
> following list:
> 
>   http://japhy.perlmonk.org/perl6/rules.txt

I'll review the list below, but it's also worthwhile to read

   http://www.nntp.perl.org/group/perl.perl6.language/21120

which is Larry's latest missive on character classes, and

   http://www.nntp.perl.org/group/perl.perl6.language/20985

which describes the capturing semantics (but be sure to note
the lengthy threads that follow concerning changes in the
indexing from $1, $2, ... to $0, $1, ... ).

Here are my comments on the table at http://japhy.perlmonk.org/perl6/rules.txt,
downloaded 26-May 1526 UTC:

        CHAR    EXAMPLE         IMPL    DESCRIPTION
        ===========================================
        &       a&b             N       conjunction 
                &var            N       subroutine

I'm not sure that "&var" means subroutine anymore.  A05 does mention
it, but S05 does not, and I think it invites way too much confusion
with conjunctions.  Consider "a&var($x|$y)" versus "a & var ( $x | $y )".
But if we are allowing &var (and I hope we do not), then the parens are 
required.

        x*              Y       previous atom 0 or more times
        x**{n..m}       N       previous atom n..m times

Keep in mind that the "n..m" can actually be any sort of closure
(although it's not implemented that way yet in PGE).  The rules
engine will generally optimize parsing and handling of "n..m" when
it can (e.g., when "n" and "m" are both constants).
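
To make the constant-versus-closure distinction concrete, here is a rough
analogue in Python (not PGE code, and not Perl 6 syntax). The helper name
compile_repeat and the idea of passing a callable for the bounds are
assumptions made purely for illustration: constant bounds let the pattern be
built once, while a callable has to be consulted on every match.

```python
import re

def compile_repeat(atom, bounds):
    """Return a matcher for `atom` repeated n..m times.

    `atom` is a Python regex fragment. `bounds` is either a constant
    (n, m) pair or a zero-argument callable returning one; the callable
    stands in for a closure. Both forms are hypothetical illustrations.
    """
    if callable(bounds):
        # Bounds are only known at match time, so rebuild on each call.
        def match_dynamic(text):
            n, m = bounds()
            pattern = "(?:%s){%d,%d}" % (atom, n, m)
            return re.fullmatch(pattern, text) is not None
        return match_dynamic

    # Constant bounds: compile once and reuse the compiled pattern.
    n, m = bounds
    compiled = re.compile("(?:%s){%d,%d}" % (atom, n, m))
    return lambda text: compiled.fullmatch(text) is not None
```

With constant bounds, compile_repeat("x", (2, 3)) accepts "xx" and "xxx" only;
with a callable, the accepted range can change between calls.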

        (       (x)             Y       capture 'x'
        )                       Y       must match opening '('

It may be worth noting that parens not only capture, they also 
introduce a new scope for any nested subpattern and subrule captures.

        :ignorecase     N       case insensitivity :i
        :global         N       match globally :g
        :continue       N       start scanning after previous match :c
        ...etc

I'm not sure these are "tokens" in the sense of "single unit of purpose"
in your original message.  I think these are all adverbs, and the "token"
is just the initial C<:> at the beginning of a group.

        :keepall        N       all rules and invoked rules remember everything

That's now ":parsetree" according to Damian's proposed capture rules.

        <commit>        N       backtracking fails completely
        <cut>           N       remove what matched up to this point from the string
        <after P>       N       we must be after the pattern P
        <!after P>      N       we must NOT be after the pattern P
        <before P>      N       we must be before the pattern P
        <!before P>     N       we must NOT be before the pattern P

As with ':words', etc., I'm not sure that these qualify as "tokens"
when parsing the regex -- the tokens are actually "<" or "<!" and
indicate a call to a subrule of some sort, and these are just predefined
rules.  The rules parser and engine may indeed tokenize them for 
optimization purposes, but I don't think the language defines them 
as fundamental "tokens", and someone is free to override the predefined
rules with their own.  (Perhaps <cut> and <commit> cannot be overridden.)

        <?ws>           N       match whitespace by :w rules
        <?sp>           N       match a space character (chr 32 ONLY)

Here the token is "<?", indicating a non-capturing subrule.

        <$rule>         N       indirect rule 
        <::$rulename>   N       indirect symbolic rule 
        <@rules>        N       like '@rules'
        <%rules>        N       like '%rules'
        <{ code }>      N       code produces a rule
        <&foo()>        N       subroutine returns rule
        <( code )>      N       code must return true or backtracking ensues

Here the leading tokens are actually "<$", "<::$", "<@", "<%", "<{", "<&", 
and "<(", and I suspect we have "<?$", "<?::$", "<?@", and "<!$", "<!::$",
"<!@", etc. counterparts.  Of course, one could claim that these are
really separated as in "<", "?", and "$" tokens, but PGE's parser currently
treats them as a unit to make it easier to jump directly into the correct
handler for what follows.
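
The "treat the leading token as a unit" idea can be sketched as a
longest-prefix lookup. This is an illustration in Python, not PGE's actual
parser: the table contents come from the tokens listed above, but the handler
names and the function leading_token are made up for the example.

```python
# Hypothetical sketch: dispatch on the token that opens a <...> construct.
# Handler names are illustrative only; they are not PGE's.
LEADING_TOKENS = {
    "<$": "indirect_rule",
    "<::$": "indirect_symbolic_rule",
    "<@": "array_of_rules",
    "<%": "hash_of_rules",
    "<{": "code_produces_rule",
    "<&": "sub_returns_rule",
    "<(": "code_assertion",
    "<?": "noncapturing_subrule",
    "<!": "negated_subrule",
    "<": "capturing_subrule",
}

# Try longer tokens first so "<::$" or "<?" is never read as a bare "<".
_BY_LENGTH = sorted(LEADING_TOKENS, key=len, reverse=True)

def leading_token(pattern, pos=0):
    """Return (token, handler_name) for the construct starting at pos."""
    for tok in _BY_LENGTH:
        if pattern.startswith(tok, pos):
            return tok, LEADING_TOKENS[tok]
    return None, None
```

For example, leading_token("<::$rulename>") selects the "<::$" entry, while
leading_token("<rule>") falls through to the plain "<" entry.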

        <[a-z]>         N       character class
        <+alpha>        N       character class
        <-[a-z]>        N       complemented character class

The tokens for character class manipulation are currently "<[", "<+",
and "<-", although that's not officially documented in A05 or S05 yet.
Also, ranges are now <[a..z]> -- an unescaped hyphen appearing in an
enumerated character class generates a warning.

        <+\w-[0-9]>     N       character class "arithmetic"

I'm not sure that it's been decided/documented that \w, \s, etc.
can appear in character class arithmetic (although it seems like it
should).
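
For comparison only, here is how the same "word characters minus digits" set
can be expressed with Python's re module, which has no class arithmetic of its
own. This is an analogue of the intent of <+\w-[0-9]>, not a statement about
how Perl 6 rules will treat it; the names WORD_MINUS_DIGITS and in_class are
made up for the example.

```python
import re

# Analogue of <+\w-[0-9]>: a word character that is not an ASCII digit.
# The negative lookahead performs the "subtraction" before \w consumes
# the character.
WORD_MINUS_DIGITS = re.compile(r"(?![0-9])\w")

def in_class(ch):
    """True if the single character ch is in \\w but not in [0-9]."""
    return WORD_MINUS_DIGITS.fullmatch(ch) is not None
```

Here in_class("a") and in_class("_") are true, while in_class("7") and
in_class("-") are false.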

        <prop:X>        N       Unicode property match
        <-prop:X>       N       complemented Unicode property match

Here "prop" is just a subrule (or character class) similar to
<+alpha>, <+digit>, etc.  Also, note that <prop:X> is a capturing
subrule, while <+prop:X> would be a character class match (and presumably
not capture).

        <rule>          N       match rule (and capture to $rule)
        <?rule>         N       match rule (don't capture)
        <<rule>>        N       match rule (don't capture)

Do we still have the <<rule>> syntax, or was that abandoned in
favor of <?rule> ?  (I know there are still some remnants of <<...>>
in S05 and A05, but I'm not sure they're intentional.)

> Thanks for your help.  Unless you're difficult.

    "You're welcome"  unless $Pm ~~ /<?difficult>/;

Pm
