On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
> I have looked through the latest
> revisions of Apo05 and Syn05 (from Dec 2004) and come up with the
> following list:
>
> http://japhy.perlmonk.org/perl6/rules.txt

I'll review the list below, but it's also worthwhile to read

    http://www.nntp.perl.org/group/perl.perl6.language/21120

which is Larry's latest missive on character classes, and

    http://www.nntp.perl.org/group/perl.perl6.language/20985

which describes the capturing semantics (but be sure to note the
lengthy threads that follow concerning changes in the indexing
from $1, $2, ... to $0, $1, ... ).

Here are my comments on the table at
http://japhy.perlmonk.org/perl6/rules.txt, downloaded 26-May 1526 UTC:

    CHAR  EXAMPLE         IMPL  DESCRIPTION
    ===========================================
    &     a&b             N     conjunction
          &var            N     subroutine

I'm not sure that "&var" means subroutine anymore.  A05 does mention
it, but S05 does not, and I think it invites way too much confusion
with conjunctions.  Consider "a&var($x|$y)" versus
"a & var ( $x | $y )".  But if we are allowing &var (and I hope we
do not), then the parens are required.

          x*              Y     previous atom 0 or more times
          x**{n..m}       N     previous atom n..m times

Keeping in mind that the "n..m" can actually be any sort of closure
(although it's not implemented that way yet in PGE).  The rules
engine will generally optimize parsing and handling of "n..m" when
it can (e.g., when "n" and "m" are both constants).

    (     (x)             Y     capture 'x'
    )                     Y     must match opening '('

It may be worth noting that parens not only capture, they also
introduce a new scope for any nested subpattern and subrule captures.

          :ignorecase     N     case insensitivity
          :i
          :global         N     match globally
          :g
          :continue       N     start scanning after previous match
          :c
          ...etc

I'm not sure these are "tokens" in the sense of "single unit of
purpose" in your original message.  I think these are all adverbs,
and the "token" is just the initial C<:> at the beginning of a group.

          :keepall        N     all rules and invoked rules remember everything

That's now ":parsetree" according to Damian's proposed capture rules.

          <commit>        N     backtracking fails completely
          <cut>           N     remove what matched up to this point from the string
          <after P>       N     we must be after the pattern P
          <!after P>      N     we must NOT be after the pattern P
          <before P>      N     we must be before the pattern P
          <!before P>     N     we must NOT be before the pattern P

As with ':words', etc., I'm not sure that these qualify as "tokens"
when parsing the regex -- the tokens are actually "<" or "<!" and
indicate a call to a subrule of some sort, and these are just
predefined rules.  The rules parser and engine may indeed tokenize
them for optimization purposes, but I don't think the language
defines them as fundamental "tokens", and someone is free to
override the predefined rules with their own.  (Perhaps <cut> and
<commit> cannot be overridden.)

          <?ws>           N     match whitespace by :w rules
          <?sp>           N     match a space character (chr 32 ONLY)

Here the token is "<?", indicating a non-capturing subrule.

          <$rule>         N     indirect rule
          <::$rulename>   N     indirect symbolic rule
          <@rules>        N     like '@rules'
          <%rules>        N     like '%rules'
          <{ code }>      N     code produces a rule
          <&foo()>        N     subroutine returns rule
          <( code )>      N     code must return true or backtracking ensues

Here the leading tokens are actually "<$", "<::$", "<@", "<%", "<{",
"<&", and "<(", and I suspect we have "<?$", "<?::$", "<?@", and
"<!$", "<!::$", "<!@", etc. counterparts.  Of course, one could
claim that these are really separated as in "<", "?", and "$"
tokens, but PGE's parser currently treats them as a unit to make it
easier to jump directly into the correct handler for what follows.

          <[a-z]>         N     character class
          <+alpha>        N     character class
          <-[a-z]>        N     complemented character class

The tokens for character class manipulation are currently "<[",
"<+", and "<-", although that's not officially documented in A05 or
S05 yet.  Also, ranges are now <[a..z]> -- an unescaped hyphen
appearing in an enumerated character class generates a warning.

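To illustrate what "treats them as a unit" could look like, here is a
rough Python sketch of longest-prefix dispatch on the opening
characters of a <...> construct.  It is not PGE's actual code: the
token strings come from the lists above, while the function name
leading_token and the description strings are hypothetical, made up
for this example only.

```python
# Hypothetical sketch, NOT PGE's implementation: map each leading token
# to a label standing in for "the correct handler for what follows".
LEADING_TOKENS = {
    "<::$": "indirect symbolic rule",
    "<$": "indirect rule",
    "<@": "array of rules",
    "<%": "hash of rules",
    "<{": "code producing a rule",
    "<&": "subroutine returning a rule",
    "<(": "code assertion",
    "<[": "enumerated character class",
    "<+": "character class",
    "<-": "complemented character class",
    "<?": "non-capturing subrule",
    "<!": "negated subrule",
    "<": "capturing subrule",
}

# Try longer tokens first so that "<::$" or "<?" wins over plain "<".
_BY_LENGTH = sorted(LEADING_TOKENS, key=len, reverse=True)


def leading_token(pattern, pos=0):
    """Return (token, label) for the construct starting at pos, else None."""
    for tok in _BY_LENGTH:
        if pattern.startswith(tok, pos):
            return tok, LEADING_TOKENS[tok]
    return None
```

With this table, leading_token("<?ws>") picks "<?" rather than "<",
so the caller can jump straight to the non-capturing handler without
re-examining the "?".
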
          <+\w-[0-9]>     N     character class "arithmetic"

I'm not sure that it's been decided/documented that \w, \s, etc.
can appear in character class arithmetic (although it seems like it
should).

          <prop:X>        N     Unicode property match
          <-prop:X>       N     complemented Unicode property match

Here "prop" is just a subrule (or character class) similar to
<+alpha>, <+digit>, etc.  Also, note that <prop:X> is a capturing
subrule, while <+prop:X> would be a character class match (and
presumably not capture).

          <rule>          N     match rule (and capture to $rule)
          <?rule>         N     match rule (don't capture)
          <<rule>>        N     match rule (don't capture)

Do we still have the <<rule>> syntax, or was that abandoned in favor
of <?rule> ?  (I know there are still some remnants of <<...>> in
S05 and A05, but I'm not sure they're intentional.)

> Thanks for your help.  Unless you're difficult.

    "You're welcome" unless $Pm ~~ /<?difficult>/;

Pm