The Parrot Grammar Engine (PGE) has seen a lot of substantial updates, so I figure it's a good time for a summary report. The latest version (in the Parrot CVS trunk) now incorporates:

  - subrules
  - capture aliases (named and numbered)
  - the :w modifier and <?ws> subrules
  - word boundaries (\b and \B)
  - simple backreferences ($1, $2, $<term>, etc.)

Also, more tests have been added to t/p6rules, but we definitely need more. Tests are especially needed for:

  - rules specified with and without :w modifiers
  - calls to subrules
  - named alias captures
  - numbered alias captures
  - \s \S \w \W \d \D \n \N

S05 is currently being updated to include information about how captures work in Perl 6 (and PGE). But for those who are eager to get started, I've summarized the essentials as PGE implements them below.

Questions, comments, patches, and tests welcomed.

Pm

==========

* Rules are Parrot subroutines that know how to match strings. To compile a rule, one uses the "PGE::p6rule" function:

    .sub main
        .local pmc p6rule
        .local pmc rulesub
        .local pmc match
        load_bytecode "PGE.pbc"
        p6rule = find_global "PGE", "p6rule"
        rulesub = p6rule(":w (\w+) \:= (\S+)")
        match = rulesub("dog := spot")
    .end

* A rule subroutine returns a "match object" containing the results of the match. In Perl 6, this object will be known as C< $/ >.

* A match object returned from a successful match has the following characteristics:

  - true in boolean context
  - 1 in a numeric context (may change later with :g modifier)
  - the string matched in string context
  - .from() and .to() are offsets delimiting the string where the match was found
  - contains other match objects resulting from captured subpatterns or subrules in the match

* A rule containing capturing parens gets additional match objects for each set of parens. Thus a rule like:

    #                    $1        $2
    rulesub = p6rule(":w(\w+) \:= (\S+)")

  captures the word characters prior to the ":=" into $/[0], and the non-space characters following the ":=" into $/[1]. (In Perl 6, the $1, $2, ... variables will be aliases to $/[0], $/[1], ... .)

    rulesub = p6rule(":w(\w+) \:= (\S+)")
    match = rulesub(" let foo := 123 ")
    print match              # outputs "foo := 123"
    $P0 = match[0]           # first subpattern capture ($1)
    print $P0                # outputs "foo"
    $P0 = match[1]           # second subpattern capture ($2)
    print $P0                # outputs "123"

* If a capture is quantified with any of '+', '*', or '**{m..n}', then it generates an array of match objects for the subpattern capture instead of a single match object:

    rulesub = p6rule(":w(\w+) \:= (\S+ )*")
    match = rulesub(" foo := zip boom bah")
    print match              # outputs "foo := zip boom bah"
    $P0 = match[0]           # first subpattern capture ($1)
    print $P0                # outputs "foo"
    $P1 = match[1]           # second subpattern array ($2)
    $P2 = $P1[0]             # first repetition ($2[0])
    print $P2                # outputs "zip "
    $P2 = $P1[1]             # second repetition ($2[1])
    print $P2                # outputs "boom "

* Match objects for nested captures are nested into the surrounding capture object. Thus, given

    rulesub = p6rule(":w (let) ( (\w+) \:= (\S+) )")
    match = rulesub("let foo := 123")

  the outer match object contains two match objects ($/[0] and $/[1]), and the second of these contains two match objects at $/[1][0] and $/[1][1].

    print match              # outputs "let foo := 123"
    $P0 = match[0]           # first subcapture ($1)
    print $P0                # outputs "let"
    $P0 = match[1]           # second subcapture ($2)
    $P1 = $P0[0]             # first nested capture ($2[0])
    print $P1                # outputs "foo"
    $P1 = $P0[1]             # second nested capture ($2[1])
    print $P1                # outputs "123"

* Non-capturing subpatterns don't nest match objects:

    rulesub = p6rule(":w (let) [ (\w+) \:= (\S+) ]")
    match = rulesub("let foo := 123")
    print match              # outputs "let foo := 123"
    $P0 = match[0]           # first subcapture ($1)
    print $P0                # outputs "let"
    $P0 = match[1]           # second subcapture ($2)
    print $P0                # outputs "foo"
    $P0 = match[2]           # third subcapture ($3)
    print $P0                # outputs "123"

* To define a subrule, store its subroutine into a symbol table somewhere:

    rulesub = p6rule("int | double | float | char")
    store_global "type", rulesub
    rulesub = p6rule("\w+")
    store_global "ident", rulesub

* To match a subrule, put the name of the subrule in angle brackets:

    rulesub = p6rule(":w<type> <ident>")
    match = rulesub(" int argc ")
    print match              # outputs "int argc"

* Subrule captures become named keys in the resulting match object:

    rulesub = p6rule(":w<type> <ident>")
    match = rulesub(" int argc ")
    print match              # outputs "int argc"
    $P0 = match["type"]      # get type subrule ($/<type>)
    print $P0                # outputs "int"
    $P0 = match["ident"]     # get ident subrule ($/<ident>)
    print $P0                # outputs "argc"

* Quantified subrules produce an array of match objects:

    rulesub = p6rule(":w<type> <ident> [ , <ident>]*")
    (match) = rulesub(" float alpha, beta, gamma")
    $P0 = match["type"]      # get type subrule ($/<type>)
    print $P0                # outputs "float"
    $P0 = match["ident"]     # get ident subrule (array)
    $P1 = $P0[0]             # first ident ($/<ident>[0])
    print $P1                # outputs "alpha"
    $P1 = $P0[1]             # second ident ($/<ident>[1])
    print $P1                # outputs "beta"

* Captures can be aliased via named aliases:

    rulesub = p6rule(":w $<key>:=[\w+] = $<val>:=[\S+]")
    (match) = rulesub(" abc = 123")
    $P0 = match["key"]       # get "key" capture
    print $P0                # outputs "abc"
    $P0 = match["val"]       # get "val" capture
    print $P0                # outputs "123"

* Or you can use numbered aliases:

    rulesub = p6rule(":w $3:=[\w+] = $1:=[\S+]")
    (match) = rulesub(" abc = 123")
    $P0 = match[0]           # get $1
    print $P0                # outputs "123"
    $P0 = match[2]           # get $3
    print $P0                # outputs "abc"

PGE provides the "dump" method for match objects to provide a data dump of the results. Here's a long example for parsing arithmetic expressions using the following grammar:

    rule factor { \w+ | \( <expr> \) }
    rule term {:w <factor> [ (\*|/) <factor> ]* }
    rule expr {:w <term> [ (\+|-) <term> ]* }

The PIR code is

    .sub _main
        .local pmc p6rule
        .local pmc match
        load_bytecode "../../runtime/parrot/library/PGE.pbc"
        p6rule = find_global "PGE", "p6rule"
        $P0 = p6rule("\w+ | \\( <expr> \\)")
        store_global "factor", $P0
        $P0 = p6rule(":w <factor> [ $<op>:=(\*|/) <factor> ]*")
        store_global "term", $P0
        $P0 = p6rule(":w <term> [ $<op>:=(\+|-) <term> ]*")
        store_global "expr", $P0
        $P0 = p6rule("<expr>")
        match = $P0("ab * (de + fg) - jk")
        match."dump"("$/")
    .end

When executed, the match."dump" call produces the following output, displaying the contents of the match object in $/:

    $/: <ab * (de + fg) - jk @ 0> 1
    $/<expr>: <ab * (de + fg) - jk @ 0> 1
    $/<expr><term>[0]: <ab * (de + fg) @ 0> 1
    $/<expr><term>[0]<op>[0]: <* @ 3> 1
    $/<expr><term>[0]<factor>[0]: <ab @ 0> 1
    $/<expr><term>[0]<factor>[1]: <(de + fg) @ 5> 1
    $/<expr><term>[0]<factor>[1]<expr>: <de + fg @ 6> 1
    $/<expr><term>[0]<factor>[1]<expr><term>[0]: <de @ 6> 1
    $/<expr><term>[0]<factor>[1]<expr><term>[0]<factor>[0]: <de @ 6> 1
    $/<expr><term>[0]<factor>[1]<expr><term>[1]: <fg @ 11> 1
    $/<expr><term>[0]<factor>[1]<expr><term>[1]<factor>[0]: <fg @ 11> 1
    $/<expr><term>[0]<factor>[1]<expr><op>[0]: <+ @ 9> 1
    $/<expr><term>[1]: <jk @ 17> 1
    $/<expr><term>[1]<factor>[0]: <jk @ 17> 1
    $/<expr><op>[0]: <- @ 15> 1