PGE features update

Patrick R. Michaud Sun, 08 May 2005 08:47:05 -0700

There have been a lot of substantial updates to the Parrot Grammar
Engine, so I figure it's a good time for a summary report.  The latest 
version (in the Parrot cvs trunk) now incorporates:


 -  subrules
 -  capture aliases (named and numbered)
 -  the :w modifier and <?ws> subrules
 -  word boundaries (\b and \B)
 -  simple backreferences  ($1, $2, $<term>, etc.)

Also, more tests have been added to t/p6rules, but we definitely
need more.  Tests are especially needed for:
 -  Rules specified with and without :w modifiers
 -  calls to subrules
 -  named alias captures
 -  numbered alias captures
 -  \s \S \w \W \d \D \n \N

S05 is currently being updated to include information about how
captures work in Perl 6 (and PGE).  But for those who are eager
to get started, I've summarized the essentials as PGE implements
them below.

Questions, comments, patches, and tests welcomed.

Pm

==========

* Rules are Parrot subroutines that know how to match strings.  To
  compile a rule, one uses the "PGE::p6rule" function:

    .sub main
        .local pmc p6rule
        .local pmc rulesub
        .local pmc match
        load_bytecode "PGE.pbc"
        p6rule = find_global "PGE", "p6rule"

        rulesub = p6rule(":w (\w+) \:= (\S+)")
        match = rulesub("dog := spot")
    .end

* A rule subroutine returns a "match object" containing the
  results of the match.  In Perl 6, this object will be known as C< $/ >.

* A match object returned from a successful match has the following
  characteristics:
  - true in boolean context
  - 1 in a numeric context (may change later with :g modifier)
  - the string matched in string context
  - .from() and .to() are offsets delimiting the string where 
    the match was found
  - contains other match objects resulting from captured
    subpatterns or subrules in the match
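
  The characteristics above can be exercised roughly as follows.  This
  is only a sketch: it assumes .from() and .to() are called with the
  same method-call syntax as the "dump" method shown later in this
  message, and it makes no claim about the exact offset values printed.

```pir
.sub main
    .local pmc p6rule
    .local pmc rulesub
    .local pmc match
    load_bytecode "PGE.pbc"
    p6rule = find_global "PGE", "p6rule"

    rulesub = p6rule(":w (\w+) \:= (\S+)")
    match = rulesub("dog := spot")

    if match goto matched              # boolean context: true on success
    print "no match\n"
    end
  matched:
    print match                        # string context: the text matched
    print "\n"
    $I0 = match."from"()               # offsets delimiting the matched text
    $I1 = match."to"()
    print $I0
    print " "
    print $I1
    print "\n"
.end
```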

* A rule containing capturing parens gets additional match objects
  for each set of parens.  Thus a rule like:

                             $1        $2
        rulesub = p6rule(":w(\w+) \:= (\S+)")

  captures the word characters prior to the ":=" into $/[0], and 
  the non-space characters following the ":=" into $/[1].  (In Perl 6,
  the $1, $2, ... variables will be aliases to $/[0], $/[1], ... . )

        rulesub = p6rule(":w(\w+) \:= (\S+)")
        match = rulesub(" let foo := 123 ")
        print match                        # outputs "foo := 123"
        $P0 = match[0]                     # first subpattern capture ($1)
        print $P0                          # outputs "foo"
        $P0 = match[1]                     # second subpattern capture ($2)
        print $P0                          # outputs "123"

* If a capture is quantified with any of '+', '*', or '**{m..n}',
  then it generates an array of match objects for the subpattern capture
  instead of a single match object:

        rulesub = p6rule(":w(\w+) \:= (\S+ )*")
        match = rulesub(" foo := zip boom bah")
        print match                        # outputs "foo := zip boom bah"
        $P0 = match[0]                     # first subpattern capture ($1)
        print $P0                          # outputs "foo"
        $P1 = match[1]                     # second subpattern array ($2)
        $P2 = $P1[0]                       # first repetition ($2[0])
        print $P2                          # outputs "zip "
        $P2 = $P1[1]                       # second repetition ($2[1])
        print $P2                          # outputs "boom "

* Match objects for nested captures are nested into the surrounding
  capture object.  Thus, given

        rulesub = p6rule(":w (let) ( (\w+) \:= (\S+) )")
        match = rulesub("let foo := 123")

  the outer match object contains two match objects ($/[0] and $/[1]),
  and the second of these contains two match objects at
  $/[1][0] and $/[1][1].

        print match                        # outputs "let foo := 123"
        $P0 = match[0]                     # first subcapture ($1)
        print $P0                          # outputs "let"
        $P0 = match[1]                     # second subcapture ($2)
        $P1 = $P0[0]                       # first nested capture ($2[0])
        print $P1                          # outputs "foo"
        $P1 = $P0[1]                       # second nested capture ($2[1])
        print $P1                          # outputs "123"

* Non-capturing subpatterns don't nest match objects:

        rulesub = p6rule(":w (let) [ (\w+) \:= (\S+) ]")
        match = rulesub("let foo := 123")
        print match                        # outputs "let foo := 123"
        $P0 = match[0]                     # first subcapture ($1)
        print $P0                          # outputs "let"
        $P0 = match[1]                     # second subcapture ($2)
        print $P0                          # outputs "foo"
        $P0 = match[2]                     # third subcapture ($3)
        print $P0                          # outputs "123"

* To define a subrule, store its subroutine into a symbol table somewhere:

        rulesub = p6rule("int | double | float | char")
        store_global "type", rulesub 
        rulesub = p6rule("\w+")
        store_global "ident", rulesub

* To match a subrule, put the name of the subrule in angle brackets:

        rulesub = p6rule(":w<type> <ident>")
        match = rulesub("   int argc ")
        print match                        # outputs "int argc"

* Subrule captures become named keys in the resulting match object:

        rulesub = p6rule(":w<type> <ident>")
        match = rulesub("   int argc ")
        print match                        # outputs "int argc"
        $P0 = match["type"]                # get type subrule  ($/<type>)
        print $P0                          # outputs "int"
        $P0 = match["ident"]               # get ident match ($/<ident>)
        print $P0                          # outputs "argc" 

* Quantified subrules produce an array of match objects:

        rulesub = p6rule(":w<type> <ident> [ , <ident>]*")
        (match) = rulesub("    float alpha, beta, gamma")
        $P0 = match["type"]                # get type subrule ($/<type>)
        print $P0                          # outputs "float"
        $P0 = match["ident"]               # get ident subrule (array)
        $P1 = $P0[0]                       # first ident ($/<ident>[0])
        print $P1                          # outputs "alpha"
        $P1 = $P0[1]                       # second ident ($/<ident>[1])
        print $P1                          # outputs "beta"

* Captures can be aliased via named aliases:

        rulesub = p6rule(":w $<key>:=[\w+] = $<val>:=[\S+]")
        (match) = rulesub("   abc = 123")
        $P0 = match["key"]                 # get "key" capture
        print $P0                          # outputs "abc"
        $P0 = match["val"]                 # get "val" capture
        print $P0                          # outputs "123"

* Or you can use numbered aliases:

        rulesub = p6rule(":w $3:=[\w+] = $1:=[\S+]")
        (match) = rulesub("   abc = 123")
        $P0 = match[0]                     # get $1
        print $P0                          # outputs "123"
        $P0 = match[2]                     # get $3
        print $P0                          # outputs "abc"
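
* The feature list at the top also mentions simple backreferences,
  which none of the examples here exercise.  A minimal sketch, assuming
  that $1 inside a rule refers back to the text of the first capture
  and that a bare comma matches literally (as in the quantified
  subrule example); the rule and target string are made up for
  illustration:

```pir
.sub main
    .local pmc p6rule
    .local pmc rulesub
    .local pmc match
    load_bytecode "PGE.pbc"
    p6rule = find_global "PGE", "p6rule"

    # a word, a comma, then the same word again via the backreference
    rulesub = p6rule("(\w+) , $1")
    match = rulesub("abc,abc")

    if match goto matched
    print "no match\n"
    end
  matched:
    print match                        # the full text matched
    print "\n"
    $P0 = match[0]                     # the capture that $1 refers back to
    print $P0
    print "\n"
.end
```

  A target such as "abc,abd" would be a useful negative case to try
  alongside it.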

PGE provides the "dump" method for match objects to provide
a data dump of the results.  Here's a long example for
parsing arithmetic expressions using the following grammar:

    rule factor { \w+ | \( <expr> \) }
    rule term   {:w <factor> [ $<op>:=(\*|/) <factor> ]* }
    rule expr   {:w <term> [ $<op>:=(\+|-) <term> ]* }

The PIR code is

    .sub _main
        .local pmc p6rule
        .local pmc match
    
        load_bytecode "../../runtime/parrot/library/PGE.pbc"
        p6rule = find_global "PGE", "p6rule"
    
        $P0 = p6rule("\w+ | \\( <expr> \\)")
        store_global "factor", $P0
    
        $P0 = p6rule(":w <factor> [ $<op>:=(\*|/) <factor> ]*")
        store_global "term", $P0
    
        $P0 = p6rule(":w <term> [ $<op>:=(\+|-) <term> ]*")
        store_global "expr", $P0
    
        $P0 = p6rule("<expr>")
        match = $P0("ab * (de + fg) - jk")
        match."dump"("$/")
    .end

When this code is executed, the match."dump" call produces the
following output, displaying the contents of the match object
in $/:

    $/: <ab * (de + fg) - jk @ 0> 1
    $/<expr>: <ab * (de + fg) - jk @ 0> 1
    $/<expr><term>[0]: <ab * (de + fg)  @ 0> 1
    $/<expr><term>[0]<op>[0]: <* @ 3> 1
    $/<expr><term>[0]<factor>[0]: <ab @ 0> 1
    $/<expr><term>[0]<factor>[1]: <(de + fg) @ 5> 1
    $/<expr><term>[0]<factor>[1]<expr>: <de + fg @ 6> 1
    $/<expr><term>[0]<factor>[1]<expr><term>[0]: <de  @ 6> 1
    $/<expr><term>[0]<factor>[1]<expr><term>[0]<factor>[0]: <de @ 6> 1
    $/<expr><term>[0]<factor>[1]<expr><term>[1]: <fg @ 11> 1
    $/<expr><term>[0]<factor>[1]<expr><term>[1]<factor>[0]: <fg @ 11> 1
    $/<expr><term>[0]<factor>[1]<expr><op>[0]: <+ @ 9> 1
    $/<expr><term>[1]: <jk @ 17> 1
    $/<expr><term>[1]<factor>[0]: <jk @ 17> 1
    $/<expr><op>[0]: <- @ 15> 1
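
The paths printed by "dump" correspond to keyed lookups on the match
object.  As a sketch (same rules as in the PIR code above, using only
the string and integer indexing shown in the earlier examples), the
following walks down to two of the entries listed in the dump:

```pir
.sub _main
    .local pmc p6rule
    .local pmc match

    load_bytecode "../../runtime/parrot/library/PGE.pbc"
    p6rule = find_global "PGE", "p6rule"

    # same three rules as in the example above
    $P0 = p6rule("\w+ | \\( <expr> \\)")
    store_global "factor", $P0
    $P0 = p6rule(":w <factor> [ $<op>:=(\*|/) <factor> ]*")
    store_global "term", $P0
    $P0 = p6rule(":w <term> [ $<op>:=(\+|-) <term> ]*")
    store_global "expr", $P0

    $P0 = p6rule("<expr>")
    match = $P0("ab * (de + fg) - jk")

    $P1 = match["expr"]                # $/<expr>
    $P2 = $P1["op"]                    # $/<expr><op> (array)
    $P3 = $P2[0]                       # $/<expr><op>[0], the "-" at offset 15
    print $P3
    print "\n"
    $P2 = $P1["term"]                  # $/<expr><term> (array)
    $P3 = $P2[1]                       # $/<expr><term>[1], the "jk" at offset 17
    print $P3
    print "\n"
.end
```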
