Parsing data

Aaron Sherman Wed, 07 Oct 2009 15:45:23 -0700

One of the first things that's becoming obvious to me in playing with
Rakudo's rules is that parsing strings isn't always what I'm going to
want to do. The most common example of wanting to parse data that's
not in string form is the YACC scenario where you want to have a
function produce a stream of tokenized data that is then parsed into a
more complex representation. Similarly, there are transformers
like TGE that take syntax trees and transform them into alternative
representations.


To that end, I'd like to suggest (for 6.1 or whatever comes after
initial stability) an extension to rules:

 [ 'orange', 'apple', 'apple', 'pear', 'banana' ] ~~ rx :data { 'apple'+ 'pear' }

Adding :data forces the match to proceed against the elements of the
supplied array as a sequence, rather than matching each element
individually the way it does now. Each atom in the expression matches
one whole element of the array.
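
To make the intended semantics concrete, here is a minimal sketch that emulates the sequence match in Python, since the :data syntax itself does not exist yet. Everything in it (the (predicate, quantifier) atom representation and the names lit, match_at and search) is hypothetical and only illustrates the idea that each atom consumes whole array elements. It assumes greedy quantifiers with backtracking and an unanchored search.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical representation: an atom is (predicate, quantifier), where the
# quantifier is "1" (exactly one element) or "+" (one or more elements).
Atom = Tuple[Callable[[Any], bool], str]


def lit(value: Any) -> Callable[[Any], bool]:
    """Predicate for a literal atom such as 'apple'."""
    return lambda element: element == value


def match_at(pattern: List[Atom], data: List[Any], start: int) -> Optional[int]:
    """Return the end index of a match beginning at `start`, or None."""
    if not pattern:
        return start
    (pred, quant), rest = pattern[0], pattern[1:]
    available = len(data) - start
    limit = available if quant == "+" else min(1, available)
    # Count how many consecutive elements this atom could consume.
    run = 0
    while run < limit and pred(data[start + run]):
        run += 1
    # Greedy: try the longest run first, then back off one element at a time.
    for taken in range(run, 0, -1):
        end = match_at(rest, data, start + taken)
        if end is not None:
            return end
    return None


def search(pattern: List[Atom], data: List[Any]) -> Optional[Tuple[int, int]]:
    """Unanchored search: return (start, end) of the first match, or None."""
    for start in range(len(data) + 1):
        end = match_at(pattern, data, start)
        if end is not None:
            return start, end
    return None


fruit = ['orange', 'apple', 'apple', 'pear', 'banana']
pattern = [(lit('apple'), '+'), (lit('pear'), '1')]
print(search(pattern, fruit))  # (1, 4)
```

The match covers elements 1 through 3 ('apple', 'apple', 'pear'), which is the behavior I would expect from the rx :data example above.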

To support complex data (instead of matching all elements as fixed
strings), a new matching syntax is proposed. The "current object" in
what follows is the object in the input data which is currently being
treated as an atom (e.g. an array element). It might be any kind of
data such as a sub-array, number or string.

<^...> matches against embedded, complex data. There are several forms
depending on what comes after the ^:

Forms that work on the current element of the input:

 ^{...}  smart-matches current object against return value of closure
 ^~exp   parses exp as a regex and matches as a string against the
         current object (disabling :data locally)
 ^::exp  exp is an identifier and smart-matches on type

Note that the last two forms can be implemented (though possibly not
optimally) using the first.
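
As a rough illustration of that last point, the sketch below emulates the three element-level forms in Python and builds the regex and type forms on top of the closure form. The smart_match function is a deliberately crude stand-in and an assumption on my part, not the real smart-matching rules; in particular it only applies a regex to objects that are already strings. The names closure_form, regex_form and type_form are hypothetical.

```python
import re
from typing import Any, Callable

Predicate = Callable[[Any], bool]


def smart_match(obj: Any, pattern: Any) -> bool:
    """Crude stand-in for smart-matching (an assumption, not the real rules)."""
    if pattern is None:                  # undef matches only undef
        return obj is None
    if isinstance(pattern, type):        # a type: match on type
        return isinstance(obj, pattern)
    if isinstance(pattern, re.Pattern):  # a regex: only strings can match
        return isinstance(obj, str) and pattern.search(obj) is not None
    return obj == pattern                # anything else: plain equality


def closure_form(closure: Callable[[], Any]) -> Predicate:
    """Emulates <^{...}>: smart-match against the closure's return value."""
    return lambda obj: smart_match(obj, closure())


def regex_form(source: str) -> Predicate:
    """Emulates <^~exp>, expressed through the closure form."""
    compiled = re.compile(source)
    return closure_form(lambda: compiled)


def type_form(cls: type) -> Predicate:
    """Emulates <^::exp>, expressed through the closure form."""
    return closure_form(lambda: cls)


is_undef = closure_form(lambda: None)
is_num = type_form(int)
is_digits = regex_form(r'\d+')
print(is_undef(None), is_num(5), is_digits('42'), is_digits(42))
# True True True False
```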

These forms treat the current element of the input as a sub-array and
attempt to traverse it, leaving :data enabled:

 ^[exp]  parses exp as a regex and matches against an array object
 ^ name  (note space) identical to <^[<name>]>

Example:

This parses a binary operator tree:

  token undef { <^{undef}> }
  token op { < + - * / > } # works because the whole object is a one-character string
  token term { <^::Num> | <^~ \d+ > | <undef> } # number, string with digits or undef
  rule binoptree {
    <op>
    $<left> = [ <term> |  <^ binoptree> ]
    $<right> = [ <term> | <^ binoptree> ]
  }

  [ '+', 5, [ '*', 6, 7 ] ] ~~ rx :data /<binoptree>/
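
For comparison, here is roughly what the same traversal looks like when written by hand, sketched in Python. This is only an emulation of what I would expect the binoptree rule to do. The function names and the nested-dict result (a loose analogue of a match object) are hypothetical, and it assumes each (sub-)array must be matched in full, which the proposal does not actually specify.

```python
import re
from numbers import Number
from typing import Any, Optional

OPS = {'+', '-', '*', '/'}


def is_term(obj: Any) -> bool:
    # Stand-in for the term token: a number, a string with digits, or undef.
    if obj is None:
        return True
    if isinstance(obj, bool):  # bool is a numeric type in Python; exclude it
        return False
    if isinstance(obj, Number):
        return True
    return isinstance(obj, str) and re.search(r'\d+', obj) is not None


def match_operand(obj: Any) -> Optional[dict]:
    # An operand is either a term or a nested sub-array holding a binoptree.
    if is_term(obj):
        return {'term': obj}
    return match_binoptree(obj)


def match_binoptree(node: Any) -> Optional[dict]:
    """Return a nested dict describing the tree, or None if it does not match."""
    if not isinstance(node, list) or len(node) != 3:
        return None
    op, left, right = node
    if not (isinstance(op, str) and op in OPS):
        return None
    left_match = match_operand(left)
    right_match = match_operand(right)
    if left_match is None or right_match is None:
        return None
    return {'op': op, 'left': left_match, 'right': right_match}


print(match_binoptree(['+', 5, ['*', 6, 7]]))
```

The hand-written version needs a separate function per rule plus explicit recursion and failure handling, which is the boilerplate the :data form would remove.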

Some notes: perhaps this should simply refer to iterable objects and
not arrays? Is there a better way to unify the handling of matching
against the current object vs matching against embedded structures?
What about matching nested hashes?

What I find phenomenal is that this requires so little change to the
existing spec for rules. It's a really simple approach, but it gives us
the ability to start applying rules in all sorts of ways we never
dreamed of before.

I might even tackle trying to implement this instead of the parser
library I was working on if there's some agreement that it makes sense
and looks like the correct way to go about it....
