Parsing data

2009-10-07 Thread Aaron Sherman
One of the first things that's becoming obvious to me in playing with
Rakudo's rules is that parsing strings isn't always what I'm going to
want to do. The most common example of wanting to parse data that's
not in string form is the YACC scenario where you want to have a
function produce a stream of tokenized data that is then parsed into a
more complex representation. In similar fashion there's transformers
like TGE that take syntax trees and transform them into alternative
representations.

To that end, I'd like to suggest (for 6.1 or whatever comes after
initial stability) an extension to rules:

 [ 'orange', 'apple', 'apple', 'pear', 'banana' ] ~~ rx :data {
'apple'+ 'pear' }

Adding :data forces the match to proceed against the elements of the
supplied array as a sequence, rather than as individual matches the
way it behaves now. Each element of the array is matched by each atom
in the expression.

To support complex data (instead of matching all elements as fixed
strings), a new matching syntax is proposed. The "current object" in
what follows is the object in the input data which is currently being
treated as an atom (e.g. an array element). It might be any kind of
data such as a sub-array, number or string.

<^...> matches against embedded, complex data. There are several forms
depending on what comes after the ^:

Forms that work on the current element of the input:

 ^{...} smart-matches current object against return value of closure
 ^~exp parses exp as a regex and matches as a string against the
current object (disabling :data locally)
 ^::exp exp is an identifier and smart-matches on type

Note that the second two forms can be implemented (though possibly not
optimally) using the first.

These forms treat the current element of the input as a sub-array and
attempt to traverse it, leaving :data enabled:

 ^[exp] parses exp as a regex and matches against an array object
 ^ name (note space) identical to <^[]>

Example:

This parses a binary operator tree:

  token undef { <^{undef}> }
  token op { < + - * / > } # works because the whole object is a
one-character string
  token term { <^::Num> | <^~ \d+ > |  } # number, string with
digits or undef
  rule binoptree {

$ = [  |  <^ binoptree> ]
$ = [  | <^ binoptree> ]
  }

  [ '+', 5, [ '*', 6, 7 ] ] ~~ rx :data //

Some notes: perhaps this should simply refer to iterable objects and
not arrays? Is there a better way to unify the handling of matching
against the current object vs matching against embedded structures?
What about matching nested hashes?

What I find phenomenal is that this requires so little change to the
existing spec for rules. It's a really simple approach, but give us
the ability to start applying rules in all sorts of ways we never
dreamed of before.

I might even tackle trying to implement this instead of the parser
library I was working on if there's some agreement that it makes sense
and looks like the correct way to go about it


Re: Parsing data

2009-10-07 Thread Moritz Lenz
Aaron Sherman wrote:
> One of the first things that's becoming obvious to me in playing with
> Rakudo's rules is that parsing strings isn't always what I'm going to
> want to do. The most common example of wanting to parse data that's
> not in string form is the YACC scenario where you want to have a
> function produce a stream of tokenized data that is then parsed into a
> more complex representation. In similar fashion there's transformers
> like TGE that take syntax trees and transform them into alternative
> representations.
> 
> To that end, I'd like to suggest (for 6.1 or whatever comes after
> initial stability) an extension to rules:

Did you read
http://perlcabal.org/syn/S05.html#Matching_against_non-strings already?

It's not yet implemented by any compiler, but seems to (mostly) do what
you want already, no?

Cheers,
Moritz


Re: Parsing data

2009-10-07 Thread Aaron Sherman
Sorry, I accidentally took the thread off-list. Re-posting some of my
comments below:

On Wed, Oct 7, 2009 at 6:50 PM, Moritz Lenz  wrote:
> Aaron Sherman wrote:
>> One of the first things that's becoming obvious to me in playing with
>> Rakudo's rules is that parsing strings isn't always what I'm going to
>> want to do. The most common example of wanting to parse data that's
>> not in string form is the YACC scenario where you want to have a
>> function produce a stream of tokenized data that is then parsed into a
>> more complex representation. In similar fashion there's transformers
>> like TGE that take syntax trees and transform them into alternative
>> representations.
>>
>> To that end, I'd like to suggest (for 6.1 or whatever comes after
>> initial stability) an extension to rules:
>
> Did you read
> http://perlcabal.org/syn/S05.html#Matching_against_non-strings already?

(I went off and read that, and then replied to Moritz):

OK, no. That proposal only does part of the work. It would suffice for
something like the lexer/parser connectivity, but transforming complex
data structures would be impossible. By contrast, what I've suggested
would work for both cases. It also preserves the existing  ~~
/b/ functionality that we have today, and it's not entirely clear to
me that the proposal that you linked to does.

So, to re-cap:

:data turns on data mode which treats the input as an iterator and
matches each atom in the regex against objects returned by the
iterator (must be rewindable? or do we cache when it's not?)

Then, inside the regex we use <^...> to match complex data such as:

<^~ ... > - match digits in a single element (equiv to <,> \d+ <,> in
the proposal you linked), with :data turned off
<^{ ... }> - smart match the return value of the closure against current element
<^::identifier> - Smart match a type against the current element
<^[...]> - Descend the current element, which should be iterable
itself and match with :data turned on
<^ name> - Same as <^[]>

This should be powerful enough to match any arbitrarily nested set of
iterable objects. I think it will be particularly useful against parse
trees (and similar structures such as XML/HTML DOMs) and scanner
productions, though users will probably find nearly infinite uses for
it, much like original regular expressions.


Re: Parsing data

2009-10-07 Thread David Green

On 2009-Oct-7, at 5:18 pm, Aaron Sherman wrote:

This should be powerful enough to match any arbitrarily nested set of
iterable objects. I think it will be particularly useful against parse
trees (and similar structures such as XML/HTML DOMs) and scanner
productions, though users will probably find nearly infinite uses for
it, much like original regular expressions.


I agree that being able to parse data structure would be *extremely*  
useful.  (I think I posted a suggestion like that at one time, though  
I didn't propose any syntax.)  There is already a way to parse data --  
Signatures, but not with the full power that grammars can apply to text.


Data-grammars could do everything that Signatures do, and more, though  
it's still worth having special syntax designed to fit the special and  
ubiquitous case of sub signatures.  Trying to expand signature-syntax  
to cover everything grammars can do would probably end up too ugly;  
nevertheless, if we had full grammars to build on, I'm sure the Sig- 
syntax could be extended quite a lot before it got too cumbersome.  It  
would also open the way for people to build custom sig-parsers (with  
their own special syntax or not).


It also might be worth inventing a whole new syntax design for parsing  
and manipulating data structures, but your suggested extensions seem  
pretty good to me.



-David



Re: Overloading Roles

2009-10-07 Thread David Green

On 2009-Oct-5, at 3:41 pm, Jon Lang wrote:
Concerning that last one: would it be reasonable to have a Discrete  
role

that provides a .succ method, and then overload the Range role?


I think a type needs to be Discrete and Ordered for successors to make  
sense (e.g. consider a discrete unordered type like Complex ints).


I'm still thinking a Discrete type is the same as a Set; so perhaps  
Discrete & Ordered means "Positional"?  But that doesn't feel right --  
although any ordered set can be represented as an array, that seems to  
confuse the idea of "order" that is intended.


(And perhaps "Discrete" should be a different type from "Set" even if  
they do work out the same, simply to better document one's intent.)




-David



Re: Parsing data

2009-10-07 Thread Aaron Sherman
On Thu, Oct 8, 2009 at 12:57 AM, David Green  wrote:

> I agree that being able to parse data structure would be *extremely* useful.
>  (I think I posted a suggestion like that at one time, though I didn't
> propose any syntax.)  There is already a way to parse data -- Signatures,
> but not with the full power that grammars can apply to text.

I'm certain I'm not the first person to think of it, especially since
the link that Moritz provided came so close.

Just to get a sense of the scope of it, I've been playing around with
the code tonight. Modifying PGE to handle the syntactic changes was
trivial, but of course, it's the semantic shift of allowing the input
to a regex to be a data structure other than a string that will take
some digging around and re-tooling (many assumptions need to be
re-visited like what "pos" means in the context of a multi-level
datastructure match!)

I'll have another go at that tomorrow and see how much work it's
likely to be, but I'm still thinking this is something to push forward
into 6.1, and not try to hold up any work on 6.0 for.

I suppose if I want to be nice to others, I should come up with a
patch against the STD as well, since there's now an active project
using it to compile code.