Parsing data
One of the first things that's becoming obvious to me in playing with Rakudo's rules is that parsing strings isn't always what I'm going to want to do. The most common example of wanting to parse data that's not in string form is the YACC scenario, where you have a function produce a stream of tokenized data that is then parsed into a more complex representation. In similar fashion, there are transformers like TGE that take syntax trees and transform them into alternative representations.

To that end, I'd like to suggest (for 6.1 or whatever comes after initial stability) an extension to rules:

    [ 'orange', 'apple', 'apple', 'pear', 'banana' ] ~~ rx :data { 'apple'+ 'pear' }

Adding :data forces the match to proceed against the elements of the supplied array as a sequence, rather than as individual matches the way it behaves now. Each element of the array is matched against each atom in the expression.

To support complex data (instead of matching all elements as fixed strings), a new matching syntax is proposed. The "current object" in what follows is the object in the input data which is currently being treated as an atom (e.g. an array element). It might be any kind of data, such as a sub-array, a number, or a string.

<^...> matches against embedded, complex data. There are several forms, depending on what comes after the ^.

Forms that work on the current element of the input:

    ^{...}    smart-matches the current object against the return value of the closure
    ^~exp     parses exp as a regex and matches it as a string against the current object (disabling :data locally)
    ^::exp    exp is an identifier; smart-matches the current object on type

Note that the second two forms can be implemented (though possibly not optimally) using the first.
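As an aside, the flat (non-nested) part of this proposal can be emulated today. Here is a minimal Python sketch of matching a regex over a sequence of tokens rather than characters, assuming the tokens never contain the chosen separator byte; the names `SEP`, `tokens_to_text`, and `atom` are hypothetical helpers of mine, not part of the proposal:

```python
import re

SEP = "\x00"  # assumption: this byte never occurs inside a token

def tokens_to_text(tokens):
    """Encode a list of string tokens as one string, one token per 'cell'."""
    return "".join(t + SEP for t in tokens)

def atom(tok):
    """A pattern fragment that matches exactly one whole element."""
    return re.escape(tok) + SEP

# The element-level pattern 'apple'+ 'pear' becomes a character-level regex:
pattern = "(?:%s)+%s" % (atom("apple"), atom("pear"))

data = ["orange", "apple", "apple", "pear", "banana"]
m = re.search(pattern, tokens_to_text(data))
print(bool(m))  # True: the element sequence apple apple pear appears
```

The point of the encoding is that quantifiers now apply per element, not per character, which is the essence of what :data asks for; it obviously cannot express the nested <^...> forms.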
These forms treat the current element of the input as a sub-array and attempt to traverse it, leaving :data enabled:

    ^[exp]    parses exp as a regex and matches it against an array object
    ^ name    (note the space) identical to <^[ <name> ]>

Example: this parses a binary operator tree:

    token undef { <^{undef}> }
    token op { < + - * / > }  # works because the whole object is a one-character string
    token term { <^::Num> | <^~ \d+ > | <undef> }  # number, string with digits, or undef
    rule binoptree {
        <op>
        $<left>  = [ <term> | <^ binoptree> ]
        $<right> = [ <term> | <^ binoptree> ]
    }

    [ '+', 5, [ '*', 6, 7 ] ] ~~ rx :data /<binoptree>/

Some notes: perhaps this should simply refer to iterable objects and not arrays? Is there a better way to unify the handling of matching against the current object vs. matching against embedded structures? What about matching nested hashes?

What I find phenomenal is that this requires so little change to the existing spec for rules. It's a really simple approach, but it gives us the ability to start applying rules in all sorts of ways we never dreamed of before. I might even tackle trying to implement this instead of the parser library I was working on, if there's some agreement that it makes sense and looks like the correct way to go about it.
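The binoptree grammar above can be mimicked as a recursive recognizer, which may help make the intended semantics concrete. This is a Python sketch under my own assumptions (lists stand in for sub-arrays, `None` for undef); `is_term` and `is_binoptree` are hypothetical names, not anything from the proposal:

```python
import re

OPS = {"+", "-", "*", "/"}  # token op { < + - * / > }

def is_term(x):
    # token term { <^::Num> | <^~ \d+ > | <undef> }
    return (isinstance(x, (int, float))
            or (isinstance(x, str) and bool(re.fullmatch(r"\d+", x)))
            or x is None)

def is_binoptree(node):
    # rule binoptree { <op> $<left>=[ <term> | <^ binoptree> ]
    #                       $<right>=[ <term> | <^ binoptree> ] }
    if not (isinstance(node, list) and len(node) == 3):
        return False
    op, left, right = node
    return (op in OPS
            and (is_term(left) or is_binoptree(left))
            and (is_term(right) or is_binoptree(right)))

print(is_binoptree(["+", 5, ["*", 6, 7]]))  # True
print(is_binoptree(["+", 5, ["%", 6, 7]]))  # False: unknown operator
```

A real rules engine would of course also bind $<left> and $<right> captures and support backtracking; this only shows the recognition structure.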
Re: Parsing data
Aaron Sherman wrote:
> One of the first things that's becoming obvious to me in playing with
> Rakudo's rules is that parsing strings isn't always what I'm going to
> want to do. The most common example of wanting to parse data that's
> not in string form is the YACC scenario where you want to have a
> function produce a stream of tokenized data that is then parsed into a
> more complex representation. In similar fashion there's transformers
> like TGE that take syntax trees and transform them into alternative
> representations.
>
> To that end, I'd like to suggest (for 6.1 or whatever comes after
> initial stability) an extension to rules:

Did you read http://perlcabal.org/syn/S05.html#Matching_against_non-strings already? It's not yet implemented by any compiler, but it seems to (mostly) do what you want already, no?

Cheers,
Moritz
Re: Parsing data
Sorry, I accidentally took the thread off-list. Re-posting some of my comments below:

On Wed, Oct 7, 2009 at 6:50 PM, Moritz Lenz wrote:
> Aaron Sherman wrote:
>> One of the first things that's becoming obvious to me in playing with
>> Rakudo's rules is that parsing strings isn't always what I'm going to
>> want to do. The most common example of wanting to parse data that's
>> not in string form is the YACC scenario where you want to have a
>> function produce a stream of tokenized data that is then parsed into a
>> more complex representation. In similar fashion there's transformers
>> like TGE that take syntax trees and transform them into alternative
>> representations.
>>
>> To that end, I'd like to suggest (for 6.1 or whatever comes after
>> initial stability) an extension to rules:
>
> Did you read
> http://perlcabal.org/syn/S05.html#Matching_against_non-strings already?

(I went off and read that, and then replied to Moritz:)

OK, no. That proposal only does part of the work. It would suffice for something like the lexer/parser connectivity, but transforming complex data structures would be impossible. By contrast, what I've suggested would work for both cases. It also preserves the existing ~~ /b/ functionality that we have today, and it's not entirely clear to me that the proposal you linked to does.

So, to recap: :data turns on data mode, which treats the input as an iterator and matches each atom in the regex against objects returned by the iterator (must it be rewindable? or do we cache when it's not?). Then, inside the regex, we use <^...> to match complex data such as:

    <^~ ... >        match digits in a single element (equivalent to <,> \d+ <,> in the proposal you linked), with :data turned off
    <^{ ... }>       smart-match the return value of the closure against the current element
    <^::identifier>  smart-match a type against the current element
    <^[...]>         descend into the current element, which should itself be iterable, and match with :data turned on
    <^ name>         same as <^[ <name> ]>

This should be powerful enough to match any arbitrarily nested set of iterable objects. I think it will be particularly useful against parse trees (and similar structures such as XML/HTML DOMs) and scanner productions, though users will probably find nearly infinite uses for it, much like the original regular expressions.
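The five recapped forms can be sketched as predicate combinators, which is one way an implementation could factor them. This is a Python sketch under my own naming (`match_closure`, `match_type`, `match_str`, and `descend` are hypothetical analogues of <^{...}>, <^::>, <^~>, and <^[...]>, not real API), limited to fixed-length sub-arrays with no quantifiers:

```python
import re

def match_closure(pred):  # <^{...}>: any predicate over the current element
    return pred

def match_type(t):        # <^::Type>: smart-match on type
    return lambda x: isinstance(x, t)

def match_str(pattern):   # <^~ ... >: string regex against one element
    rx = re.compile(pattern)
    return lambda x: isinstance(x, str) and rx.fullmatch(x) is not None

def descend(*atoms):      # <^[ ... ]>: element-wise match on a sub-array
    def go(x):
        return (isinstance(x, (list, tuple))
                and len(x) == len(atoms)
                and all(a(e) for a, e in zip(atoms, x)))
    return go

# Matches [ 'sum', an int, [ 'sum', digit-string, an int ] ]
pat = descend(match_str(r"sum"),
              match_type(int),
              descend(match_str(r"sum"), match_str(r"\d+"), match_type(int)))

print(pat(["sum", 5, ["sum", "6", 7]]))  # True
print(pat(["sum", 5, ["sum", "x", 7]]))  # False
```

Recursion for the <^ name> form would fall out of letting a named combinator refer to itself, much as binoptree does above.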
Re: Parsing data
On 2009-Oct-7, at 5:18 pm, Aaron Sherman wrote:
> This should be powerful enough to match any arbitrarily nested set of
> iterable objects. I think it will be particularly useful against parse
> trees (and similar structures such as XML/HTML DOMs) and scanner
> productions, though users will probably find nearly infinite uses for
> it, much like original regular expressions.

I agree that being able to parse data structures would be *extremely* useful. (I think I posted a suggestion like that at one time, though I didn't propose any syntax.) There is already a way to parse data -- Signatures -- but not with the full power that grammars can apply to text. Data-grammars could do everything that Signatures do, and more, though it's still worth having special syntax designed to fit the special and ubiquitous case of sub signatures. Trying to expand signature syntax to cover everything grammars can do would probably end up too ugly; nevertheless, if we had full grammars to build on, I'm sure the Sig-syntax could be extended quite a lot before it got too cumbersome. It would also open the way for people to build custom sig-parsers (with their own special syntax or not).

It also might be worth inventing a whole new syntax designed for parsing and manipulating data structures, but your suggested extensions seem pretty good to me.

-David
Re: Overloading Roles
On 2009-Oct-5, at 3:41 pm, Jon Lang wrote:
> Concerning that last one: would it be reasonable to have a Discrete
> role that provides a .succ method, and then overload the Range role?

I think a type needs to be Discrete and Ordered for successors to make sense (e.g. consider a discrete unordered type like Complex ints). I'm still thinking a Discrete type is the same as a Set; so perhaps Discrete & Ordered means "Positional"? But that doesn't feel right -- although any ordered set can be represented as an array, that seems to confuse the idea of "order" that is intended. (And perhaps "Discrete" should be a different type from "Set" even if they do work out the same, simply to better document one's intent.)

-David
Re: Parsing data
On Thu, Oct 8, 2009 at 12:57 AM, David Green wrote:
> I agree that being able to parse data structures would be *extremely*
> useful. (I think I posted a suggestion like that at one time, though I
> didn't propose any syntax.) There is already a way to parse data --
> Signatures, but not with the full power that grammars can apply to text.

I'm certain I'm not the first person to think of it, especially since the link that Moritz provided came so close.

Just to get a sense of the scope of it, I've been playing around with the code tonight. Modifying PGE to handle the syntactic changes was trivial, but of course it's the semantic shift of allowing the input to a regex to be a data structure other than a string that will take some digging around and re-tooling (many assumptions need to be re-visited, like what "pos" means in the context of a multi-level data structure match!).

I'll have another go at that tomorrow and see how much work it's likely to be, but I'm still thinking this is something to push forward into 6.1, and not try to hold up any work on 6.0 for. I suppose if I want to be nice to others, I should come up with a patch against the STD as well, since there's now an active project using it to compile code.