On Sun, Oct 12, 2008 at 05:34:49PM +0200, Moritz Lenz wrote:
: Patrick R. Michaud wrote:
: > On Sun, Oct 12, 2008 at 11:44:05AM +0200, Moritz Lenz wrote:
: >> When we write regexes, we generally capture stuff in a way that makes
: >> the following semantic analysis easier. For example we could have a
: >> regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
: >> trees of what <this> and <that> match, not their respective order.
: >> [...]
: >> But if you want to re-use the match tree for something different (say,
: >> instead of doing a semantic analysis we want to do syntax highlighting)
: >> it's rather hard to reconstruct the original text, and what part of it
: >> was matched by which subrule.
: >
: > Perhaps aliases...?
: >
: >     m/ <this>+ <that>? <andthen=this>* /
: >
: > This is probably not exactly what you're looking for, but
: > that would be what I would look at for this specific example.
:
: I'm looking more for a general solution for which you don't have to
: manipulate the rule itself, and which should ideally work with as little
: knowledge of the rule as possible.
:
: Just see through which loops STD5_dump_match (in the same dir as STD.pm)
: has to jump to get a grab of the parse tree in the right order.
:
: Moritz

Yes, funny thing is I was just thinking about the same thing this
morning after Mitchell Charity noticed that elsifs were missing from
the tree. It will be relatively trivial to do this with STD, since it
already produces a general mapping from position to hash, which it uses
to cache whitespace matches and line numbers, but could easily record
what matched where. (See the .<_> hash for that.)

In my case, I was wanting to find the set of non-whitespace things that
are parsed but don't end up in the parse tree. Maybe the :keepall
modifier needs access to something like this as well. It may also let
me remove the kludge whereby ~ remembers the delimiters on either side.
It could also revolutionize the implementation of split. :)

My big question is how best to make this ordered info available within
a Match, given that we currently use the Positional role for something
else. An argument could be made that this info is more important than
revealing $0, $1 etc. at the top level of the Match, that is, that split
semantics are more natural than comb semantics for @($/). One data
point is that the STD grammar uses very little $0 and then only as a
named parameter that happens to have a numeric name. So we could
easily demote $0 etc. to meaning $/.numbered[0] or some such.

Of course, it goes the other way too, and we can reveal the splits via
a .split method or some such. Plus we can have multiple levels of
splitting semantics, so then *they'd* be fighting over Positional if we
made one of them default.

So I'm thinking @($/) stays the way it is, but .splits might return the
top-level splits for a given rule, where strings are intermixed with
child tree nodes, whereas something like .allsplits might return all
the ordered strings along with mappings to what parsed them.

If we did that, then there's the question of whether .splits needs to
run the pattern lazily so that we can do a limited /':'/.splits(4) and
such. That may turn out to be abuse of the lazy system though.
And technically, that regex *isn't* binding the colons to a child node,
so there's a little semantic mismatch there as well, since a split
implemented in terms of .splits would look more like /.*?(':')/. So
maybe .splits is the wrong name. Suggestions welcome.

The cool thing about .allsplits is that if you're doing, say, syntax
highlighting on the fly in an editor, it might be relatively easy to
run down the list and determine top-level nodes that limit how much
needs to be reparsed. Contrariwise, with the "fate" system of STD it
might even be relatively easy to put the parser back into a state that
was deeply recursive and restart the parse at any point.

'Course, "relatively easy" is one o' them relative concepts... :)

Larry
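
The ".allsplits" idea discussed above (all the ordered strings, each
mapped to whatever parsed it) can be illustrated with a rough sketch.
This is not STD's mechanism and not a proposed API: it is written in
Python rather than Perl 6, the name all_splits is made up, and the two
patterns are hypothetical stand-ins for <this> and <that>.

```python
import re

# Hypothetical stand-ins for the <this> and <that> subrules.
TOKEN = re.compile(r"(?P<this>[a-z]+)|(?P<that>[0-9]+)")

def all_splits(text):
    """Return ordered (substring, rule_name) pairs covering all of text.

    rule_name is None for stretches that no named subrule matched, so
    the original text can be rebuilt by joining the substrings.
    """
    pieces = []
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            # Text that was skipped over between two subrule matches.
            pieces.append((text[pos:m.start()], None))
        # lastgroup is the name of the alternative that matched.
        pieces.append((m.group(), m.lastgroup))
        pos = m.end()
    if pos < len(text):
        pieces.append((text[pos:], None))
    return pieces

pieces = all_splits("ab 12 cd")
# Joining the substrings reproduces the input exactly.
assert "".join(s for s, _ in pieces) == "ab 12 cd"
```

Because the list is in source order and covers every character, a
highlighter could walk it without knowing anything about the rule that
produced it, which is roughly the property asked for in the quoted
message.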