On Wed, Sep 01, 2004 at 01:07:49PM -0700, Larry Wall wrote: > On Wed, Sep 01, 2004 at 01:57:32PM -0400, Dan Sugalski wrote: > : I promised Patrick this a while back but never got it, so here it is. > : > : This is a list of the semantics that I see as needed for a regex > : engine. When we have 'em, we'll map them to string ops, and may well > : add in some special-case code for faster access. > : > : *) extract substring > > Mostly you don't want to go to the trouble of extracting substrings > until you're forced to, because you're continually creating and > destroying backrefs into your string, and you don't want to be > copying characters around for that. As long as there's some kind > of COWish semantics to keep around an original copy of the string > being searched, that's probably all the regex engine itself wants.
Indeed, what I've been working towards in my head is that a rule would consist of a set of "components" which can nest and can refer to other rules, and each component would simply keep track of its current start and end positions of what it matches in the string, as well as what to do if that component needs to backtrack/continue because a successive component is unable to achieve its part of the match (or if we're looking for an C<:exhaustive> set of matches). As Larry mentioned in a previous conversation, the rules engine will need to need to remember its state without keeping a large stack (fortunately there appear to be a number of possible optimizations here). So, what a rule component really wants to do is to be able to perform comparisons and tests of substrings in-place, as opposed to extracting them to perform the match. Opcodes to support that would be most helpful (they may already be there--I just haven't looked at that part yet). And yes, my current state of thinking is that each rule component is a Parrot object of some sort that knows where it fits in the rule and can pass information and control flow to/from its neighbors in trying to get the rule(s) to match the target string. > : *) find first character of class X in string > : *) find first character not of class X in string > We're gonna run into the "what's a character?" issue here, especially > at higher Unicode levels where what the user things of as a single > character is really a sequence of codepoints. > : [...] > : *) create new class X > : *) add or subtract character to class X > : *) create union|intersection|difference of two classes > Not sure you really need opcodes for those if character classes are > just real objects with a particular interface. Indeed, I was thinking that character classes would just be more instances of rule component objects, and these would have methods for building unions/intersections/differences. Or, the rules compiler itself can do much of this when it constructs the character class components, rather than try to build them dynamically in Parrot. > (Also, the Perl 6 parser may be based on the notion of a set of hash keys > being treated a bit like a lex grammar, Yes, I think this is likely. > (Also, minor nit, I'm not sure "find" is the right verb here and > elsewhere. Mostly the regex engine just wants to check the "current" > location. The rest is control flow.) Yes, I'm (pardon the pun) finding that advanced "string find" operations may not be of great importance until we start looking at some special-case optimizations. I'd say to wait for those opcodes until we find (sorry!) we need them. > I see one other potential gotcha with respect to backtracking and > closures. In P6, a closure can declare a hypothetical variable > that is restored only if the closure exits "unsuccessfully". [...] Indeed, but I'm gonna cross that bridge when I get to it. Pm