On Wed, Aug 28, 2002 at 01:18:19PM -0400, Deven T. Corzine wrote: > Re: [perl6-language] Does ::: constrain the pattern engine implementation? > > On Wed, 28 Aug 2002, Dan Sugalski wrote: > > > A medium-term goal for the regex engine is to note where a DFA would give > > correct behaviour and use one, rather than going through the more > > expensive generalized regex engine we'd otherwise use. > > What would it take to make an experimental prototype of this? > ... > Suppose that someday we have a DFA-based regex engine that can only handle > a subset of the full pattern syntax. I hope that we'll still be able to > automatically factor out subpatterns of more complex patterns, so that the > parts which are simple enough can be implemented with a DFA, and let the > expensive generalized regex engine handle the rest... (I hope it wouldn't > be necessary for the programmer to explicitly separate out the subpatterns > which could be optimized!)
Check out the parrot source and read languages/regex/README. I talk about precisely this optimization in the section "Longer-term optimization vague ideas". (Heh.) I thought a bunch more about this a while back, but didn't put any of it in the README. Didn't really come up with any brilliant ideas anyway. languages/regex is set up to make it straightforward to implement precisely this type of transparent optimization. Of course, nobody but me has ever attempted doing anything, but if you'd like to give it a shot, here's how: In languages/regex/lib/Regex/Ops/Tree.pm, there is a start of an implementation of this, done back before I got distracted by the urge to make the regex compiler actually work for simple cases first. ;-) The way it works is that you have a tree of regex operator nodes, each represented by a Tree object. There are various methods on these nodes that compute information that might be useful to an optimizer. For example, minlen() and maxlen() return the fewest and most characters that a subexpression could possibly match; it's used to convert things like /a[bB]c+def/ to a code sequence resembling check length >= 6 match a match [bB] match c+ check length >= 3 match d match e match f rather than checking the input stream's length just before matching any single character. Anyway, I also have a method dfa_safe() that is supposed to return whether or not a subtree would return the same result on a DFA or traditional backtracking NFA. The implementation is mostly trivial (if you see capturing parens, FALSE. A sequence of nodes is only dfa_safe if all subnodes are. Etc.) For alternation, it is conservative and only returns true if the first character in every branch of the alternation is different, and if every branch is itself dfa_safe. (And I've never run the code, even once, so it doesn't actually work anyway!) The code for making use of this information is in Regex/PreOptimize.pm (possibly to be renamed to TreeOptimize.pm). Look at disable_implicit_checks() for an example of performing the length checking optimization described above. This whole module is for adding, deleting, and changing nodes in the expression tree according to a set of optimizations. You can also subclass it to override the optimize_tree() method to add in your optimization, like so: sub optimize_tree { my ($self, tree) = @_; $tree = $self->SUPER::optimize_tree($tree); ...perform your optimization... return $tree; } The next step is to convert the optimized tree into what I call List ops, which are just slightly higher level than pasm ops. You might want to create a state table lookup op, for example. Look at Regex/Rewrite.pm and Regex/Rewrite/Stackless.pm for many examples of this (Stackless subclasses Rewrite and does most of the work.) Finally, you'll need to implement the pasm rendering of your new List ops. That goes in Regex/CodeGen/Pasm.pm. The only major thing that I might anticipate changing anytime soon is that I might add another CodeGen backend for PIR, the stuff languages/imcc feeds off of. But I'll probably still use pasm as the fallback, so it should still work to just implement things in Pasm.pm. See the top of README for how to run things. I'll be happy to have someone else playing with this stuff while I'm off integrating with perl6, so don't hesitate to ask questions. And once you start to get involved, we'll suck you into working on the immediate needs sooner or later, never fear! Anyone else notice that imcc eats something called PIR, for _P_arrot _I_ntermediate um... _R_anguage? I think Melvin was avoiding PIL because it might tempt someone to reimplement it and advertise it with "Okay, people, it's time for Parrot to swallow a better PIL!" Ouch. Apologies in advance.