Re: Possible regex engine optimizations?

Steve Fink Wed, 28 Aug 2002 11:52:55 -0700

On Wed, Aug 28, 2002 at 01:18:19PM -0400, Deven T. Corzine wrote:
> Re: [perl6-language] Does ::: constrain the pattern engine implementation?
> 
> On Wed, 28 Aug 2002, Dan Sugalski wrote:
> 
> > A medium-term goal for the regex engine is to note where a DFA would give 
> > correct behaviour and use one, rather than going through the more 
> > expensive generalized regex engine we'd otherwise use.
> 
> What would it take to make an experimental prototype of this?
> ...
> Suppose that someday we have a DFA-based regex engine that can only handle 
> a subset of the full pattern syntax.  I hope that we'll still be able to 
> automatically factor out subpatterns of more complex patterns, so that the 
> parts which are simple enough can be implemented with a DFA, and let the 
> expensive generalized regex engine handle the rest...  (I hope it wouldn't 
> be necessary for the programmer to explicitly separate out the subpatterns 
> which could be optimized!)


Check out the parrot source and read languages/regex/README. I talk
about precisely this optimization in the section "Longer-term
optimization vague ideas". (Heh.) I thought a bunch more about this a
while back, but didn't put any of it in the README. Didn't really come
up with any brilliant ideas anyway.

languages/regex is set up to make it straightforward to implement
precisely this type of transparent optimization. Of course, nobody but
me has ever attempted doing anything, but if you'd like to give it a
shot, here's how:

In languages/regex/lib/Regex/Ops/Tree.pm, there is a start of an
implementation of this, done back before I got distracted by the urge
to make the regex compiler actually work for simple cases first. ;-)
The way it works is that you have a tree of regex operator nodes, each
represented by a Tree object. There are various methods on these nodes
that compute information that might be useful to an optimizer. For
example, minlen() and maxlen() return the fewest and most characters
that a subexpression could possibly match; it's used to convert things
like /a[bB]c+def/ to a code sequence resembling

  check length >= 6
  match a
  match [bB]
  match c+
  check length >= 3
  match d
  match e
  match f

rather than checking the input stream's length just before matching
any single character.

Anyway, I also have a method dfa_safe() that is supposed to return
whether or not a subtree would return the same result on a DFA or
traditional backtracking NFA. The implementation is mostly trivial (if
you see capturing parens, FALSE. A sequence of nodes is only dfa_safe
if all subnodes are. Etc.) For alternation, it is conservative and
only returns true if the first character in every branch of the
alternation is different, and if every branch is itself dfa_safe. (And
I've never run the code, even once, so it doesn't actually work
anyway!)

The code for making use of this information is in Regex/PreOptimize.pm
(possibly to be renamed to TreeOptimize.pm). Look at
disable_implicit_checks() for an example of performing the length
checking optimization described above. This whole module is for
adding, deleting, and changing nodes in the expression tree according
to a set of optimizations. You can also subclass it to override the
optimize_tree() method to add in your optimization, like so:

 sub optimize_tree {
     my ($self, tree) = @_;
     $tree = $self->SUPER::optimize_tree($tree);
     ...perform your optimization...
     return $tree;
 }

The next step is to convert the optimized tree into what I call List
ops, which are just slightly higher level than pasm ops. You might
want to create a state table lookup op, for example. Look at
Regex/Rewrite.pm and Regex/Rewrite/Stackless.pm for many examples of
this (Stackless subclasses Rewrite and does most of the work.)

Finally, you'll need to implement the pasm rendering of your new List
ops. That goes in Regex/CodeGen/Pasm.pm.

The only major thing that I might anticipate changing anytime soon is
that I might add another CodeGen backend for PIR, the stuff
languages/imcc feeds off of. But I'll probably still use pasm as the
fallback, so it should still work to just implement things in Pasm.pm.

See the top of README for how to run things. I'll be happy to have
someone else playing with this stuff while I'm off integrating with
perl6, so don't hesitate to ask questions. And once you start to get
involved, we'll suck you into working on the immediate needs sooner or
later, never fear!

Anyone else notice that imcc eats something called PIR, for _P_arrot
_I_ntermediate um... _R_anguage? I think Melvin was avoiding PIL
because it might tempt someone to reimplement it and advertise it with
"Okay, people, it's time for Parrot to swallow a better PIL!"

Ouch. Apologies in advance.

Re: Possible regex engine optimizations?

Reply via email to