Re: The dreaded regex patch

Steve Fink Wed, 09 Jan 2002 13:01:32 -0800

On Wed, Jan 09, 2002 at 12:44:45PM -0800, Brent Dax wrote:
> Steve Fink:
> # On Wed, Jan 09, 2002 at 03:16:40AM -0800, Brent Dax wrote:
> #
> # Also, why specific opcodes for \w etc.? I would think you'd do those
> # as constants. (Constant bitmaps for 8-bit, or constant
> # MultiLevelFunkyThings for Unicode.)
> 
> Optimization, m'boy.  I can just do a bunch of less-than/greater-than
> tests.  (Besides, rx_oneof currently doesn't use bitmaps because I was
> in a rush and couldn't figure out how to generate them more efficiently
> than just doing a binary search on text.)


I wonder if the optimization opcodes should be kept separate, so if
you wanted to "port" regular expressions to a totally different type
of input, it would be clear what the bare minimum are that need to be
implemented. rxo_*?

> # The main thrust of the optimizer is to try to find portions of the RE
> # to collapse into a DFA (or DFA-like thing). For example,
> # /(?:foo|bar)d/ is matched identically with a DFA or NFA, while
> # /(?:need|needle)le(ss)?/ is not. But this quickly makes me wonder: are
> # optimizations like finding a fixed "abc" substring within the input
> # really any faster than walking through a lookup table-based state
> # machine? I don't buy the Boyer-Moore argument that it's faster because
> # you can advance by more than one character at a time -- you're
> # stuffing the input into the cache either way, you may as well look at
> # all of it. I suppose if you're doing funky backwards matching for
> # a+bc+ it cuts out some backtracking, but that seems tricky to ensure
> # you're matching the right stuff. /a+bc+/ works well, but
> # /grand(ma|mother)a+bc+/ had better be darn sure it doesn't bite off
> # grandma's toe.
> 
> That was an *example*.  Since I haven't run benchmarks, I cna't say if
> it would be faster or not--I'm just making the point that sometimes
> there are faster ways TDI than the literal translation.

Oh, sorry. I wasn't responding to anything in your patch. Just general
ramblings, mostly thinking about perl5's engine. I am very much in the
nonliteral translation camp.

(?:need|needle) -> need(?:|le) -> need(?:le)??
(?:needle|need) -> need(?:le|) -> need(?:le)?

> By the way, I welcome seeing other regex implementations.  New
> approaches can only improve the final design.

Naw, I'll probably just retarget the optimizer to your opcodes to
avoid redundant work. Yours are extremely similar to mine,
unsurprisingly. I just organized the internals a little differently.

Re: The dreaded regex patch

Reply via email to