Re: The dreaded regex patch

Steve Fink Wed, 09 Jan 2002 12:09:42 -0800

On Wed, Jan 09, 2002 at 03:16:40AM -0800, Brent Dax wrote:
> Okay, here it is.  Attached is the regular expression patch.  It


Is rx_advance necessary? What's the difference between

  /R/

and

  /^(.*?R)/

if you count the parens $& $1 $2 ... instead of $1 $2 $3 ...?

Speed, I guess?

(Okay, I really meant /\A([\w\W]*?R)/ )

Also, why specific opcodes for \w etc.? I would think you'd do those
as constants. (Constant bitmaps for 8-bit, or constant
MultiLevelFunkyThings for Unicode.)

I have a barely begun parrot RE implementation of my own, and a much
further along RE opcode generator & optimizer, but no time to work on
either. Maybe I should dust off the optimizer.

The main thrust of the optimizer is to try to find portions of the RE
to collapse into a DFA (or DFA-like thing). For example,
/(?:foo|bar)d/ is matched identically with a DFA or NFA, while
/(?:need|needle)le(ss)?/ is not. But this quickly makes me wonder: are
optimizations like finding a fixed "abc" substring within the input
really any faster than walking through a lookup table-based state
machine? I don't buy the Boyer-Moore argument that it's faster because
you can advance by more than one character at a time -- you're
stuffing the input into the cache either way, you may as well look at
all of it. I suppose if you're doing funky backwards matching for
a+bc+ it cuts out some backtracking, but that seems tricky to ensure
you're matching the right stuff. /a+bc+/ works well, but
/grand(ma|mother)a+bc+/ had better be darn sure it doesn't bite off
grandma's toe.

Not sure why I'm writing up these musings, since they don't apply to
the opcode engine.

Re: The dreaded regex patch

Reply via email to