On Wed, Jan 09, 2002 at 03:16:40AM -0800, Brent Dax wrote:
> Okay, here it is.  Attached is the regular expression patch.  It

Is rx_advance necessary? What's the difference between

  /R/

and

  /^(.*?R)/

if you count the parens $& $1 $2 ... instead of $1 $2 $3 ...?

Speed, I guess?

(Okay, I really meant /\A([\w\W]*?R)/ )

Also, why specific opcodes for \w etc.? I would think you'd do those
as constants. (Constant bitmaps for 8-bit, or constant
MultiLevelFunkyThings for Unicode.)

I have a barely begun parrot RE implementation of my own, and a much
further along RE opcode generator & optimizer, but no time to work on
either. Maybe I should dust off the optimizer.

The main thrust of the optimizer is to try to find portions of the RE
to collapse into a DFA (or DFA-like thing). For example,
/(?:foo|bar)d/ is matched identically with a DFA or NFA, while
/(?:need|needle)le(ss)?/ is not. But this quickly makes me wonder: are
optimizations like finding a fixed "abc" substring within the input
really any faster than walking through a lookup table-based state
machine? I don't buy the Boyer-Moore argument that it's faster because
you can advance by more than one character at a time -- you're
stuffing the input into the cache either way, you may as well look at
all of it. I suppose if you're doing funky backwards matching for
a+bc+ it cuts out some backtracking, but that seems tricky to ensure
you're matching the right stuff. /a+bc+/ works well, but
/grand(ma|mother)a+bc+/ had better be darn sure it doesn't bite off
grandma's toe.

Not sure why I'm writing up these musings, since they don't apply to
the opcode engine.

Reply via email to