RE: The dreaded regex patch

Brent Dax Wed, 09 Jan 2002 12:39:31 -0800

Steve Fink:
# On Wed, Jan 09, 2002 at 03:16:40AM -0800, Brent Dax wrote:
# > Okay, here it is.  Attached is the regular expression patch.  It
#
# Is rx_advance necessary? What's the difference between
#
#   /R/
#
# and
#
#   /^(.*?R)/
#
# if you count the parens $& $1 $2 ... instead of $1 $2 $3 ...?


Er, that would combine $` and $&.  I assume you meant

        /^.*?(R)/

Well, here are my reasons:

1. Intuitiveness
        Start and end indices for the match are different psychologically
        from group start and end.
2. Indirections
        Start and end indices are stored in PerlArrays.  Best to keep their
        use down to a minimum.
3. Understandability
        "What's all this extra crap at the beginning of my regex?"  "Why are
        we looking up group 0?"
4. Right matching vs. backwards matching
        See the definition of the /r flag in the docs for rx_setprops.
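
To make the comparison concrete, here is a minimal sketch in Python's re module (not Parrot's regex ops; the sample string is made up for illustration). It shows that the anchored-prefix emulation finds R at the same place as a plain unanchored search, but the overall match then also covers the skipped prefix, so the "real" match has to be read out of a group:

```python
import re

text = "xxRyy"  # hypothetical sample input

# Unanchored search: the engine advances the start position itself,
# which is roughly the job an explicit "advance" step does.
direct = re.search(r"R", text)

# Emulation: anchor at the start, lazily skip a prefix, capture R in group 1.
# (?s) lets "." match newlines too, like [\w\W].
emulated = re.match(r"(?s)^.*?(R)", text)

print(direct.span())     # span of the whole match
print(emulated.span(1))  # span of group 1, same position as the direct match
print(emulated.span())   # whole match now also includes the skipped prefix
```

With the emulation, the caller has to know that group 1 is the match proper and that the whole-match indices are no longer meaningful, which is the bookkeeping reasons 1 and 3 complain about.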

# Speed, I guess?
#
# (Okay, I really meant /\A([\w\W]*?R)/ )
#
# Also, why specific opcodes for \w etc.? I would think you'd do those
# as constants. (Constant bitmaps for 8-bit, or constant
# MultiLevelFunkyThings for Unicode.)

Optimization, m'boy.  I can just do a bunch of less-than/greater-than
tests.  (Besides, rx_oneof currently doesn't use bitmaps because I was
in a rush and couldn't figure out how to generate them more efficiently
than just doing a binary search on text.)
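
For illustration only, the three membership strategies mentioned here can be sketched in Python as follows. This is not the patch's C code; all function names are hypothetical, and the word-character test assumes plain ASCII:

```python
from bisect import bisect_left

def is_word_char(ch):
    """ASCII-only word-character test using range comparisons."""
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or ("0" <= ch <= "9") or ch == "_"

def make_bitmap(chars):
    """Build a 256-bit bitmap (held in one int) for a set of 8-bit characters."""
    bitmap = 0
    for ch in chars:
        code = ord(ch)
        if code > 255:
            raise ValueError("bitmap sketch only covers 8-bit characters")
        bitmap |= 1 << code
    return bitmap

def in_bitmap(bitmap, ch):
    """Membership is a single shift-and-mask once the bitmap exists."""
    code = ord(ch)
    return code < 256 and (bitmap >> code) & 1 == 1

def in_sorted(sorted_chars, ch):
    """Binary search over a sorted list of characters."""
    i = bisect_left(sorted_chars, ch)
    return i < len(sorted_chars) and sorted_chars[i] == ch

vowels = "aeiou"
vowel_bitmap = make_bitmap(vowels)
vowel_list = sorted(vowels)
print(is_word_char("_"), in_bitmap(vowel_bitmap, "e"), in_sorted(vowel_list, "b"))
```

The range test needs no table at all, which is why a fixed class like \w can get away with a handful of comparisons, while an arbitrary set needs either the bitmap or the search.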

# I have a barely begun parrot RE implementation of my own, and a much
# further along RE opcode generator & optimizer, but no time to work on
# either. Maybe I should dust off the optimizer.

Go ahead.

# The main thrust of the optimizer is to try to find portions of the RE
# to collapse into a DFA (or DFA-like thing). For example,
# /(?:foo|bar)d/ is matched identically with a DFA or NFA, while
# /(?:need|needle)le(ss)?/ is not. But this quickly makes me wonder: are
# optimizations like finding a fixed "abc" substring within the input
# really any faster than walking through a lookup table-based state
# machine? I don't buy the Boyer-Moore argument that it's faster because
# you can advance by more than one character at a time -- you're
# stuffing the input into the cache either way, you may as well look at
# all of it. I suppose if you're doing funky backwards matching for
# a+bc+ it cuts out some backtracking, but that seems tricky to ensure
# you're matching the right stuff. /a+bc+/ works well, but
# /grand(ma|mother)a+bc+/ had better be darn sure it doesn't bite off
# grandma's toe.

That was an *example*.  Since I haven't run benchmarks, I can't say if
it would be faster or not--I'm just making the point that sometimes
there are faster ways to do it than the literal translation.
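
As a sketch of what "something other than the literal translation" could look like, here is a fixed-substring prefilter for /a+bc+/ in Python. The function name is hypothetical, and no claim is made that this is actually faster; that would need the benchmarks nobody has run yet:

```python
import re

PATTERN = re.compile(r"a+bc+")
# Any match of a+bc+ ends its a-run with "a", then has "b", then starts its
# c-run with "c", so the literal "abc" must occur somewhere in the input.
REQUIRED = "abc"

def search_with_prefilter(text):
    """Reject inputs lacking the required literal before running the regex."""
    if REQUIRED not in text:
        return None
    return PATTERN.search(text)

hit = search_with_prefilter("xxaaabccc")
miss = search_with_prefilter("aabbcc")
print(hit.group() if hit else None, miss)
```

The prefilter only rules inputs out; the regex still does the real matching, so correctness does not depend on the shortcut.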

By the way, I welcome seeing other regex implementations.  New
approaches can only improve the final design.

--Brent Dax
[EMAIL PROTECTED]
Configure pumpking for Perl 6

<obra> mmmm. hawt sysadmin chx0rs
<lathos> This is sad. I know of *a* hawt sysamin chx0r.
<obra> I know more than a few.
<lathos> obra: There are two? Are you sure it's not the same one?
