On Wed, Jan 09, 2002 at 03:16:40AM -0800, Brent Dax wrote: > Okay, here it is. Attached is the regular expression patch. It
Is rx_advance necessary? What's the difference between /R/ and /^(.*?R)/ if you count the parens $& $1 $2 ... instead of $1 $2 $3 ...? Speed, I guess? (Okay, I really meant /\A([\w\W]*?R)/ ) Also, why specific opcodes for \w etc.? I would think you'd do those as constants. (Constant bitmaps for 8-bit, or constant MultiLevelFunkyThings for Unicode.) I have a barely begun parrot RE implementation of my own, and a much further along RE opcode generator & optimizer, but no time to work on either. Maybe I should dust off the optimizer. The main thrust of the optimizer is to try to find portions of the RE to collapse into a DFA (or DFA-like thing). For example, /(?:foo|bar)d/ is matched identically with a DFA or NFA, while /(?:need|needle)le(ss)?/ is not. But this quickly makes me wonder: are optimizations like finding a fixed "abc" substring within the input really any faster than walking through a lookup table-based state machine? I don't buy the Boyer-Moore argument that it's faster because you can advance by more than one character at a time -- you're stuffing the input into the cache either way, you may as well look at all of it. I suppose if you're doing funky backwards matching for a+bc+ it cuts out some backtracking, but that seems tricky to ensure you're matching the right stuff. /a+bc+/ works well, but /grand(ma|mother)a+bc+/ had better be darn sure it doesn't bite off grandma's toe. Not sure why I'm writing up these musings, since they don't apply to the opcode engine.