On Wed, Jan 09, 2002 at 12:44:45PM -0800, Brent Dax wrote: > Steve Fink: > # On Wed, Jan 09, 2002 at 03:16:40AM -0800, Brent Dax wrote: > # > # Also, why specific opcodes for \w etc.? I would think you'd do those > # as constants. (Constant bitmaps for 8-bit, or constant > # MultiLevelFunkyThings for Unicode.) > > Optimization, m'boy. I can just do a bunch of less-than/greater-than > tests. (Besides, rx_oneof currently doesn't use bitmaps because I was > in a rush and couldn't figure out how to generate them more efficiently > than just doing a binary search on text.)
I wonder if the optimization opcodes should be kept separate, so if you wanted to "port" regular expressions to a totally different type of input, it would be clear what the bare minimum are that need to be implemented. rxo_*? > # The main thrust of the optimizer is to try to find portions of the RE > # to collapse into a DFA (or DFA-like thing). For example, > # /(?:foo|bar)d/ is matched identically with a DFA or NFA, while > # /(?:need|needle)le(ss)?/ is not. But this quickly makes me wonder: are > # optimizations like finding a fixed "abc" substring within the input > # really any faster than walking through a lookup table-based state > # machine? I don't buy the Boyer-Moore argument that it's faster because > # you can advance by more than one character at a time -- you're > # stuffing the input into the cache either way, you may as well look at > # all of it. I suppose if you're doing funky backwards matching for > # a+bc+ it cuts out some backtracking, but that seems tricky to ensure > # you're matching the right stuff. /a+bc+/ works well, but > # /grand(ma|mother)a+bc+/ had better be darn sure it doesn't bite off > # grandma's toe. > > That was an *example*. Since I haven't run benchmarks, I cna't say if > it would be faster or not--I'm just making the point that sometimes > there are faster ways TDI than the literal translation. Oh, sorry. I wasn't responding to anything in your patch. Just general ramblings, mostly thinking about perl5's engine. I am very much in the nonliteral translation camp. (?:need|needle) -> need(?:|le) -> need(?:le)?? (?:needle|need) -> need(?:le|) -> need(?:le)? > By the way, I welcome seeing other regex implementations. New > approaches can only improve the final design. Naw, I'll probably just retarget the optimizer to your opcodes to avoid redundant work. Yours are extremely similar to mine, unsurprisingly. I just organized the internals a little differently.