On Sun, 25 Nov 2001 19:34:15 -0800, Brent Dax wrote: >Perl 5's REs will always appear faster because Perl 5 has an >intelligent, optimizing regex compiler. For example, take the following >simple regex: > > /a+bc+/ > >pregcomp will optimize that by searching for a 'b' and working outwards >both ways from there. (Actually, it might search for 'abc' and work >from there; I'm not really sure.) Without considering pregcomp's >optimizations, that RE is pretty easy to write in Parrot: > >RE_0: > #/a+bc+/ > rx_minlength 3 > branch $start >$advance: > rx_advance $fail >$start: > rx_literal P0, "a", $advance >$a_loop: > rx_literal P0, "a", $b > rx_pushindex P0 > branch $a_loop >$a_back: > rx_popindex P0, $advance >$b: > rx_literal P0, "bc", $a_back >$c_loop: > rx_literal P0, "c", $succeed > branch $c_loop >$succeed: > rx_succeed P0 >$fail: > rx_fail P0
Before we go down that road for good, may I draw your attention to the principle of a regex matcher that appeared in an article in DDJ a few years ago? The name was "Grouse Grep", it has a website (<http://www.grouse.com.au/ggrep/>), but I think (I'm not absolutely sure) that the latest version has stepped away from the principle that made the original interesting. That principle was, in a nutshell: implement each character class as a jump table. I guess that is unfeasable for Unicode, but for 8 bit characters it's easy to do. What you need is a table of 256 entries for each state, and using the next character in the input stream, you simply look up the address of the next state in the lookup table. Yes, that means that you *always* use character classes, even if just to match one literal character. The original used i386 machine code to do it, (I think it was a combination of LODSB to fetch the byte, and XLAT to look up the lowest byte of the address in the jump table), but I would think that jumping through a lookup table should be fairly easy to implement on top of Parrot VM instructions. The DDJ article appeared in november 1997, but it looks like it's not online. (The table of contents for that issue is at <http://www.ddj.com/articles/1997/9711/>) (Grouse Grep 2 appears to be released under the GNU license, but I wouldn't *use* the code, only re-implement the idea.) -- Bart.