Re: Benchmarking the proposed RE engine

Bart Lateur Mon, 26 Nov 2001 04:20:40 -0800

On Sun, 25 Nov 2001 19:34:15 -0800, Brent Dax wrote:

>Perl 5's REs will always appear faster because Perl 5 has an
>intelligent, optimizing regex compiler.  For example, take the following
>simple regex:
>
>       /a+bc+/
>
>pregcomp will optimize that by searching for a 'b' and working outwards
>both ways from there.  (Actually, it might search for 'abc' and work
>from there; I'm not really sure.)  Without considering pregcomp's
>optimizations, that RE is pretty easy to write in Parrot:
>
>RE_0:
>       #/a+bc+/
>       rx_minlength 3
>       branch $start
>$advance:
>       rx_advance $fail
>$start:
>       rx_literal P0, "a", $advance
>$a_loop:
>       rx_literal P0, "a", $b
>       rx_pushindex P0
>       branch $a_loop
>$a_back:
>       rx_popindex P0, $advance
>$b:
>       rx_literal P0, "bc", $a_back
>$c_loop:
>       rx_literal P0, "c", $succeed
>       branch $c_loop
>$succeed:
>       rx_succeed P0
>$fail:
>       rx_fail P0


Before we go down that road for good, may I draw your attention to the
principle of a regex matcher that appeared in an article in DDJ a few
years ago? The name was "Grouse Grep", it has a website
(<http://www.grouse.com.au/ggrep/>), but I think (I'm not absolutely
sure) that the latest version has stepped away from the principle that
made the original interesting.

That principle was, in a nutshell: implement each character class as a
jump table.

I guess that is unfeasable for Unicode, but for 8 bit characters it's
easy to do.

What you need is a table of 256 entries for each state, and using the
next character in the input stream, you simply look up the address of
the next state in the lookup table. Yes, that means that you *always*
use character classes, even if just to match one literal character.

The original used i386 machine code to do it, (I think it was a
combination of LODSB to fetch the byte, and XLAT to look up the lowest
byte of the address in the jump table), but I would think that jumping
through a lookup table should be fairly easy to implement on top of
Parrot VM instructions.

The DDJ article appeared in november 1997, but it looks like it's not
online. (The table of contents for that issue is at
<http://www.ddj.com/articles/1997/9711/>)

(Grouse Grep 2 appears to be released under the GNU license, but I
wouldn't *use* the code, only re-implement the idea.)

-- 
        Bart.

Re: Benchmarking the proposed RE engine

Reply via email to