On Mon, May 06, 2002 at 08:44:03PM -0700, Mark Kvale wrote:
>
> To compare schemes (A), (B), and (C), I implemented each of them. I
> created a common parser for the three schemes that recognizes the
> basics: ^, $, ., a, \a, [a], [^a], ab, a|b, a*, a+, a? and (a). The
> parsed regex can be converted to one of three types of bytecode:
>
> A. 'reg'  - Parrot bytecode, just core ops
> B. 'rx'   - Parrot bytecode, core and rx ops
I haven't tested the rx backend in a while, but the Regex package in
languages/regex also supports multiple backends. At one point, it
produced three different flavors: rx opcodes, regular Parrot opcodes,
and my own re opcodes. That last was the simplest in terms of code
generation and probably produced the smallest bytecode, but it consumed
much more memory at runtime, so I abandoned it in favor of the other
two. (Basically, I abandoned implicit backtracking in favor of explicit
backtracking, as in Brent Dax's code.)

> C. 'spen' - a specialized bytecode used by the execution engine in
>             Henry Spencer's original regex package. His package
>             formed the basis for the perl5 regex engine.
>
> For the reg backend, I made use of assembly idioms used in Steve
> Fink's Regex package. For the rx backend I used idioms suggested by
> the rx pod.

And the Regex package stole most of its idioms from Brent's rx pod. :-)
The rx pod has a couple of bugs, however; I should write those up
sometime.

> For the spen backend, I generated bytecode that exactly
> matched Spencer's bytecode, so that I could use his regexec routines
> directly for execution. Although not entirely bug-free, all three
> backends pass the match/no-match portion of the Spencer test suite, so
> they have a basic level of functionality. The code is available at
>
>   http://keck.ucsf.edu/~kvale/simple_rx.tar.gz
>
> and is meant to be unpacked in the languages/ directory.

Have you tried using languages/regex/regex.pl and comparing the
generated code? (You can disable either or both optimization steps
pretty easily if you like.) The only thing it's missing from your list
is character classes, since I'm still of the opinion that those need to
be implemented in the string layer. And it leaves a ton of crap on the
interpreter's user stack, since that's what it uses for backtracking.
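For anyone following along who hasn't looked at the code: the explicit-backtracking style mentioned above can be sketched in a few lines. This is a toy Python illustration, not Parrot code or the Regex package's actual output; the ("char", ...), ("split", ...) opcodes are invented for the example.

```python
# Explicit backtracking: instead of recursing, the matcher pushes
# (pc, pos) choice points onto its own stack and pops one on failure.
# Note that a successful match can leave unused choice points behind --
# the same "crap left on the stack" issue as above.

def match(prog, text):
    """Run a tiny bytecode program against text; True if it matches at pos 0."""
    stack = []              # explicit backtrack stack of (pc, pos) pairs
    pc, pos = 0, 0
    while True:
        op = prog[pc]
        if op[0] == "char":                # ("char", c): consume literal c
            if pos < len(text) and text[pos] == op[1]:
                pc, pos = pc + 1, pos + 1
                continue
        elif op[0] == "split":             # ("split", a, b): try a, fall back to b
            stack.append((op[2], pos))     # push the alternative as a choice point
            pc = op[1]
            continue
        elif op[0] == "jmp":               # ("jmp", a): unconditional branch
            pc = op[1]
            continue
        elif op[0] == "match":             # success
            return True
        # current path failed: backtrack if any choice points remain
        if not stack:
            return False
        pc, pos = stack.pop()

# /ab|ac/ compiled by hand into the toy bytecode:
prog = [
    ("split", 1, 4),                       # try "ab" branch, else "ac" branch
    ("char", "a"), ("char", "b"), ("jmp", 6),
    ("char", "a"), ("char", "c"),
    ("match",),
]
```

With an implicit-backtracking design the choice points live on the language's own call stack instead, which is what cost the re-opcode backend its memory.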
I could add a simple cleanup now, but it would be slow -- it would pop
things off until the stack depth returned to what it was upon entry to
the regex subroutine.

> Execution time is perhaps the most crucial metric. perl5 excels at
> searching large amounts of text quickly; most users would not want to
> give that up.
>
> We'll model the execution time of a regex test as follows:
>
>   t[i] = co + n * (lo + r[i])
>
> with
>
>   t[i] = execution time for the whole test of regex i
>   co   = a constant overhead, approximately independent of the regex
>   n    = number of repetitions of the match
>   lo   = an overhead per match, approximately independent of the regex
>   r[i] = time taken by just the regex code for a single match
>
> co is dependent on the test harness used and is uninteresting. lo
> takes into account the allocation, initialization, and cleanup time
> needed for any single regex match and is relevant for us; each regex
> takes at the very least lo time. r[i] is the time taken to do the
> actual match against a string and is dependent on both the string and
> the regex.
>
> We can solve for co by testing against a fixed regex and string and
> varying the number of repetitions. Given co, we can solve for lo by
> keeping repetitions fixed and varying the size of a repetitive regex
> and string. For instance, in each of the backends, matching regex
> 'aaa' against string 'aaa' will take about three times as long as
> matching regex 'a' against string 'a'. For a regex-string combination
> that can be made repetitive, the model becomes
>
>   t[i] = co + n * (lo + m * ro[i])
>
> with m the number of repetitive elements in the regex and ro[i] the
> per-unit matching time.

How well do these models explain the actual runtime? I'm just curious
whether you've done an analysis of variance to see how good the fit is.
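For concreteness, here's the two-step solve as I understand it, sketched in Python on synthetic data. The "true" co/lo/ro values are invented for illustration and are not the benchmark's numbers; with real (noisy) timings you'd look at the residuals of these fits, which is what I'm curious about.

```python
# Sketch of solving t = co + n*(lo + m*ro) in two least-squares steps.
# Step 1 varies n with a fixed regex (m = 1); step 2 varies regex size m.

def fit_line(xs, ys):
    """Least-squares fit y = a + b*x; returns (a, b)."""
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

lo, ro, co = 4.5e-06, 1.9e-07, 2.0e-03   # invented "true" values

# Step 1: fixed regex (m = 1), varying repetition count n.
# t is linear in n: slope = lo + ro, intercept = co.
ns = [1000, 2000, 4000, 8000]
ts = [co + n * (lo + ro) for n in ns]
co_est, per_match = fit_line(ns, ts)      # per_match ~= lo + ro

# Step 2: fixed n, varying regex/string size m ('a', 'aa', 'aaa', ...).
# Subtract the known co, divide by n: per-iteration time = lo + m*ro,
# so the slope is ro and the intercept is lo.
n = 4000
ms = [1, 2, 3, 4]
per_iter = [(co + n * (lo + m * ro) - co_est) / n for m in ms]
lo_est, ro_est = fit_line(ms, per_iter)
```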
> Matching regex 'a' against string 'a', we get the following times on
> my Lintel notebook (all times are in seconds):
>
> "a" =~ /a/:
>
>   backend      lo          ro       lo/ro   ro/(spen ro)
>
>   reg       4.518e-06   1.935e-07    23.3      12.8
>   rx        1.621e-06   5.387e-07     3.0      35.6
>   spen      5.477e-07   1.509e-08    36.2       1.0
>   perl      7.897e-07   1.267e-08    62.3       0.84
>
> The fixed per-match overhead lo is 3 times larger for rx than spen
> and 8 times larger for reg than spen. The per-character match time ro
> is 36 times larger for rx than for spen and 13 times larger for reg
> than for spen. That the per-character time for rx is so large seems
> counterintuitive to me. On the other hand, the rx_literal code
> computes several quantities dynamically, such as the length of the
> regex match string, that are precomputed in the reg code...

string_length really ought to be an inline function. When I was
benchmarking engines, I found that the rx engine spends a *lot* of
time transcoding the input string to UTF-32. It would be interesting
to see the numbers with that disabled (it should still work fine
without it).

> In my opinion, however, factors of 3-10 or more in execution time
> will be hard to overcome. Both the code size and execution time data
> strongly favor scheme (C), building an independent specialized
> bytecode engine into Parrot.

I assume you're using gcc, so you're getting the computed-goto runops
core. It would be interesting to see timings for all the runops
variants available to you: regular, computed goto, prederef, and JIT.
I'll dust off my patches that allow you to switch between all of them
with command-line options.

Good stuff, though! It's great to see some real numbers.

One of the things I really wanted to try is a sort of hybrid NFA/DFA
mode -- identify the maximal chunks of a regex that will produce the
same results with a DFA as they would with a traditional NFA, and
produce a table-driven state machine for them.
Then implement the state-machine traversal either in Parrot or in a
single match-with-table-until-done op (which would be going toward
more of the specialized-engine approach). I have a little bit of
discussion of this in languages/regex/README.

I like this optimization because it replaces a whole wad of more
special-case optimizations that I feel are the bane of perl5's engine.
But I won't say that publicly until I've proven it by implementing it
and benchmarking it. Oops.
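To make the table-driven idea concrete, here's roughly what such a chunk would execute. The table below is a hand-built DFA for /ab*c/, invented purely for illustration -- it is not the Regex package's actual representation:

```python
# One table lookup per input character, no backtrack stack at all --
# that's the payoff of the DFA chunks.

FAIL = -1
ACCEPT = 3

# table[state][char] -> next state; a missing entry means failure.
# States: 0 = start, 1 = seen 'a' (loops on 'b'), ACCEPT = seen 'c'.
table = {
    0: {"a": 1},
    1: {"b": 1, "c": ACCEPT},
}

def dfa_match(text):
    """True iff the whole string matches /ab*c/ via the table walk."""
    state = 0
    for ch in text:
        state = table.get(state, {}).get(ch, FAIL)
        if state == FAIL:
            return False
    return state == ACCEPT
```

The single match-with-table-until-done op would just be this loop in C, driven by a table the compiler emits for each maximal DFA-safe chunk.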