I applied the changes to the code, using a quantified capture for the initial strip. I used \> instead of <gt>, but I didn't notice any real difference, even when I profiled it. For the matching, a capturing regex didn't work well because it has to backtrack, which slowed it down too much to be worth the simplicity, so I stuck with the coroutine. The code that uses a capturing regex for the final substitution is still there, commented out. PGE seems faster with the coroutine.

There's a marked improvement in speed. The one benchmark file that took 13 minutes now finishes in under 8. I haven't tried the full data yet, which is a file five times larger.

This is my first real attempt at anything to do with Perl 6 rules. I'm just learning as I go, using Synopsis 5 for reference.

Attachment: regexdna.pir

On Dec 16, 2005, at 11:35 PM, Patrick R. Michaud wrote:

I don't know all of the details and restrictions of the benchmark,
and I'll be the first to claim that PGE can be slow at times (it has very
few optimizations built-in).  But we may have a few tricks available
to try.

First, note that <gt> is a subrule and subrules involve extra
subroutine call overhead (with a lot of setup and take-down).
Using C<< \> >> should be much much much faster, as it's a simple
string comparison.

Instead of repeatedly calling the pattern via "next", I'd
just use a quantified capture and get all of the things to be
stripped all at once.  Thus perhaps something like:

    pattern = '[ ( [ \> \N*: ] \n ) | \N*: (\n) ]*'
    rulesub = p6rule_compile(pattern)
    match = rulesub(seq)

This gives us a single match object, with match[0] as an array
of the captured portions.  We can then just walk through the
captured portions (in reverse order) and remove the substrings--
something like:

    .local pmc capt
    capt = match[0]            # capt is an array of Match
  stripfind:
    unless capt goto endstripfind
    $P0 = pop capt             # remove last capture
    $I0 = $P0."from"()         # get starting pos
    $I1 = $P0."to"()           # get ending pos
    $I1 -= $I0                 # convert to length
    substr seq, $I0, $I1, ''   # remove unwanted portion
    goto stripfind
  endstripfind:
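
The reason for walking the captures in reverse order can be shown with a
minimal sketch in Python rather than PIR. It is only an analogue, not PGE
code: the sample input, the pattern, and the helper name strip_spans are
assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the benchmark input: ">" header lines
# followed by newline-terminated sequence lines.
seq = ">ONE Homo sapiens alu\nGGCCGGGC\nGCGGTGGC\n>TWO IUB ambiguity codes\ncttBtatc\n"

# Match either a whole header line (with its newline) or a bare newline.
strip_pattern = re.compile(r">[^\n]*\n|\n")


def strip_spans(text, pattern):
    """Remove every match of pattern from text, last match first."""
    # Collect (start, end) offsets for all matches in a single pass.
    spans = [m.span() for m in pattern.finditer(text)]
    # Popping from the end removes the last span first, so the offsets
    # of the spans still waiting in the list stay valid.
    while spans:
        start, end = spans.pop()
        text = text[:start] + text[end:]
    return text


cleaned = strip_spans(seq, strip_pattern)
print(cleaned)
```

Removing from the front instead would shift every later offset by the
length of the text already removed, so each remaining position would
have to be adjusted before use.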

Hope this helps at least a little bit.  It's still likely
to be somewhat slow.  We may also be able to get some improvements
by implementing the :g modifier for the repeated captures, and
being able to compile (or use) whole substitutions as opposed to
just rules.

Pm
