From: Mark-Jason Dominus <[EMAIL PROTECTED]>
Sent: Wednesday, August 30, 2000 11:54 PM


> There are two parts to the $& penalty.
> The first part [ of $& penalty is ] maintaining the information for $&.
> Maintaining this information for your prepos() function is going to
> incur an identical cost.

I am unencumbered by any knowledge of the regex engine implementation, but
the naive assumption is that it could be cheaper to keep a starting match
position than to allocate memory and copy the matching characters.

The difference is trivial for smaller strings, but might become significant
as string length increases.  The RFC's example may have come from the human
genome project, which I believe deals with longish strings.
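
To make that naive assumption concrete, here is a minimal C sketch of the
two kinds of bookkeeping. It is not taken from the regex engine; the names
match_span, save_offsets and save_copy are hypothetical and only
illustrate why one cost is constant while the other grows with the length
of the saved text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: where a match began and ended in the subject. */
struct match_span {
    size_t start;  /* offset of the first matched character */
    size_t end;    /* offset one past the last matched character */
};

/* Saving offsets costs the same however long the string is. */
static struct match_span save_offsets(size_t start, size_t end)
{
    assert(start <= end);
    struct match_span m = { start, end };
    return m;
}

/* Saving a copy allocates and copies (end - start) bytes, so the cost
 * grows with the length of the text. Caller frees the result. */
static char *save_copy(const char *subject, size_t start, size_t end)
{
    assert(subject != NULL && start <= end);
    size_t len = end - start;
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, subject + start, len);
    copy[len] = '\0';
    return copy;
}
```

With offsets, the text can still be recovered later from the subject
string, provided the subject has not been modified in the meantime.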

> The big thing I find missing from this RFC is compelling examples.

Reading between the lines of the RFC, I think the rationale is performance
and optimization.  In particular, why shouldn't:

  $dna_string =~ /(GAAC)+$/

be as fast as:

  $dna_string =~ /^(GAAC)+/
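
As a rough illustration of why the two could cost the same, here is a
hand-written C sketch that counts repeats of a unit anchored at either end
of a buffer. It is not how the regex engine works, the function names are
hypothetical, and it ignores details such as $ matching before a trailing
newline. The point is only that a scan starting from the end does work
proportional to the matched tail, just as the forward scan does for the
matched head.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count copies of unit at the start of s, scanning forwards.
 * Roughly what /^(GAAC)+/ has to verify. */
static size_t count_leading(const char *s, size_t len,
                            const char *unit, size_t ulen)
{
    assert(s != NULL && unit != NULL && ulen > 0);
    size_t n = 0;
    while ((n + 1) * ulen <= len &&
           memcmp(s + n * ulen, unit, ulen) == 0)
        n++;
    return n;
}

/* Count copies of unit at the end of s, scanning backwards from the
 * last character. Roughly what /(GAAC)+$/ has to verify. */
static size_t count_trailing(const char *s, size_t len,
                             const char *unit, size_t ulen)
{
    assert(s != NULL && unit != NULL && ulen > 0);
    size_t n = 0;
    while ((n + 1) * ulen <= len &&
           memcmp(s + len - (n + 1) * ulen, unit, ulen) == 0)
        n++;
    return n;
}
```

Neither function looks at any character outside the repeated run, apart
from the one comparison that fails and ends the loop.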

> ... making the regex engine general enough to support this feature
> might make it much slower even for patterns that don't use the feature.
> ...  The inner loop is full of s++ expressions that advance the pointer
> one character at a time.  If the engine has to match backwards
> also, these s++'es are all going to have to change to s+=d's or (d ?
> s++ : s--)'es or some such.

I agree that the regex engine performance shouldn't be allowed to degrade.
It might be possible to hoist this imagined direction test out of the loop:
in other words, much of the "advance the pointer" code in the regex engine
gets duplicated, once for going forwards and once for going backwards.
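
A minimal C sketch of that idea, using a toy loop that counts a run of one
character rather than anything from the real engine. All function names
are hypothetical. The sketch uses offsets from s instead of s++ and s--,
so that no pointer is ever stepped before the start of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Generic version: the direction is tested on every step. */
static size_t run_generic(const char *s, size_t avail, char c,
                          int backwards)
{
    assert(s != NULL);
    size_t n = 0;
    while (n < avail && *(backwards ? s - n : s + n) == c)
        n++;
    return n;
}

/* Forward-only loop: same shape as the existing s++ code. */
static size_t run_forward(const char *s, size_t avail, char c)
{
    size_t n = 0;
    while (n < avail && *(s + n) == c)
        n++;
    return n;
}

/* Backward-only loop: a duplicate with the direction reversed. */
static size_t run_backward(const char *s, size_t avail, char c)
{
    size_t n = 0;
    while (n < avail && *(s - n) == c)
        n++;
    return n;
}

/* Hoisted version: the direction is tested once, outside the loops. */
static size_t run_hoisted(const char *s, size_t avail, char c,
                          int backwards)
{
    assert(s != NULL);
    return backwards ? run_backward(s, avail, c)
                     : run_forward(s, avail, c);
}
```

The trade-off is code size: every loop that advances the pointer would
need a backwards twin, but patterns that only match forwards would run the
same forward loop they do today.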

  mike mulligan
