Re: RFC 72 (v1) The regexp engine should go backward as well as forward.

Peter Heslin Thu, 31 Aug 2000 14:41:41 -0700
On Wed, Aug 30, 2000 at 11:54:29PM -0400, Mark-Jason Dominus wrote:
> 
> The big thing I find missing from this RFC is compelling examples.
> You are proposing a major change to the regex engine but you only have
> two examples.  Both involve only fixed strings and one of them is
> artificial.  I really think you need to discuss in more detail why
> this feature would be useful.

Yes, I see now, after going back and rereading the RFC that I jumped
straight to my solution, and never made quite clear what the problem
is that my proposal is meant to solve.

Simply put, I want variable-length lookbehind.  I thought of ways to
implement this, and the RFC was the result.  I still think that my idea
is the most general, flexible, and straightforward to implement a
solution, but I will revise the RFC to make it clear why I want this
capability.

> 
> You specifically said that you wanted your feature to be able to match
> expressions other than fixed strings, but you didn't give any examples
> of that.
> 

As you say, the fact that I only gave examples where the regexp is a
fixed string did not help to make it clear that my aim is to implement a
general solution to the problem of variable-length lookbehind.  I will
fix this.

> OK, now here it's not really clear why you would want to use your
> feature instead of doing something like this instead:
> 
>         while (m/GAAC/g) {      
>           last if substr($_, pos($_)-5, 5) eq 'GAATT';
>           last if ...;
>           ...;
>         }
> 
> You could make an argument that yours is more compact, but my version
> it could easily be wrapped into a subroutine, and it doesn't seem like
> a particularly common operation, so it doesn't seem like there needs
> to be another way to say this.  Of course, I might have completely
> missed the point.  More and better examples would be a great help
> here.

Your code is an alternative way of expressing what I meant when I wrote:

        you could (a) code the whole match, including all the possibile
        prefixes, into the regexp or (b) use GAAC as the match string and
        then select the particular hits you want by positive and negative
        lookbehind. 

And it has the same problem: only works for fixed-length strings, and
you have to enumerate each one separately.  

I would dispute the assertion that "it doesn't seem like a
particularly common operation".  If it was deemed worth hacking
fixed-length lookbehind into the Perl5 regexp engine, then surely it
would be worth designing a place for variable-length lookbehind into the
new Perl6 engine, performance considerations permitting (on which, see
below).

> > As a frivolous illustration, the string 
> > 
> >     ABCDEFGHIJKLM
> > 
> > would be matched by:
> > 
> >     m/FG(?r)EDCB(?f)HIJK(?r)A^(?f)LM$/
> 
> If I understand your proposal correctly, it will not change the
> behavior of the regex if you collect the (?f) and (/r) sesctions
> together.  If this is true, then these all have the same meaning:
> 
>      m/FG(?r)EDCB(?f)HIJK(?r)A^(?f)LM$/       # Your example
>      m/FGHIJK(?r)EDCB(?r)A^(?f)LM$/
>      m/FGHIJK(?r)EDCBA^(?f)LM$/
>      m/FGHIJKLM$(?r)EDCBA^/                   # Why not just say this?
> 
> If I am correct, then it doesn't appear that there is ever any reason
> to have more than one (?r) and one (?f) in a single regex.  Also,
> since there is in effect an implicit (?f) at the beginning of every
> regex, you don't need a (?f) escape at all, as in the example I just
> showed.  

This is true, except possibly for motives of performance optimization,
which, as you point out below, is fruitless to rely on at this stage.
Nevertheless, if you imagine that you replace the letter `H' in those
examples with a really nasty, long and complex bit of a regex, then is
it not possible that you might want the match as a whole to fail if
it fails to match EDCB, before even trying the part of the regexp in the
position of `H'?

> 
> Did I misunderstand your proposal?  Or did I miss seeing the
> implication of some example that you didn't include?  If I am correct,
> I think you should eliminate (?f) from your proposal, since it is not
> useful.

I will point out that it is not strictly necessary, but that there
yet might be reasons why it would be convenient to have it.

> 
> > It will be important to know the offset where the match begins, as
> > well as where it ends (indeed it would be nice to have that info in
> > Perl5 without having to pay the C<length $&> performance penalty),
> > so in addition to C<pos>, there might be a function C<prepos> to
> > give the start of the match -- or C<pos> might return both end and
> > start offsets in a list context.
> 
> OK, that's very nice, but you say you don't want the $& penalty.
> I suspect from your discussion that you don't really understand that
> $& penalty.  There are two parts to the $& penalty.
> 
> The first part is that maintaining the information for $& has a cost.
> Maintaining this information for your prepos() function is going to
> incur an identical cost.

Why?  Surely the cost of maintaining the info for prepos would be the
same as the cost for maintaining the info for pos, which is maintained
by every match anyway.  All you want is the offset into the target
where the match started, not the contents of the match, and thus there
is no need to copy any part of the target string.

In another post in this thread, mike mulligan said much the same thing,
and you found the fact that he questioned your insistence that the
regexp engine should have to act this way "bizarre" -- but I too do
not understand how it is necessary that some future regexp code should
not be able to tell you the offset where the match began as well as
where it ended without incurring massive penalties.

> 
> The other part of the $& penalty is because $& itself is a global
> variable, the penalty has to be paid by every regex in the program.
> This is not a problem with the information in $&; it is a problem with
> the interface to the information.  If the interface were different, $&
> would not be a problem.  For example, if $& were only set on regexes
> with a /k modifier, as proposed in RFC158, a lot of the pain of $&
> would go away.
> 
> Now if something like RFC158 were adopted, then your rationale for
> prepos() would go away, because length($&) would no longer be
> particularly expensive.  At least, there would be no reason to suppose
> it would be more expensive than your proposal.

Yes.  I will refer to RFC158 in the next version of RFC72.

> 
> However, a prepos() function had exactly the same problem as $&
> presently has.  Whenever Perl did a regex match on any regex in the
> entire program, it would have no way of knowing whether prepos() might
> be called much later, so the cost of computing and storing the
> prepos() information would be incurred.    
> 
> Rather than evading the $& problem, as you suggest, introducing
> prepos() is going to make it even worse.

I do not have my copy of Friedl's book handy, and I cannot remember the
precise reasons he gave for why the $& peanalty exists, except for a
vague recollection that it had to do with copying part of the target
string, so you may well be right, but I do not see why.  All I am asking
for is an interface to the offset where the present match started.  If
the regexp engine is to be redesigned, then surely (?) this could be
provided with no performance cost.  

> 
> You can evade this problem by making prepos() lexically scoped.  For
> example, prepos() information is only computed for regexes that have
> the /q modifier on the end, or is only available inside the scope of a
> 'use prepos' declaration.  Either of these would fix this problem.
> 
> > I have no idea whether this feature will help people parsing right-to-left
> > languages; it seems likely to help with bi-directional texts (see RFC 50).
> 
> I was wondering that myself, but I don't think it will, because RTL
> text is not encoded backwards in the string itself.  It only *prints*
> right-to-left.  But I may be mistaken, and I think you should consult
> with Roman Parparov on this point before submitting the next revision
> of this RFC.

I think I will just delete the suggestion.

> 
> Finally, some general comments: First, it seems to me that if there
> were simply a better interface to pos() and to length($&), the need
> for this feature would go away.  Let's suppose that there *was* a
> list-context pos() function like the one you propose.
> 
> Then you can get the effect of /FOO(?r)BAR/ this way:
> 
>         while ($s =~ /FOO/g) {
>           my ($start, $end) = pos($s);
>           if (substr($s, 0, $start) =~ /RAB$/) {
>             # /FOO(?r)BAR/ would have been true
>           }
>         }
> 
> Note that this works for arbitrary patterns FOO and BAR, not just for
> fixed strings.

Yes indeed. Provided that one can get the prepos info, this code
accomplishes exactly what I am trying to do.  So does that make my
proposal redundant?  Perhaps.  But in addition to making redundant the
addition of variable-length lookbehind, you have also made redundant the
present inclusion of fixed-length lookbehind.  So should we remove that?
Should we remove all rexexp extensions whose functionality can be
emulated in Perl code?  If that is the razor we use, we'll have bloody
few of the present features left.

In the application I have written that makes extensive use of
lookbehind, regexps are generated on the fly from user input.  So in
order to implement your solution, I will not only have to generate the
regexps, I will also have to generate Perl code on the fly for each
search term, and then eval that.  If there is a way of implementing
variable-length lookbehind in the regexp engine without slowing the
whole thing down, would this not be a natural extension of present
regexp functionality?

> 
> Second, you make some argument (which I didn't quite follow) about
> optimization and the speed of the regex engine.  This is very shaky
> ground.  

Yes, you are quite right.  It was foolish of me to speak of the
performance of vapourware.  And yet ...

> That is like saying that when move your Grandma moves house,
> you are going to transport her belongings in a race car, to take
> advantage of its speed.  But by the time you finish loading Grandma's
> paraphernalia onto the race car, it is no longer fast.  The regex
> engine is very highly optimized, and at best (?r) and (?f) are likely
> to defeat many of the optimizations that make the regex engine fast to
> begin with.  (/i does this also.)   And at worst, making the regex
> engine general enough to support this feature might make it much
> slower even for patterns that don't use the feature.  The inner loop
> of the regex has a pointer to the current character position in the
> string.  The inner loop is full of s++ expressions that advance the
> pointer one character at a time.  If the engine has to match backwards
> also, these s++'es are all going to have to change to s+=d's or (d ?
> s++ : s--)'es or some such.  

 ... you go on to criticize the expected performance of an
unimplemented feature in an unwritten regexp engine on the basis of the
current code.  Will that inner loop with its s++ expressions look the
same in its next incarnation?  I was under the impression that one of
the goals of Perl6 was to include Unicode support at a fundamental
level.  If so, will the regexp engine still be optimized for 8-bit text?
Perhaps s++ will have to change into s+=d anyway, in which case will it
matter if d is a negative integer?  Will UTF-8 require even more
complexity in order to get the next char?

As to your contention that "at best" (?r) will defeat many present
optimizations, can you tell me why this will necessarily be so in the
new engine?  I freely admit that there may be gremlins that I have not
thought of, and that some of these may be insurmountable, but is Perl6
not meant to be a rewrite of Perl?  The reason I made this proposal now
is that it is not the sort of thing that could be easily hacked on to
existing code.  It may be, however, that if those writing the new engine
keep in mind the desireability of running backwards along the target
string, they may find that accommodation can be made for this without
sacrificing performance in normal cases.

> 
> I am not saying that it could not be done, but if I were you I would
> be very reluctant to make an argument from performance without talking
> to someone with some expertise first.  

It was precisely to get the feedback of someone with expertise that I
submitted the RFC.

On a more general note, I thought that it went without saying that *all*
of the RFC proposals come with an implicit admission that if they prove
to be too difficult to implement or too costly in terms of performance,
then they go out the window.  

If someone provides a clear reason why this proposal would have to
hurt regexp performance for normal cases in a rewritten regexp engine,
then I will withdraw it at once.  But perhaps you or Mr. Marb Groddinus
can tell me which would be less bad:

1) To pass on suggestions to those writing Perl6 which turn out to
be impractical to implement and in the end have to be removed,

or

2) To remove Perl6 features from consideration on the basis of the
(possibly erroneous) assumption that they will necessarily hurt overall
performance.

> Perhaps Hugo van der Sanden
> would be willing to discuss this with you in more detail?

I am not acquainted with the gentleman you name.  Please do solicit
the input of others you know who might be interested.

> 
> I hope you find these remarks helpful.

Yes, very much so.  I will remove the stuff about performance from the
RFC, except insofar as it is necessary to explain the potential
usefulness of (?f).  I will try to make it much clearer that what I am
aiming for is a clean, clear and implementable solution to the problem
of variable-length lookbehind, and I will make the examples reflect that.

Thanks for the feedback,

Peter
Re: RFC 72 (v1) The regexp engine should go backward as well as forward.

Reply via email to