This and other RFCs are available on the web at http://dev.perl.org/rfc/ =head1 TITLE The regexp engine should go backward as well as forward. =head1 VERSION Maintainer: Peter Heslin <[EMAIL PROTECTED]> Date: 9 Aug 2000 Version: 1 Mailing List: [EMAIL PROTECTED] Number: 72 =head1 ABSTRACT It is proposed that the regular expression engine should be designed so that, when it is in an accepting state, it will either consume the character after the end of the currently matching part of the target string, or the character just before its beginning, depending on instructions embedded in the regexp itself. =head1 DESCRIPTION The way humans scan text for a pattern is often to look for a distinctive feature of the pattern sought, and then to look at the surrounding context before and after. The Perl regular expression engine is optimized to look such fixed string landmarks in this way, but if you want to write a hand-optimized regexp that will explicitly look backward from a certain match in this fashion, your lookbehind is limited to fixed-length expressions (at least in my Perl 5.005, and I assume this true for 5.6, too). A more general approach than lookbehind might be useful. I propose two new regexp extensions which mean "jump and reverse" and "jump and go forward". These might be (?r) and (?f) or (?[) and (?]) respectively, or some other pair. It is important to note that this proposal does not suggest that the same characters in a string should be matched twice or more in a single hit by different parts of the same regexp going in different directions; that way madness lies. The "jump" part of these commands means to jump immediately to the beginning or the end of the currently matched part of the string and continue from there. In this way, the match continues to grow one character at a time, but at both ends. =head2 Usefulness The Perl regexp engine is excellent at finding fixed strings embedded in your regexp and starting from there, but, as Friedl says in his book, it is generally more efficient if you can guide a regular expression engine yourself based upon your knowledge and assumptions about the potential input. Imagine a very long input string containing data such as this: ... GCAAGAATTGAACTGTAG ... If you want to match text that includes the string GAAC, but only when it follows GAATT or any one of a large number of other different possibilities, you could (a) code the whole match, including all the possibile prefixes, into the regexp or (b) use GAAC as the match string and then select the particular hits you want by positive and negative lookbehind. For some large amounts of data and very complex matching requirements that come before any fixed string, Perl's optimizer can't do as well as a human and option (b) can (in my experience, anyway) be an order of magnitude faster. The problem is that with only fixed-length lookbehind available this is often not a practical option. With the proposed extension, you could write: m/GAAC(?r)(TTAAG| .... )/ and the regexp engine doesn't have to go looking deep into your regexp to know where it should start potential matches. As a frivolous illustration, the string ABCDEFGHIJKLM would be matched by: m/FG(?r)EDCB(?f)HIJK(?r)A^(?f)LM$/ =head2 Related Features It will be important to know the offset where the match begins, as well as where it ends (indeed it would be nice to have that info in Perl5 without having to pay the C<length $&> performance penalty), so in addition to C<pos>, there might be a function C<prepos> to give the start of the match -- or C<pos> might return both end and start offsets in a list context. It would be very nice if C<(?r)> would do something usefully optimized if it appeared at the start of a regexp: C</$(?r).*CBA/> would then start matching at the end of the string and proceed rapidly towards the front, until it hit "ABC". This could be particularly useful in conjunction with C<Backwards.pm>. It would be extra work, however, as not only would the regexp engine itself have to be modified to switch direction, but the code that "cheats" by looking for likely places to start the engine (which presumably uses a Boyer-Moore or similar algorithm) would have to run backward, too. I have no idea whether this feature will help people parsing right-to-left languages; it seems likely to help with bi-directional texts (see RFC 50). =head1 IMPLEMENTATION It may already be clear that the author of this RFC has no idea of how the Perl regexp engine is implemented. I have never even looked at the code of this or any other regexp engine, nor am I competent to do so. It may well be, therefore, that what I propose is completely infeasible. If so, and those who are actually going to code this stuff say so, I will promptly withdraw this RFC, which in the cirumstances may turn out to have been a Request For Crucifixion. Here are some potential implementation issues: =head2 Backtracking Presumably, the regexp engine maintains something like a stack which is popped when a prospective match fails and it has to fall back to an earlier possibility. If so, when we jump and switch direction, we will have to push an indication of this onto the stack, so that when it is popped again we know to jump and switch the other way when backtracking off a failed match. =head2 Backreferences The text captured in parentheses while (?r) is operative will be reversed with respect to its order in the target string. =head2 Multiple matching The C</g> modifier would behave just as it does currently: the next attempt begins the character after the end of the previous match. If the previous example of C</$(?r).*CBA/> were extended so that one might write C</(?r).*CBA/> (without the anchor at the start), then we would want in this case (i.e. when the regexp begins with (?r)) to make an exception to normal behavior and have the next match begin the character before the beginning of the previous one. =head1 REFERENCES _Mastering Regular Expressions_, Jeffrey Friedl (O'Reilly) RFC 50 (BiDirectional Support in PERL)