=head1 TITLE Regex assertions in plain Perl code =head1 VERSION Maintainer: Bart Lateur <[EMAIL PROTECTED]> Date: 28 Sep 2000 Mailing List: <[EMAIL PROTECTED]> Number: 348 Version: 2 Status: Developing (candidate for freeze) =head1 ABSTRACT Likely the most justifiable reason to want to be able execute embedded Perl code while trying to match a regex pattern, is to include some extra tests on the data just matched. The fact that the current implementation of the "experimental" /(?{...})/ construct, doesn't do just that, feels like a design mistake. So the proposal is to drop the current implementation of (?{...}) in favour of code assertions. =head1 CHANGES =over 2 =item * Removed stress on deleting "local" feature =item * Added examples on why other "advanced" features are or aren't indispensable =back =head1 DESCRIPTION Likely the most justifiable to want to be able to execute Perl code in a regex, is to have additional checks, to see if what you just matched is indeed acceptable. Quite a few prominent Perl people have expressed having been unpleasantly surprised when they found out that the current implementation of /?{...}/ doesn't do that. Assertions are already familiar to people writing regexes: for example, /\b/, lookahead and lookbehind, and anchors, are all assertions, but only in matching subpatterns, not in code. Adding the option to do similar tests in code, seems like a powerful addition, while maintaining the basic spirit of regexes. The main problem with the current implementation of (?{...}) is that it "always succeeds". In case of an assertion, it is the outcome of the execution of the embedded code, that decides if the test succeeds (return value is true), meaning it is safe to continue; or fails (return value is false), in which case the regex engine should abort exploring this branch, and immediately backtrack. =head2 example RFC 197 proposes a specific syntactic addition to regexes, just to check if a number is in a specific range. This is just one of the many things that could easily achieved using an assertion: /(\d+)(?{$1<256})/ # proposed syntax, recycling Perl5's syntax If your string = "1234", the subpattern /(\d+)/ initially matches all digits, stuffs them in $1, and calls the assertion code. This code additionally if what was matched is within range, in this case, less than 256. If this fails, the submatch fails. This feature might result in matching some unexpected substrings: both "123 and "234" are acceptable matches according to this assertion. In order only to match what you really want to match, you may have to add some anchors, lookahead and/or lookbehind. So it will take some getting used to. This does not make it less valuable. =head2 Getting rid of the current (?{...}) construct The current implementation for embedding code in regexes, is not aimed at assertions. Instead, it's only useable for its "effect at a distance", which is simply horrible. Basically, it simply provides a means to pass data around between various parts of the regex. This is a unhealty situation, since it makes regexes that use it, look incredibly obfuscated, and it requires deep knowledge on how a match in a regex proceeds. In addition, it promotes changing global data structures. Since the code is executed everytime something promising is matched, and not just after a complete match, the code will probably be executed more often than desired. That is the reason why Perl5 has built-in support to I<temporarily> modify global data structures, so that the effect can automatically be undone when backtracking. This makes the implementation very tricky. I wouldn't be surprised if precisely this feature is the main reason why the current implementation is so notoriously unstable. Richard Proctor even wants to go further still: in his RFC 274, "Generalised Additions to Regexs", he proposes to add support for executing some embedded Perl code only while backtracking, specifically to undo changes by hand. I think that this would make the situation even worse than it is today. Even knowing when and why the embedded code is executed, is far from obvious. Take this example: $_ = "SKIP buzzer"; if(/(?{print "Testing\n";})([a-z])\1(?{print "Got a match: $1\n"})/) { print "YES\n"; } else { print "NO\n"; } This prints: Testing Testing Testing Got a match: z YES The fact that the embedded code is called 3 times, not more, surely suprised me. It probably will surprise many people. Apparently, it is only executed once for every lowercase letter, not just for any character. This inpredictability is yet another reason to discourage incrementally modify global data structures. =head2 enter assertions The above considerations, which are annoying at least for executing embedded generic code, are not hindrances at all for processing assertions. In spirit, assertions are intended to have a local effect only, which is only used immediately to direct processing of a match. All it is supposed to do, is optionally veto a branch. It won't have an effect on some other bizarre location in the regex. There isn't a secret variable where the result is stored, to be used elsewhere. You're not supposed to change global data structures. Therefore, we don't really need this localization. You basically don't need to undo anything. If you insist on modifying global data, you still can; only: you're on your own. =head2 assertions in Perl5 The current implementation of the /(?{code})/ construct makes simple assertion tests using embedded code hard, but not impossible. As MJD has pointed out: "It is not pretty". =over 2 =item assertion (?{COND}) =item assertion in Perl5 (?(?{not COND})(?!)) (?(?{not do { COND }})(?!)) (?(?{COND})|(?!)) =back This makes assertion look like a magic incantation, not something trivial, as it should have been. =head1 OTHER EXPERIMENTAL ADVANCED FEATURES =head2 local In the spirit of assertions, C<local> is not necessary. It might just as well be removed. I think it would benefit ease of iumplementation quite a bit, as I think it is likely the most complex part of it. The behaviour for C<local>, in the current embedded code, is rather unintuitive. In the Perl5 code /(?{local $c = $c + 1})/ it looks as if the scope for the localized variable should be scoped to the embedded code block. It is not. Instead, the value is maintained, permanently, unless the regex backtracks, in which case a previous value, from the part of the regex just in front of the backtracking, is restored. Are you still with me? I thought not. =head2 /(?(condition)yes-pattern|no-pattern)/ This thing becomes largely unnecessary. First of all, its current main purpose, I think, is emulating assertions. Second, there is nothing you can do with it, that you can't do without it. For example, this regex: /$pre(?(?=$lookahead)($pattern))$post/ behaves exactly like: /$pre((?!$lookahead)|($pattern))$post/ i.e. if the lookahead succeeds, the pattern needs to be matched, otherwise it is skipped. In the latter, it is: if the negative lookahaed succeeds, nothing needs to be matched, otherwise the pattern must be matched. The IF/ELSE case is slightly more complex. This: /$pre(?(?=$lookahead)($if)|($else))$post/ can become: /$pre((?=$lookahead)($if)|($?!$lookahead)($else))$post/ I see no reason to hang on to this feature. =head2 /(??{ code })/ I do not wish to express an opinion on whether this one should stay. Its main purpose is to allow matching of nested structures, i.e. a grammar instead of a regex. In that, it is not related to assertions in any way. =head1 IMPLEMENTATION Replace the old implementation of (?{...}). Optionally, throw out all the "localizing" code. Do not set any special variables. The new feature should: =over 2 =item * have a zero width impact on matching in the regex =item * either succeed of fail, depending on the returned value as a boolean scalar: true means it's ok to continue, false means a veto, causing the regex to immediately backtrack. =back =head1 MIGRATION I have been told that people in general do not use the current implementation of (?{...}) in production code, because it tends to dump core. Also, perlre calls this feature "highly experimental, and [it] may be changed or deleted without notice". You had been warned. ;-) You can assign any value you want to $^R or any other variable, if you wish, and make the assertion succeed by adding a constant true value at the end of the code. This can be done automatically: =over 2 =item perl5 (?{ CODE }) =item assertion (?{$^R = do { CODE }; 1}) =back This may require an extra "local", if that feature is to be retained. =head2 /(?(?{COND})yes-pattern|no-pattern)/ If conditional regexes are to stay, oddly enough, then for such a regex where the condition is an assertion in Perl code, the code doesn't need to change at all. Indeed, this is one case of where C<(?{...})> currently indeed behaves as an assertion; $^R doesn't get set, either. In case this feature is deleted, see possible substitutions in the discussion above. =head1 REFERENCES thread on perl-5-porters, started by Jeffrey Friedl: "how to abort a match branch?" <http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-08/msg00252.html> <http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-08/msg00268.html> relevant messages from Mark-Jason Dominus: <http://www.mail-archive.com/perl6-language-regex@perl.org/msg00102.html> <http://www.mail-archive.com/perl6-language-regex@perl.org/msg00262.html> <http://www.mail-archive.com/perl6-language-regex@perl.org/msg00306.html> It looks like in the first message, MJD mistook the current (?{...}) for an assertion. In a later message, he no longer made this mistake, and introduced the "unpretty" syntax for doing assertions in Perl5. RFC 274: Generalised Additions to Regexs (see: the suggestion to add even more support for localization and unwinding) RFC 308: Ban Perl hooks into regexes perlre man page, sections on /(?{ code })/ and on /(?(condition)yes-pattern|no-pattern)/ and /(?(condition)yes-pattern)/ -- Bart.