This and other RFCs are available on the web at
http://dev.perl.org/rfc/
=head1 TITLE
Regex assertions in plain Perl code
=head1 VERSION
Maintainer: Bart Lateur <[EMAIL PROTECTED]>
Date: 28 Sep 2000
Mailing List: [EMAIL PROTECTED]
Number: 348
Version: 1
Status: Developing
=head1 ABSTRACT
Probably the most justifiable reason to want to execute embedded Perl
code while matching a regex pattern is to run some extra tests on the
data just matched. The fact that the current implementation of the
"experimental" /(?{...})/ construct doesn't do just that feels like a
design mistake. So the proposal is to drop the current implementation of
(?{...}) in favour of code assertions.
=head1 DESCRIPTION
Probably the most justifiable reason to want to execute Perl code in a
regex is to perform additional checks, to see if what you just matched
is indeed acceptable. Quite a few prominent Perl people have said they
were unpleasantly surprised when they found out that the current
implementation of /(?{...})/ doesn't do that.
Assertions are already familiar to people writing regexes: for example,
/\b/, lookahead and lookbehind, and anchors are all assertions. They
are, however, only available as matching subpatterns, not as code. Adding
the option to do similar tests in code seems like a powerful addition
that maintains the basic spirit of regexes.
The main problem with the current implementation of (?{...}) is that it
"always succeeds". With an assertion, it is the outcome of executing the
embedded code that decides whether the test succeeds (the return value
is true), meaning it is safe to continue, or fails (the return value is
false), in which case the regex engine should stop exploring this branch
and immediately backtrack.
=head2 example
RFC 197 proposes a specific syntactic addition to regexes, just to check
if a number is in a specific range. This is just one of the many things
that could easily be achieved using an assertion:
    /(\d+)(?{$1<256})/    # proposed syntax, recycling Perl5's syntax
If your string is "1234", the subpattern /(\d+)/ initially matches all
digits, stuffs them in $1, and calls the assertion code. This code
additionally checks whether what was matched is within range, in this
case less than 256. If this check fails, the submatch fails and the
engine backtracks.
This feature might result in matching some unexpected substrings: both
"123" and "234" are acceptable matches according to this assertion. To
match only what you really want to match, you may have to add some
anchors, lookahead and/or lookbehind. So it will take some getting used
to. This does not make it less valuable.
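The behaviour described above can be tried out today with a rough
sketch. Since the proposed assertion syntax does not exist, the sketch
assumes the Perl5 emulation via a conditional and (?!) that is described
later in this RFC. The function names are hypothetical, chosen only for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the RFC): return the digits accepted by
# an emulated "less than 256" assertion, or undef if nothing matches.
# (?(?{ COND })(?!)) fails the current branch whenever COND is true, so the
# condition is the negation of the test we want to assert.
sub match_below_256 {
    my ($string) = @_;
    return $string =~ /(\d+)(?(?{ $1 >= 256 })(?!))/ ? $1 : undef;
}

# Same check, but with lookbehind and lookahead on both sides, so that only
# a complete run of digits can match.
sub match_whole_number_below_256 {
    my ($string) = @_;
    return $string =~ /(?<!\d)(\d+)(?!\d)(?(?{ $1 >= 256 })(?!))/ ? $1 : undef;
}

my $loose  = match_below_256("1234");
my $strict = match_whole_number_below_256("1234");

print "unanchored: $loose\n";                                   # prints "unanchored: 123"
print "anchored: ", (defined $strict ? $strict : "no match"), "\n";  # prints "anchored: no match"
```

Without the extra lookaround, the engine simply gives back one digit and
accepts "123", which is the "unexpected substring" effect mentioned above.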
=head2 Getting rid of the current (?{...}) construct
The current implementation for embedding code in regexes is not aimed at
assertions. Instead, it's only usable for its "effect at a distance",
which is simply horrible. Basically, it just provides a means to pass
data around between various parts of the regex. This is an unhealthy
situation, since it makes regexes that use it look incredibly
obfuscated, and it requires deep knowledge of how a match in a regex
proceeds.
In addition, it promotes changing global data structures. Since the code
is executed every time something promising is matched, and not just
after a complete match, the code will probably be executed more often
than desired.
That is the reason why Perl5 has built-in support to I<temporarily>
modify global data structures, so that the effect can automatically be
undone when backtracking. This makes the implementation very tricky. I
wouldn't be surprised if precisely this feature is the main reason why
the current implementation is so notoriously unstable.
Richard Proctor even wants to go further still: in his RFC 274,
"Generalised Additions to Regexs", he proposes to add support for
executing some embedded Perl code only while backtracking, specifically
to undo changes by hand. I think that this would make the situation even
worse than it is today.
Even knowing when and why the embedded code is executed is far from
obvious. Take this example:
    $_ = "SKIP buzzer";
    if (/(?{print "Testing\n";})([a-z])\1(?{print "Got a match: $1\n"})/) {
        print "YES\n";
    } else {
        print "NO\n";
    }
This prints:
    Testing
    Testing
    Testing
    Got a match: z
    YES
The fact that the embedded code is called 3 times, not more, surely
surprised me. It will probably surprise many people. Apparently, it is
only executed once for every lowercase letter, not for every character.
=head2 enter assertions
The above considerations, which are annoying at the least when executing
generic embedded code, are no hindrance at all when processing
assertions.
In spirit, an assertion is intended to have a local effect only, which is
used immediately to direct the processing of a match. All it is supposed
to do is optionally veto a branch. It won't have an effect on some other
bizarre location in the regex. There isn't a secret variable where the
result is stored, to be used elsewhere. You're not supposed to change
global data structures and expect the change not to be permanent.
Therefore, we don't need this localization. You basically don't need to
undo anything. If you insist on modifying global data, you still can,
but you're on your own.
=head2 assertions in Perl5
The current implementation of the /(?{code})/ construct makes simple
assertion tests using embedded code hard, but not impossible. As MJD has
pointed out: "It is not pretty".
=over 2
=item assertion
    (?{COND})
=item assertion in Perl5
    (?(?{not COND})(?!))
    (?(?{not do { COND }})(?!))
=back
This makes an assertion look like a magic incantation, not something
that should be trivial.
=head1 IMPLEMENTATION
Replace the old implementation of (?{...}). Throw out all the
"localizing" code. Do not set any special variables.
The new feature should:
=over 2
=item *
have a zero width impact on matching in the regex
=item *
either succeed or fail, depending on the returned value taken as a
boolean scalar: true means it's ok to continue, false means a veto,
causing the regex to immediately backtrack.
=back
Furthermore, I think that /(?(condition)yes-pattern|no-pattern)/ and
/(?(condition)yes-pattern)/, which are extremely rarely used, and
probably mostly to emulate assertions in Perl5, are far too obscure and
should be dropped as well. I do not wish to express an opinion on the
fate of (??{...}).
=head1 MIGRATION
I have been told that people in general do not use the current
implementation of (?{...}) in production code, because it tends to dump
core. Also, perlre calls this feature "highly experimental, and [it] may
be changed or deleted without notice". You have been warned. ;-)
First of all, the perl5 -> perl6 translator should warn when it
encounters regexes where the perl5 version of this feature has been
used.
Furthermore, you can assign any value you want to $^R or any other
variable, if you wish, and make the assertion succeed by adding a
constant true value at the end of the code. This can be done
automatically:
=over 2
=item perl5
    (?{ CODE })
=item assertion
    (?{$^R = do { CODE }; 1})
=back
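As a rough illustration of how such an automatic rewrite might look, here
is a minimal sketch. The function name and the approach are assumptions
made for this example, not part of any actual translator. It is
deliberately naive: it only recognises code blocks without nested braces,
and it would also rewrite the condition of a (?(?{ COND })...) construct,
which a real translator would have to treat separately.

```perl
use strict;
use warnings;

# Hypothetical translator step: wrap the body of each (?{ CODE }) so that
# its value is still assigned to $^R and the block always returns true.
# Only bodies without nested braces are handled. (??{ ... }) is left alone
# because it does not contain the character sequence "(?{".
sub wrap_code_blocks {
    my ($pattern) = @_;
    $pattern =~ s/\(\?\{([^{}]*)\}\)/'(?{$^R = do {' . $1 . '}; 1})'/ge;
    return $pattern;
}

# The pattern source is handled as plain text here; nothing is compiled.
print wrap_code_blocks('(\w+)(?{ length $1 })'), "\n";
# prints: (\w+)(?{$^R = do { length $1 }; 1})
```

Working on the pattern as text keeps the sketch short. Handling nested
braces, quoting and comments correctly would need a real parser.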
"local" inside embedded code will no longer be supported, nor will
conditional regexes. The Perl5 -> Perl6 translator should warn if it
ever encounters one of these.
=head1 REFERENCES
thread on perl-5-porters, started by Jeffrey Friedl: "how to abort a
match branch?"
<http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-08/msg00252.html>
<http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-08/msg00268.html>
relevant messages from Mark-Jason Dominus:
<http://www.mail-archive.com/perl6-language-regex@perl.org/msg00102.html>
<http://www.mail-archive.com/perl6-language-regex@perl.org/msg00262.html>
<http://www.mail-archive.com/perl6-language-regex@perl.org/msg00306.html>
It looks like in the first message, MJD mistook the current (?{...}) for
an assertion. In a later message, he no longer made this mistake, and
introduced the "unpretty" syntax for doing assertions in Perl5.
RFC 274: Generalised Additions to Regexs
(see: the suggestion to add even more support for localization and unwinding)
RFC 308: Ban Perl hooks into regexes
perlre man page, sections on /(?{ code })/ and on
/(?(condition)yes-pattern|no-pattern)/ and /(?(condition)yes-pattern)/