Re: [PHP-DEV] [VOTE][RFC] Unicode Codepoint Escape Syntax

mario Mon, 08 Dec 2014 23:08:23 -0800

Tue, 9 Dec 2014 02:44:33 +0000 Andrea Faulds <a...@ajf.me>:
>
> Well, PCRE does what it does probably because of its name:
> *Perl-Compatible* Regular Expressions. Perl has the \x syntax. But
> PCRE’s syntax comes from what suits Perl, not PHP, so I don’t see why
> we should necessarily match its behaviour. If we add \x{xxxxx} syntax
> to PHP’s string literals, then we’ll break existing code which uses
> double quoted strings for regular expressions.


Actually the opposite seems alarming. For double-quoted strings it'd be
irrelevant whether \u{} or \x{} was handled by PHP first or left to PCRE's
interpretation. (Having an alternative there is even beneficial.)

However, single-quoted strings are more commonly and habitually used for
regexps. Once \u{} is in regular use, it will end up in regex context
unknowingly or accidentally, and that is where it would trigger PCRE failures.

    preg_match('~\u{bad}~umixUs', $subject);

Both \u{} and \x{} are used in fringe cases only of course. Consistently
settling on one would still benefit forward compatibility here.
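
To illustrate the difference, here is a rough sketch. It assumes a PHP
build that already has the RFC's \u{} escape (PHP 7.0 or later) and PCRE
with UTF-8 support; $wave and the other variable names are made up for
the demo.

```php
<?php
// Sketch only: assumes PHP 7.0+ (for the "\u{...}" escape) and PCRE with UTF-8 support.
$wave = "\xF0\x9F\x91\x8B";  // U+1F44B spelled out as raw UTF-8 bytes

// Single quotes: PHP passes \x{...} through untouched and PCRE decodes it.
$viaX = preg_match('~\x{1F44B}~u', $wave);

// Single quotes: PCRE receives a literal \u, which it does not support,
// so the pattern fails to compile. The @ only silences the warning here.
$viaU = @preg_match('~\u{1F44B}~u', $wave);

// Double quotes: PHP expands \u{...} itself; PCRE only ever sees the character.
$viaPhp = preg_match("~\u{1F44B}~u", $wave);

var_dump($viaX, $viaU, $viaPhp);
```

I'd expect the first and third call to match and the second to return
false, which is exactly the single-quoted trap described above.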

>
> I think \x{xxxx} is misleading anyway - \xXX is always
> single-byte/character, yet Unicode code points can’t be represented
> in PHP strings as single bytes when encoded in UTF-8 (unless they’re
> below U+0100, of course). If I saw "\x{abcd}” I'd expect it to be the
> same as "\xab\xbc”. Plus, while Perl has \x{xxxx} syntax, Ruby and
> ECMAScript 6 have the \u{xxxx} syntax, so \u{xxxx} is already more
> popular. The ‘u’ in \u{xxxx} also makes it more obviously “Unicode”.

There's no question really about \u being more common and therefore
recognizable and preferable. Taking the cue from Ruby is appreciated!

However, since the RFC rightly discounts the standard \uFFFF for
compatibility reasons, there's little visual and semantic distinction between
the {}-embellished variant \u{hhhhh} and a hypothetical \x{hhhhh}.

Not sure why or who would misinterpret \x{abc} as multiple bytes, really.
It's well understood and working for PCRE. The advantage of overloading
\x is a much lower likelihood of ever encountering a residual "\x{" in
PHP strings.
Whereas "\u" is new, and never had an implicit payload constraint, thus
could run into a preexisting "\u{xxxx}" that was formerly targeted at a
later/distinct context.
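
A small sketch of what each spelling turns into, again assuming PHP 7.0
or later with the RFC's escape in place (variable names are illustrative
only):

```php
<?php
// Sketch only: assumes PHP 7.0+, where the RFC's \u{...} escape exists.
$braced   = "\u{41}";   // expanded by PHP to "A"
$classic  = "\u0041";   // no braces: left untouched, six literal bytes
$hexBrace = "\x{41}";   // "{" is not a hex digit, so this is left untouched too
$single   = '\u{41}';   // single quotes do not expand this escape

var_dump($braced, $classic, $hexBrace, $single);

// A later context such as JSON still understands the classic form.
var_dump(json_decode('"\u0041"'));
```

Only the braced form in double quotes changes meaning; a preexisting
"\u{...}" meant for some later consumer would therefore be consumed by
PHP first.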

Going with the Ruby theme; when piping a string there or receiving one
it's irrelevant who uses which syntax to preinterpret it. It's only
really interesting when exchanging string literals.
But the RFC and the patch don't cover stripcslashes() or addcslashes()
for instance.
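
For what it's worth, a quick sketch of how the C-style helpers treat such
strings, as far as I understand their current behaviour (PHP 7.0 or later
assumed; variable names are illustrative only):

```php
<?php
// Sketch only: behaviour of the C-style helpers as I understand it.
$escaped = '\u{41}';                  // single quotes: a literal backslash-u sequence
$decoded = stripcslashes($escaped);   // the backslash is dropped, nothing is decoded

$wave      = "\xF0\x9F\x91\x8B";      // U+1F44B as raw UTF-8 bytes
$cEscaped  = addcslashes($wave, "\x80..\xFF");  // byte-wise octal escapes, no \u{...}
$roundTrip = stripcslashes($cEscaped);

var_dump($decoded, $cEscaped, $roundTrip === $wave);
```

The pair round-trips bytes among themselves, but neither one produces or
understands the \u{} form.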

So there's no direct string-syntax interoperability planned.
Which is why I brought forward \x{hhhhh} as an alternative, for within-PHP
consistency at least.
(Not bent on lobbying for x, as \u{…} is visually more pleasing; just
unsure about its scope.)

\u{1F44B}

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
