On Tue, Nov 25, 2014 at 2:18 PM, Andrea Faulds <a...@ajf.me> wrote:

>
> > On 25 Nov 2014, at 10:41, Dmitry Stogov <dmi...@zend.com> wrote:
> >
> > u8"string" tells that the whole string is UTF-8 encoded.
> > Your Unicode escape proposal assumes UTF-8 only for the escaped code
> > point, but the encoding of the whole string is still undefined.
>
> True. There’s an assumption there that you’re using a UTF-8-compatible
> source file. Actually, for other encodings, do we even guarantee that “\n”
> produces an ASCII LF just now? It certainly will on most Windows and Unix
> systems, but since we’re just using C’s ‘\n’ (
> http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_language_scanner.l#885), it
> might produce the newline character of some other encoding like EBCDIC in
> the right environment.
>
> > > If you're using other encodings, why do you want to use Unicode
> > > code points? Most Unicode code points will not be supported by another
> > > character set.
> >
> > Agreed, these Unicode escapes are not going to be used for anything
> > except UTF-8 encoded strings.
> > I'm not completely against it. It's just an incomplete solution.
> >
> > echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8
> >
> > echo "Привет \u{1F602}"; // won't output anything useful if script
> encoding is not UTF-8
> > The second problem is present even for European countries that use the
> > Windows-1250 code page.
> > echo "mañana \u{1F602}"; // won't output anything useful if script
> encoding is not UTF-8
> > Thanks. Dmitry.
>
> Yeah, that’s unfortunate. Although I don’t think there’s much we can do
> about it here. We can’t really convert, as most Unicode characters won’t
> be available in the code page you’re using.
>

If a character is not available in the code page, it's replaced with "?" or
something similar, but in your case we will get an unexpected UTF-8 sequence.
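
To make the difference concrete, here is a minimal sketch. It assumes the
mbstring extension is loaded and uses Windows-1251 as the script encoding;
the variable names are only for illustration. The bytes are written as hex
escapes so the example does not depend on how this file itself is saved.

```php
<?php
// Bytes of "Привет " as they would sit in a Windows-1251 source file.
$cp1251Text = "\xCF\xF0\xE8\xE2\xE5\xF2 ";

// UTF-8 bytes of U+1F602, i.e. what "\u{1F602}" would expand to
// under the proposal.
$emojiUtf8 = "\xF0\x9F\x98\x82";

// A real conversion substitutes characters the target code page lacks.
mb_substitute_character(0x3F); // make the substitute an explicit "?"
$converted = mb_convert_encoding($emojiUtf8, 'Windows-1251', 'UTF-8');

// The escape does no conversion: raw UTF-8 bytes are appended to
// Windows-1251 bytes, so the result is valid in neither encoding.
$mixed = $cp1251Text . $emojiUtf8;

var_dump($converted);                          // the substitute character
var_dump(mb_check_encoding($mixed, 'UTF-8'));  // bool(false)
var_dump(strlen($mixed));                      // int(11): 7 + 4 bytes
```

A Windows-1251 consumer would show those last four bytes as four unrelated
characters instead of a single "?".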


>
> Even if we did have Unicode strings like the fabled PHP6 would have had,
> you still have this problem when you’re outputting in non-Unicode encodings.
>

Right, but just for output we already have HTML entities

echo "&#x1f602;"; // HTML entities already work independently of encodings.

I know, it's not completely the same as "\u{1F602}", but "\u{...}" assumes
UTF-8 is used everywhere, and that's not true.

PHP6 was able to use Unicode escapes with any script encoding, because it
converted all the strings into some internal encoding anyway.
If we convert all strings from the script encoding into the same internal
encoding (e.g. UTF-8 or user defined), then "\u{...}" will really work.
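
A rough sketch of that idea in userland, assuming mbstring is available and
UTF-8 is the internal encoding. to_internal() is a hypothetical helper, not
an existing PHP function; in a real implementation the engine would do this
step while compiling the script.

```php
<?php
// Hypothetical helper: normalise a literal from the script encoding
// to the internal encoding (assumed here to be UTF-8).
function to_internal(string $bytes, string $scriptEncoding): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $scriptEncoding);
}

// "Привет " as Windows-1251 bytes, as a non-UTF-8 script would contain it.
$literal = "\xCF\xF0\xE8\xE2\xE5\xF2 ";

// UTF-8 bytes of U+1F602, standing in for the "\u{1F602}" escape.
$emoji = "\xF0\x9F\x98\x82";

// After normalisation both parts share one encoding, so the result is
// a well-formed UTF-8 string.
$result = to_internal($literal, 'Windows-1251') . $emoji;

var_dump(mb_check_encoding($result, 'UTF-8')); // bool(true)
var_dump(mb_strlen($result, 'UTF-8'));         // int(8): 6 letters, space, emoji
```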

Thanks. Dmitry.


>
> Although it’s worth noting that mbstring *should* handle this, since if
> you have an internal encoding of UTF-8 and an output encoding of, say,
> Windows-1250, you can use UTF-8 in your strings and it should convert
> that for you on output. How well this works in practice, however, I have
> no idea.
> --
> Andrea Faulds
> http://ajf.me/
>
