Tomas Kuliavas wrote:
If I remain silent, others will argue that "everybody agrees on the
removal of unicode_semantics".
I write and maintain charset decoding and encoding functions.
unicode_semantics breaks every mapping table and other functions that
operate with binary 8bit strings.
Just curious, do these decoding/encoding functions do something that
Unicode support won't do?
In the slides by Andrei Zmievski, Unicode symbols are written with \u.
Why are they written with \x(hex) and \(octal) in current PHP6?
\x and \(octal) inside Unicode strings are assumed to specify Unicode
characters. This is one of the contention points, since a few people
have said that they should specify individual bytes rather than
characters, but in my opinion it's kind of dangerous since it may lead
to broken/invalid Unicode strings.
---
<?php
echo "\xC3\200";
---
I am not writing U+00C3 and U+0080, I am writing U+00C0 in UTF-8.
This should work fine inside binary strings.
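
As a minimal sketch of the byte-string reading, assuming an interpreter
with byte-string semantics (PHP 5 or any later release, not a PHP6 build
with unicode_semantics=on) and the mbstring extension loaded: the two
escapes yield two bytes, and those two bytes decode as a single UTF-8
character. The variable names are illustrative only.

```php
<?php
// Assumes byte-string semantics and the mbstring extension.
$s = "\xC3\200";                  // hex escape 0xC3, octal escape 0200 (= 0x80)

$bytes = strlen($s);              // counts bytes
$hex   = bin2hex($s);             // raw byte values as hex
$chars = mb_strlen($s, 'UTF-8');  // counts UTF-8 characters

echo "$bytes $hex $chars\n";      // prints "2 c380 1"
```

The bytes C3 80 are the UTF-8 encoding of U+00C0, which is why the
character count is 1 when the string is treated as binary data.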
I can bypass it by adding one line to every script that operates on
binary strings, but where is the guarantee that you won't drop declare()
support just as you are dropping unicode_semantics?
It won't get dumped. Unicode_semantics is a BC/transition switch.
declare() is crucial to proper script parsing.
What happens to your new
Unicode-aware string functions if I lie to the PHP interpreter about a
string's charset?
You will get in trouble.
mb_strlen can't calculate the correct $string length even when I
set the correct charset in the mb_strlen() arguments. If the above code
worked as I want in PHP6 with unicode_semantics=on,
mb_strlen($string,'utf-8') would return 1; instead it returns 2.
I don't know what mbstring does or does not do with the
unicode_semantics switch, since it's meant to be deprecated.
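
A result of 2 is what one would expect if each escape were taken as a
code point (U+00C3 and U+0080) rather than as a byte. The sketch below
does not run PHP6 itself; it only imitates that reading on a byte-string
interpreter, under the assumption that mbstring is loaded, by treating
the two bytes as ISO-8859-1 and converting them to UTF-8.

```php
<?php
// Imitation only: not actual PHP6 unicode_semantics behaviour.
$raw = "\xC3\x80";                // same two bytes as "\xC3\200"

// ISO-8859-1 maps each byte value to the code point of the same number,
// so this yields U+00C3 followed by U+0080, encoded as UTF-8.
$asCodePoints = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

$hex = bin2hex($asCodePoints);              // bytes of the converted string
$len = mb_strlen($asCodePoints, 'UTF-8');   // characters in the converted string

echo "$hex $len\n";               // prints "c383c280 2"
```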
-Andrei
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php