Re: [PHP-DEV] Removal of unicode_semantics

Andrei Zmievski Wed, 07 May 2008 09:48:57 -0700

Tomas Kuliavas wrote:

If I remain silent, others will have arguments that "everybody agrees on
removal of unicode_semantics".


I write and maintain charset decoding and encoding functions.
unicode_semantics breaks every mapping table and other functions that
operate with binary 8bit strings.

Just curious, do these decoding/encoding functions do something that Unicode support won't do?

In slides by Andrei Zmievski Unicode symbols are written with \u. Why are
they written with \x(hex) and \(octal) in current PHP6?

\x and \(octal) inside Unicode strings are assumed to specify Unicode characters. This is one of the contention points, since a few people have said that they should specify individual bytes rather than characters, but in my opinion it's kind of dangerous since it may lead to broken/invalid Unicode strings.

---
<?php
echo "\xC3\200";
---
I am not writing U+00C3 and U+0080, I am writing U+00C0 in UTF-8.


This should work fine inside binary strings.
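
To make the byte-level reading concrete, here is a minimal sketch. It assumes a released PHP version (7.0 or later), where double-quoted strings are plain byte strings and each \x or octal escape produces exactly one byte; it does not reproduce PHP 6 behaviour under unicode_semantics. The variable names are illustrative only.

```php
<?php
// "\xC3" is the byte 0xC3 and "\200" is octal 200 = 0x80: two bytes in total.
$bytes = "\xC3\200";

// Those two bytes are the UTF-8 encoding of U+00C0.
// The \u{...} escape (PHP 7.0+) emits the UTF-8 bytes for a code point.
$codepoint = "\u{00C0}";

var_dump(strlen($bytes));         // int(2): byte count
var_dump(bin2hex($bytes));        // string(4) "c380"
var_dump($bytes === $codepoint);  // bool(true): same byte sequence
```

Read as bytes, the escapes spell one UTF-8 encoded character; read as code points, the same source text would instead name U+00C3 followed by U+0080.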

I can bypass it by adding one line to every script that operates with
binary strings, but what guarantees are there that you won't dump declare()
support just like you are dumping unicode_semantics?

It won't get dumped. unicode_semantics is a BC/transition switch. declare() is crucial to proper script parsing.

What happens to your new
Unicode-aware string functions if I lie about a string's charset to the PHP
interpreter?


You will get in trouble.

mb_strlen can't calculate the correct $string length even when I
set the correct charset in the mb_strlen() arguments. If the above code works as I
want in PHP6 with unicode_semantics=on, mb_strlen($string,'utf-8') returns 2
and not 1.

I don't know what mbstring does or does not do with the unicode_semantics switch, since it's meant to be deprecated.
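
The difference between the two counts can be sketched as follows. This assumes a released PHP version (7.0 or later) with the mbstring extension loaded, and uses \u{...} escapes to stand in for the "escapes are code points" reading; it is an illustration of the two interpretations, not a reproduction of PHP 6 with unicode_semantics=on. Variable names are hypothetical.

```php
<?php
// Byte reading: 0xC3 0x80, which is the UTF-8 form of the single character U+00C0.
$asBytes = "\xC3\200";

// Code point reading: U+00C3 followed by U+0080, written with \u{...} (PHP 7.0+).
$asCodePoints = "\u{00C3}\u{0080}";

var_dump(mb_strlen($asBytes, 'UTF-8'));       // int(1)
var_dump(mb_strlen($asCodePoints, 'UTF-8'));  // int(2)
var_dump(strlen($asCodePoints));              // int(4): each code point takes two UTF-8 bytes
```

A result of 2 is therefore what one would expect if the escapes were taken as two separate characters rather than as two bytes of one character.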


-Andrei

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
