Richard Lynch wrote:
> On Tue, July 10, 2007 7:06 pm, Larry Garfield wrote:
>> If 90% of the strings in use would work fine if treated as Unicode,
>> then it would make sense to just always assume Unicode unless
>> explicitly specified otherwise.
> 
> If that 10% includes enough users who have written millions of lines of
> code in a self-consistent manner that voids ALL their work, you may
> want to re-think this 90% number you have chosen...
> 
> And of course you need 2 distinct data types for Unicode and strings.
> 
> What I don't understand is why you'd lock things down so that:
> 
> a) the default "string" is Unicode, breaking XX% of existing applications
> 
> b) the end user can't readily change a) in a huge percentage of
> existing install base (read: non-dedicated hosting or mixed-user
> servers with shared httpd.conf settings)
> 
> 
> I realize it's far too late by now to do anything about it, most
> likely, but why in the world didn't you just choose a new keyword to
> define/declare a string as Unicode?
> 
> And did I dream the thread on this way back when where it was stated
> that Unicode was backwards-compatible, so this wouldn't be a problem?
> 
> Yet now it seems that UTF-16 is *not* backwards-compatible, and this
> seems like a pretty big problem to me.

Richard, you are rather confused on this Unicode stuff.  The fact that
PHP and ICU use UTF-16 internally has absolutely nothing to do with
what is exposed at the scripting level.

The only things that will break in a standard application are those that
rely on strings being binary.  Normal text passing back and forth
between the browser and the server will work just fine.

The breakages, apart from various bugs at this early stage, are limited
to places where the code is expecting to see a binary string and PHP
hasn't been able to determine this automatically.  And hopefully we can
come up with ways to automatically determine when something should
default to a binary string.
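
To illustrate the kind of code that expects a binary string, here is a
minimal sketch.  The helper name looksLikePng() and the sample data are
made up for illustration.  The check only works if every string offset
is exactly one byte, which is the assumption that has to be preserved
(or detected automatically) for data like this.

```php
<?php
// Hypothetical helper: checks the 8-byte PNG signature at the start of $data.
// It relies on strlen() and substr() counting bytes, not characters.
function looksLikePng($data)
{
    $signature = "\x89PNG\r\n\x1a\n";
    return strlen($data) >= 8 && substr($data, 0, 8) === $signature;
}

// Made-up sample payload: a real signature followed by filler bytes.
$png = "\x89PNG\r\n\x1a\n" . "rest of file";

var_dump(looksLikePng($png));      // bool(true)
var_dump(looksLikePng("GIF89a"));  // bool(false)
var_dump(ord($png[0]));            // int(137), i.e. 0x89
```

Ordinary text handling (form fields, templates, output) makes no such
byte-level assumption, which is why it is not affected.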

But if you write:

$a = "マニュアル";
echo $a[1];

and you expect that to spew out a single raw byte (0x83, the second
byte of マ, if the script is saved as UTF-8), then yes, it will break,
because it will result in ニ, which is what it really should do.
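
To make the two behaviours concrete, here is a sketch that runs on a
byte-oriented PHP.  It assumes the script is saved as UTF-8, where マ is
the three bytes e3 83 9e, so offset 0 holds 0xe3 and offset 1 holds
0x83.  The code point lookup is only emulated here with preg_split()
and the /u modifier; that is my stand-in for illustration, not the
native Unicode indexing described above.

```php
<?php
// Assumes this file is saved as UTF-8.
$a = "マニュアル";

// Byte semantics: offset 1 is the second byte of マ.
$byteHex = bin2hex($a[1]);

// Emulated code point semantics: split on UTF-8 character boundaries.
$chars = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
$char = $chars[1];

echo strlen($a), " bytes, ", count($chars), " code points\n"; // 15 bytes, 5 code points
echo "byte at offset 1: 0x", $byteHex, "\n";                  // 0x83
echo "character at offset 1: ", $char, "\n";                  // ニ
```

Echoing the lone byte produces invalid UTF-8 in the browser, while the
second form produces the character a reader would expect.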

And yes, I know a lot of people reading this list don't care much for
other charsets, but people reading an English-language mailing list are
rather self-selecting.

-Rasmus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
