Re: [PHP-DEV] Unicode Implementation

Andrei Zmievski Thu, 06 Oct 2005 12:57:08 -0700

On Oct 6, 2005, at 10:56 AM, Derick Rethans wrote:

I am thinking that we're doing something with the unicode implementation
and that's that we're now getting duplicate implementations of quite some
things: functions, internal functions, hash implementations, two ways for
storing identifiers... only because we need to support both IS_STRING and
IS_UNICODE and unicode=off mode.
I think I would prefer an IS_UNICODE/unicode=on only PHP.

This would mean that:
- no duplicate functionality for tons of functions that will make
  maintaining the thing very hard


This is true.

- a cleaner (and a bit faster) Unicode implementation


This is true too.

- we have a bit less BC.

"A bit less"? I'd say it would break BC in a major way. People who want
to upgrade to PHP 6 would need to rewrite a lot of their scripts.

Internally we would only see IS_UNICODE and IS_BINARY, where we can have a
small layer around extensions which return IS_STRING where we automatically
convert it to and from unicode for those extensions. IS_STRING strings will
still exist, but should not be there for the "user level".

For things like:
        $str = unicode_convert($unicode, 'iso-2022');
and $unicode being "IS_UNICODE". $str will now be an IS_BINARY string, with
all the restrictions that we already have on those strings (like no
automatic conversions).
Functions that work on binary strings can be quite limited (we wouldn't
need a strtolower for that f.e.), so we are cutting down in a lot of
duplicated code. The same goes for not having to support both unicode=off
and unicode=on mode, as that can make things a bit complicated too. This
will limit functionality on binary strings a bit though, but I think this
is 10 times better than an unmaintainable PHP with Unicode support.

Sure, if you remove requirement for BC and merge the string/binary
semantics, you can use IS_BINARY for all that stuff.

Besides this, I ran some micro benchmarks on about 600 characters of text
with a few functions and benchmarked their behavior between unicode=1 and
unicode=0 mode. Results:
strrev (100.000 iterations over 600 characters of normalized latin text):
        unicode off: 1.8secs
        unicode on:  5.0secs

strtoupper (100.000 iterations over the same text):
        unicode off: 2.2secs
        unicode on:  7.9secs

substr(50, 100) (1.000.000 over the same text):
        unicode off: 3.9secs
        unicode on: 11.9secs
This is something I find quite not acceptable, and we need to figure out a
way on how to optimize this - for substr the penalty is probably that we
are using an iterator and not a direct memcpy (because of surrogates), I am
not so sure about the others.
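
To illustrate the surrogate point: in UTF-16 a character outside the BMP
takes two 16-bit code units, so the code-unit offset of "character 50" is
not known without walking the string. The following is a minimal sketch in
plain C, not the actual engine code; the names skip_code_points and
substr_range are hypothetical and invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if u is the first (high) half of a surrogate pair. */
static inline int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
/* True if u is the second (low) half of a surrogate pair. */
static inline int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Hypothetical helper: starting at code-unit index i, step forward over
 * n code points and return the resulting code-unit index (clamped to len). */
size_t skip_code_points(const uint16_t *s, size_t len, size_t i, size_t n)
{
    assert(s != NULL);
    while (n > 0 && i < len) {
        if (is_lead(s[i]) && i + 1 < len && is_trail(s[i + 1])) {
            i += 2;   /* surrogate pair: one code point, two code units */
        } else {
            i += 1;   /* BMP character (or an unpaired surrogate) */
        }
        n--;
    }
    return i;
}

/* Hypothetical helper: find the code-unit range [*start, *end) that holds
 * 'count' code points beginning at code point 'offset'. Only after this
 * walk can the bytes be copied with a single memcpy. */
void substr_range(const uint16_t *s, size_t len, size_t offset, size_t count,
                  size_t *start, size_t *end)
{
    *start = skip_code_points(s, len, 0, offset);
    *end   = skip_code_points(s, len, *start, count);
}
```

With a fixed-width 8-bit string the same range is just offset and
offset + count, which is why the unicode=off path can go straight to memcpy.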

We can try switching to _UNSAFE versions of the iterator macros - they
assume well-formed UTF-16, so they will be somewhat faster.
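
A rough sketch of the difference, written as plain C functions rather than
the real macros. I am assuming the macros in question are the ICU ones
(U16_NEXT and U16_NEXT_UNSAFE); next_checked and next_unchecked below are
hypothetical stand-ins that only mimic the idea, not the ICU definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Checked step: verifies that a lead surrogate is followed by a trail
 * surrogate and that we stay inside the buffer. An unpaired surrogate is
 * returned as-is. */
uint32_t next_checked(const uint16_t *s, size_t *i, size_t len)
{
    assert(*i < len);
    uint32_t c = s[(*i)++];
    if ((c & 0xFC00) == 0xD800 && *i < len && (s[*i] & 0xFC00) == 0xDC00) {
        c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(s[(*i)++] - 0xDC00);
    }
    return c;
}

/* Unchecked step: assumes well-formed UTF-16, so a lead surrogate is
 * always followed by a trail. No bounds test and no trail test, which
 * means fewer branches per character but undefined results on bad input. */
uint32_t next_unchecked(const uint16_t *s, size_t *i)
{
    uint32_t c = s[(*i)++];
    if ((c & 0xFC00) == 0xD800) {
        c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(s[(*i)++] - 0xDC00);
    }
    return c;
}
```

The unchecked form is only safe if the engine guarantees that every
IS_UNICODE string is validated when it is created or converted.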


-Andrei

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
