On Oct 6, 2005, at 10:56 AM, Derick Rethans wrote:

I am thinking that we're doing something wrong with the Unicode implementation: we're now getting duplicate implementations of quite a few things (functions, internal functions, hash implementations, two ways of storing identifiers...) only because we need to support both IS_STRING and IS_UNICODE and unicode=off mode.

I think I would prefer an IS_UNICODE/unicode=on only PHP.

This would mean that:
- no duplicate functionality for tons of functions, which would otherwise make
  maintaining the thing very hard

This is true.

- a cleaner (and a bit faster) Unicode implementation

This is true too.

- we have a bit less BC.

"A bit less"? I'd say it would break BC in a major way. People who want to upgrade to PHP 6 would need to rewrite a lot of their scripts.

Internally we would only see IS_UNICODE and IS_BINARY, with a small layer around extensions that return IS_STRING, which automatically converts to and from Unicode for those extensions. IS_STRING strings will still exist, but should not be visible at the "user level".

For things like:
        $str = unicode_convert($unicode, 'iso-2022');
where $unicode is IS_UNICODE, $str will now be an IS_BINARY string, with all the restrictions that we already have on those strings (like no automatic conversions).

Functions that work on binary strings can be quite limited (we wouldn't need a strtolower for those, for example), so we would cut down on a lot of duplicated code. The same goes for not having to support both unicode=off and unicode=on mode, as that can make things a bit complicated too. This will limit functionality on binary strings a bit, but I think this is 10 times better than an unmaintainable PHP with Unicode support.

Sure, if you remove requirement for BC and merge the string/binary semantics, you can use IS_BINARY for all that stuff.

Besides this, I ran some micro benchmarks on about 600 characters of text with a few functions and benchmarked their behavior between unicode=1 and unicode=0
mode. Results:

strrev (100.000 iterations over 600 characters of normalized latin text):
        unicode off: 1.8secs
        unicode on:  5.0secs

strtoupper (100.000 iterations over the same text):
        unicode off: 2.2secs
        unicode on:  7.9secs

substr(50, 100) (1.000.000 iterations over the same text):
        unicode off: 3.9secs
        unicode on: 11.9secs

I find this quite unacceptable, and we need to figure out a way to optimize it. For substr the penalty is probably that we are using an iterator and not a direct memcpy (because of surrogates); I am not so sure about the others.
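
To make the substr point concrete, here is a minimal sketch in C of the two strategies. It is not PHP or ICU source; the names binary_substr, utf16_next and utf16_substr are hypothetical and exist only for illustration. The byte version can jump straight to the offset, while the UTF-16 version has to walk the string, because a surrogate pair occupies two code units but counts as one character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte-oriented substr. One bounds check, one memcpy.
 * Returns the number of bytes written to dst. */
static size_t binary_substr(const char *src, size_t len,
                            size_t start, size_t count, char *dst)
{
    assert(src != NULL && dst != NULL);
    if (start >= len) return 0;
    if (count > len - start) count = len - start;
    memcpy(dst, src + start, count);
    return count;
}

/* Hypothetical helper: step over one code point starting at index i.
 * A lead surrogate followed by a trail surrogate counts as one step. */
static size_t utf16_next(const uint16_t *s, size_t len, size_t i)
{
    if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
        s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
        return i + 2;
    return i + 1;
}

/* Hypothetical helper: substr measured in code points. The start offset
 * cannot be computed directly, so the string is walked from the beginning.
 * Returns the number of UTF-16 code units written to dst. */
static size_t utf16_substr(const uint16_t *src, size_t len,
                           size_t start, size_t count, uint16_t *dst)
{
    size_t i = 0, begin, end;
    assert(src != NULL && dst != NULL);
    while (start > 0 && i < len) { i = utf16_next(src, len, i); start--; }
    begin = i;
    while (count > 0 && i < len) { i = utf16_next(src, len, i); count--; }
    end = i;
    memcpy(dst, src + begin, (end - begin) * sizeof(uint16_t));
    return end - begin;
}
```

The final copy is still a memcpy in both cases; the extra cost in the UTF-16 version is the walk needed to find where the copy begins and ends.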

We can try switching to _UNSAFE versions of the iterator macros - they assume well-formed UTF-16, so they will be somewhat faster.
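
As a rough illustration of the difference, here is a sketch in C of a checked and an unchecked forward step over UTF-16. These are hand-written stand-ins, not the actual iterator macros; fwd1_checked, fwd1_unsafe and the two count functions are hypothetical names. The unchecked step drops the bounds test and the trail-surrogate test, which is only valid if the input is known to be well-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Checked step: only skips two units if a lead surrogate is followed,
 * within bounds, by a trail surrogate. */
static size_t fwd1_checked(const uint16_t *s, size_t len, size_t i)
{
    assert(i < len);
    uint16_t c = s[i++];
    if ((c & 0xFC00) == 0xD800 && i < len && (s[i] & 0xFC00) == 0xDC00)
        i++;
    return i;
}

/* Unchecked step: assumes well-formed UTF-16, so a lead surrogate is
 * always taken to be followed by its trail. One test instead of three. */
static size_t fwd1_unsafe(const uint16_t *s, size_t i)
{
    return i + 1 + ((s[i] & 0xFC00) == 0xD800);
}

/* Count code points with the checked step. */
static size_t count_checked(const uint16_t *s, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) { i = fwd1_checked(s, len, i); n++; }
    return n;
}

/* Count code points with the unchecked step. */
static size_t count_unsafe(const uint16_t *s, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) { i = fwd1_unsafe(s, i); n++; }
    return n;
}
```

On well-formed text both counts agree. On ill-formed text (for example an unpaired lead surrogate) the unchecked version swallows the following code unit, so it is only appropriate where the engine can guarantee that its internal strings are valid.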

-Andrei

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
