[PHP-DEV] Unicode Implementation

Derick Rethans Thu, 06 Oct 2005 10:57:21 -0700

Hello!

I think we're doing something wrong with the Unicode implementation: we're now
getting duplicate implementations of quite a few things (functions, internal
functions, hash implementations, two ways of storing identifiers...), only
because we need to support both IS_STRING and IS_UNICODE as well as
unicode=off mode.


I think I would prefer an IS_UNICODE/unicode=on only PHP.

This would mean:
- no duplicate functionality for tons of functions, which would otherwise make
  maintaining the thing very hard
- a cleaner (and a bit faster) Unicode implementation
- a bit less BC.

Internally we would only see IS_UNICODE and IS_BINARY. We can have a small
layer around extensions that return IS_STRING, which automatically converts
their strings to and from Unicode. IS_STRING strings will still exist, but
should not be visible at the "user level".

For things like:
        $str = unicode_convert($unicode, 'iso-2022');
where $unicode is IS_UNICODE, $str will now be an IS_BINARY string, with all
the restrictions that we already have on those strings (like no automatic
conversions).

The set of functions that work on binary strings can be quite limited (we
wouldn't need a strtolower for them, for example), so we would cut down on a
lot of duplicated code. The same goes for not having to support both
unicode=off and unicode=on mode, as that complicates things too. This will
limit functionality on binary strings a bit, but I think that is 10 times
better than an unmaintainable PHP with Unicode support.

Besides this, I ran some micro benchmarks of a few functions on about 600
characters of text, comparing their behavior in unicode=1 and unicode=0 mode.
Results:

strrev (100.000 iterations over 600 characters of normalized latin text):
        unicode off: 1.8secs
        unicode on:  5.0secs

strtoupper (100.000 iterations over the same text):
        unicode off: 2.2secs
        unicode on:  7.9secs

substr(50, 100) (1.000.000 iterations over the same text):
        unicode off: 3.9secs
        unicode on: 11.9secs

I find this not acceptable, and we need to figure out a way to optimize it.
For substr the penalty is probably that we are using an iterator and not a
direct memcpy (because of surrogates); I am not so sure about the others.
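
To illustrate the substr point, here is a minimal sketch in C of why surrogates
rule out a plain memcpy. It is not the actual PHP or ICU code: the function
names (substr_bytes, substr_utf16, utf16_offset_of) are made up for this
example, and it assumes UTF-16 storage with offsets counted in code points.
For a binary string the start offset is a direct index; for UTF-16 the buffer
has to be walked, because a code point may take one or two 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the code unit offset reached after skipping
 * up to n code points in a UTF-16 buffer of len code units. */
static size_t utf16_offset_of(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;

    while (n > 0 && i < len) {
        /* A high surrogate followed by a low surrogate is one code point
         * stored in two code units. */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;
        } else {
            i += 1;
        }
        n--;
    }
    return i;
}

/* Binary substr: the offset is known up front, so one memcpy is enough.
 * Returns the number of bytes copied. */
static size_t substr_bytes(const char *s, size_t len,
                           size_t start, size_t count, char *out)
{
    assert(s != NULL && out != NULL);
    if (start > len) start = len;
    if (count > len - start) count = len - start;
    memcpy(out, s + start, count);
    return count;
}

/* UTF-16 substr, start and count given in code points: both ends have to be
 * found by iterating before anything can be copied.
 * Returns the number of code units copied. */
static size_t substr_utf16(const uint16_t *s, size_t len,
                           size_t start, size_t count, uint16_t *out)
{
    size_t from, span;

    assert(s != NULL && out != NULL);
    from = utf16_offset_of(s, len, start);
    span = utf16_offset_of(s + from, len - from, count);
    memcpy(out, s + from, span * sizeof(uint16_t));
    return span;
}
```

With the five code units 0061 D834 DD1E 0062 0063 (four code points), asking
substr_utf16 for two code points starting at code point 1 copies three code
units (D834 DD1E 0062). The walk is linear in start + count, which would fit
the kind of slowdown measured above, although this sketch does not prove that
this is where the time goes.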

regards,
Derick

-- 
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php