On Oct 6, 2005, at 10:56 AM, Derick Rethans wrote:
> I am thinking that we have a problem with the Unicode implementation:
> we're now getting duplicate implementations of quite a few things
> (functions, internal functions, hash implementations, two ways of
> storing identifiers...), only because we need to support both
> IS_STRING and IS_UNICODE and unicode=off mode.
>
> I think I would prefer an IS_UNICODE/unicode=on only PHP.
> This would mean that:
>
> - no duplicated functionality across tons of functions, which would
>   otherwise make the thing very hard to maintain

This is true.

> - a cleaner (and a bit faster) Unicode implementation

This is true too.

> - we have a bit less BC.

"A bit less"? I'd say it would break BC in a major way. People who want
to upgrade to PHP 6 would need to rewrite a lot of their scripts.

> Internally we would only see IS_UNICODE and IS_BINARY, with a small
> layer around extensions that return IS_STRING, in which we
> automatically convert to and from Unicode for those extensions.
> IS_STRING strings will still exist, but should not be visible at the
> "user level".
>
> For things like:
>
>     $str = unicode_convert($unicode, 'iso-2022');
>
> with $unicode being IS_UNICODE, $str will now be an IS_BINARY string,
> with all the restrictions that we already have on those strings (like
> no automatic conversions).
>
> Functions that work on binary strings can be quite limited (we
> wouldn't need a strtolower for that, for example), so we would be
> cutting down on a lot of duplicated code.
>
> The same goes for not having to support both unicode=off and
> unicode=on mode, as that can make things a bit complicated too. This
> will limit functionality on binary strings a bit, but I think this is
> 10 times better than an unmaintainable PHP with Unicode support.

Sure, if you remove the requirement for BC and merge the string/binary
semantics, you can use IS_BINARY for all that stuff.

> Besides this, I ran some micro-benchmarks with a few functions on
> about 600 characters of text, comparing their behavior in unicode=1
> and unicode=0 mode. Results:
>
> strrev (100,000 iterations over 600 characters of normalized Latin text):
>     unicode off:  1.8 secs
>     unicode on:   5.0 secs
>
> strtoupper (100,000 iterations over the same text):
>     unicode off:  2.2 secs
>     unicode on:   7.9 secs
>
> substr(50, 100) (1,000,000 iterations over the same text):
>     unicode off:  3.9 secs
>     unicode on:  11.9 secs
>
> This is something I find quite unacceptable, and we need to figure
> out a way to optimize this. For substr the penalty is probably that
> we are using an iterator and not a direct memcpy (because of
> surrogates); I am not so sure about the others.

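To make the surrogate point concrete, here is a minimal C sketch (not the actual PHP or ICU code) of why a substr over UTF-16 cannot simply jump to an offset and memcpy. The helper names u16_forward, u16_substr and bytes_substr are made up for this example, and it assumes offsets are counted in code points for the Unicode case and in bytes for the binary case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lead (high) surrogates are 0xD800..0xDBFF, trail (low) are 0xDC00..0xDFFF. */
static int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Hypothetical helper: advance from unit index i by n code points,
 * never moving past len. Every step has to inspect the data. */
static size_t u16_forward(const uint16_t *s, size_t len, size_t i, size_t n)
{
    while (n > 0 && i < len) {
        if (is_lead(s[i]) && i + 1 < len && is_trail(s[i + 1])) {
            i += 2;             /* surrogate pair: one code point, two units */
        } else {
            i += 1;             /* BMP character or unpaired surrogate */
        }
        n--;
    }
    return i;
}

/* Copy `count` code points starting at code point `start` into dst.
 * dst must have room for at least len units. Returns units copied. */
static size_t u16_substr(const uint16_t *s, size_t len,
                         size_t start, size_t count, uint16_t *dst)
{
    assert(s != NULL && dst != NULL);
    size_t from = u16_forward(s, len, 0, start);    /* O(start) scan */
    size_t to   = u16_forward(s, len, from, count); /* O(count) scan */
    memcpy(dst, s + from, (to - from) * sizeof *s);
    return to - from;
}

/* Byte-string counterpart: offsets map directly to addresses, so the
 * whole operation is a bounds clamp plus one memcpy. */
static size_t bytes_substr(const char *s, size_t len,
                           size_t start, size_t count, char *dst)
{
    assert(s != NULL && dst != NULL);
    if (start > len) start = len;
    if (count > len - start) count = len - start;
    memcpy(dst, s + start, count);
    return count;
}
```

For instance, with the units { 'a', 0xD83D, 0xDE00, 'b', 'c' }, u16_substr(s, 5, 1, 2, dst) copies three units (the surrogate pair plus 'b'), while the byte version never has to look at the data to find its offsets.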
We can try switching to _UNSAFE versions of the iterator macros - they
assume well-formed UTF-16, so they will be somewhat faster.
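
A rough sketch of the trade-off, written as plain C functions rather than the real ICU macros. The names fwd1_safe, fwd1_unsafe, count_safe and count_unsafe are hypothetical, loosely modeled on the safe and _UNSAFE forward-iteration macros in ICU's utf16.h; the point is only that the unsafe variant drops the bounds check and the trail-surrogate check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Safe step: checks the buffer bound and that a lead surrogate is
 * really followed by a trail surrogate before skipping two units. */
static size_t fwd1_safe(const uint16_t *s, size_t i, size_t len)
{
    assert(i < len);
    if ((s[i] & 0xFC00) == 0xD800 &&
        i + 1 < len &&
        (s[i + 1] & 0xFC00) == 0xDC00) {
        return i + 2;
    }
    return i + 1;
}

/* Unsafe step: trusts the input to be well-formed UTF-16, so seeing a
 * lead surrogate is enough to skip two units. One test, no bound. */
static size_t fwd1_unsafe(const uint16_t *s, size_t i)
{
    return i + 1 + ((s[i] & 0xFC00) == 0xD800);
}

/* Count code points using the safe step. */
static size_t count_safe(const uint16_t *s, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        i = fwd1_safe(s, i, len);
        n++;
    }
    return n;
}

/* Count code points using the unsafe step. Only meaningful when the
 * caller can guarantee well-formed input. */
static size_t count_unsafe(const uint16_t *s, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        i = fwd1_unsafe(s, i);
        n++;
    }
    return n;
}
```

On well-formed text both loops agree; on ill-formed input such as a lone lead surrogate followed by 'a', the unsafe loop swallows the 'a' as if it were a trail unit. So the faster path only makes sense where strings are validated once on the way in.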
-Andrei
--
PHP Internals - PHP Runtime Development Mailing List