Re: [PHP-DEV] Where are we ACTUALLY on Unicode?

Lester Caine Tue, 16 Mar 2010 01:31:07 -0700

Stanislav Malyshev wrote:

Hi!

What I am probably asking is what was the brick wall PHP6 hit. I was
under the impression that there was no agreement on 'switchable or only'
to unicode core? ( And those who did write PHP6 books seemed to have
their own views on which way the discussions would go ;) ).


From what I can see, the biggest issues are these:
1. Performance - Unicode-based PHP right now requires tons of
conversions when talking to outside world (like MySQL) which slows down
the app significantly. Many extensions frequently used by PHP app
writers (such as mysql, pcre, etc.) do not support UTF-16 properly.
Also, inflated memory usage hurts scalability a lot.
2. Compatibility - it's hard to make existing app works with Unicode and
doesn't lose in performance or doesn't have any weird scenarios where
your passwords suddenly stop working because there's an extra recoding
step in some md5() call.
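
To make the md5() point concrete, here is a minimal sketch (in Python rather than PHP, purely for illustration) of how the same password text hashes differently once an extra recoding step changes its bytes. The password value is a made-up example, and the three encodings are assumptions chosen to show the effect, not a description of what PHP 6 actually did internally.

```python
import hashlib

# Hypothetical password containing one non-ASCII character (a-umlaut).
password = "p\u00e4ssword"

# The same text as three different byte sequences.
encodings = ["latin-1", "utf-8", "utf-16-le"]

# md5 works on bytes, so each encoding yields a different digest.
digests = {enc: hashlib.md5(password.encode(enc)).hexdigest() for enc in encodings}

for enc, digest in digests.items():
    print(f"{enc:10s} {digest}")

# A pure-ASCII password is immune to a latin-1 <-> UTF-8 recoding,
# which is why the problem can stay hidden until a non-ASCII password shows up.
ascii_password = "password"
ascii_same = (
    hashlib.md5(ascii_password.encode("latin-1")).hexdigest()
    == hashlib.md5(ascii_password.encode("utf-8")).hexdigest()
)
```

A hash stored when the string was latin-1 bytes will not match a hash computed after the string has been recoded, even though the user typed exactly the same characters.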


I think that there does need to be a proper review of just what the target is?

There are a number of 'unknowns', such as how one identifies the version of Unicode being used. Differences seem to exist between OSes, which doesn't help with that problem.

On-disk storage should probably be UTF-8 without any question? Windows' use of wide strings for some files simply doubles the on-disk storage requirement for very little gain. And remembering to convert '.reg' files back to normal raw text so I can read them on the Linux machines adds to the fun.

In-memory handling of character strings is, I think, where some alternative methods may be appropriate. Firebird's original UNICODE_FSS character set was 3 bytes per character ( that covers the Basic Multilingual Plane, although full UTF-8 needs up to 4 bytes ;) ) and so all of the character counting stuff works transparently. Firebird records are automatically compressed before storage, so white space in character strings is not wasting space on disk, and the Unicode collations get compressed in the same way.

'3' is not a very processor-friendly number, so working with 4, even though wasteful on memory, does make perfect sense. How long is it since we had a 640k limit on working memory? SERVERS should have a good amount of memory for caching information anyway. SO is UTF-16 the right approach for processing wide strings? It needs special code to handle everything wider than 16 bits, but at what gain really? If all core functionality is handled as 32-bit characters, is there that much of an overhead over the additional processing to get around strings of dissimilar sizes in UTF-16?
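
As a rough illustration of the size trade-off being asked about, the sketch below (Python, for illustration only) prints how many bytes one character takes in UTF-8, UTF-16 and UTF-32. The four sample characters are my own picks; the last one lies outside the Basic Multilingual Plane, which is the case where UTF-16 needs a surrogate pair and so stops being fixed-width.

```python
# Sample characters chosen for illustration (not from the original mail).
samples = {
    "ascii": "A",            # U+0041
    "latin": "\u00e9",       # U+00E9, e-acute
    "bmp": "\u20ac",         # U+20AC, euro sign
    "astral": "\U0001D11E",  # U+1D11E, musical G clef (outside the BMP)
}


def encoded_sizes(ch):
    """Return the byte length of one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le avoids counting a BOM
        "utf-32": len(ch.encode("utf-32-le")),
    }


for name, ch in samples.items():
    print(name, encoded_sizes(ch))
```

Only the 32-bit form is the same width for every sample; UTF-16 is 2 bytes for everything in the BMP and 4 bytes beyond it, which is the 'special code' case mentioned above.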

Most of my own data handling is done via the database anyway, so queries return data already sorted and filtered. There is no point pulling unprocessed data and then throwing much of it away, hence the rest of the infrastructure being used is important to get the best performance?

Probably 90% of the time a string will come in and go out without requiring any processing at all, so leave it as UTF-8? The only time we need to accurately know the number and position of characters is when we need to do some string processing, and then only if the strings use multibyte characters. SO how about an additional couple of flags on a string variable. When a UTF-8 string is loaded, it is counted for bytes, and characters, and the maximum number of bytes per character. If bytes and characters are the same ... no problems. If the number of bytes per character is greater than 1, then string handling needs to 'open them up' before processing, and '2' just uses an efficient UTF-16 processing, while '3+' goes to 32-bit processing?
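
The flag idea above can be sketched roughly as follows. This is Python for illustration only, not a proposal for the actual engine code; the function names scan_utf8 and choose_path and the path labels are hypothetical. It follows the thresholds exactly as written above, although 3-byte UTF-8 sequences still fit in a single 16-bit unit, so the '2' bucket could arguably be widened to cover them.

```python
def scan_utf8(data: bytes):
    """Return (byte_count, char_count, max_bytes_per_char) for a UTF-8 string.

    Hypothetical helper: one pass over the bytes at load time.
    """
    data.decode("utf-8")  # validate first; raises UnicodeDecodeError if malformed
    char_count = 0
    max_width = 0
    for b in data:
        if (b & 0xC0) == 0x80:
            continue  # continuation byte, belongs to the previous character
        char_count += 1
        if b < 0x80:
            width = 1  # 0xxxxxxx: ASCII
        elif b < 0xE0:
            width = 2  # 110xxxxx: 2-byte sequence
        elif b < 0xF0:
            width = 3  # 1110xxxx: 3-byte sequence
        else:
            width = 4  # 11110xxx: 4-byte sequence
        max_width = max(max_width, width)
    return len(data), char_count, max_width


def choose_path(data: bytes) -> str:
    """Pick a processing path from the flags, using the thresholds in the mail."""
    byte_count, char_count, max_width = scan_utf8(data)
    if byte_count == char_count:
        return "bytes"   # every character is one byte: no conversion needed
    if max_width == 2:
        return "utf-16"  # 'open up' to 16-bit units
    return "utf-32"      # 3 or more bytes per character: 32-bit units


print(choose_path("hello".encode("utf-8")))
print(choose_path("caf\u00e9".encode("utf-8")))
print(choose_path("\u20ac100".encode("utf-8")))
```

The scan costs one pass over the string, and strings that are only passed through never pay for anything more than that.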

Am I missing something? Why does Unicode have to complicate things when in reality they are quite simple? Legacy stuff gets converted to UTF-8 and in many cases the user will not even see a difference, but the 'unicode on/off' switch just allows 127 single byte characters rather than 255? Currently all the multilingual stuff IS passing through PHP transparently and it would seem we can use Unicode for variable names? So what IS missing?


--
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
