On Sun, Mar 14, 2010 at 3:23 PM, Jordi Boggiano <j.boggi...@seld.be> wrote:
> On Sun, Mar 14, 2010 at 12:03 PM, Stan Vassilev <sv_for...@fmethod.com> wrote:
>> UTF8 also takes 4 bytes for representing characters in the higher bit
>> planes, as quite a lot of bits are lost for every char in order to describe
>> how long the code point is, and when it ends and so on. This means
>> memory-wise it may not be of big benefit to asian countries.
>
> I remember Brian Aker saying that they chose to work internally with
> UTF-8 for Drizzle. His explanation of it was that asian countries have
> so much english content mixed in that on average even for them UTF-8
> still had a lower footprint than UTF-16/32. I do not know where the
> stats came from, but if it holds any truth it is worth considering.
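To make the footprint argument above concrete, here is a minimal Python sketch that counts encoded bytes for a few strings. The sample strings and the `footprint` helper are hypothetical, picked only for illustration; they are not the statistics mentioned in the quote and say nothing about how Drizzle or PHP measure this.

```python
def footprint(text: str) -> dict:
    """Return the encoded size of `text` in bytes for each encoding."""
    # The "-le" codec variants are used so that no byte-order mark is counted.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


# Hypothetical samples (assumptions, not real-world data).
pure_cjk = "日本語のテキスト"                 # 8 characters, all in the BMP
mixed = '<a href="/index.html">日本語</a>'    # 26 ASCII characters + 3 CJK
astral = "\U0002000B"                         # 1 character outside the BMP

for label, sample in (("pure CJK", pure_cjk),
                      ("mixed markup", mixed),
                      ("astral plane", astral)):
    print(label, footprint(sample))
```

In this sketch the pure CJK string costs 24 bytes in UTF-8 against 16 in UTF-16, the markup-heavy string costs 35 against 58, and the single astral-plane character costs 4 bytes in all three encodings. Which case dominates in real content is exactly the open question about the stats.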
The idea behind his reasoning was to optimize for 90% of the cases
while being "fast enough" for the remaining 10% (the numbers could have
been different, but that's the idea). From what I remember of our
discussions, he also mentioned a fast UTF-8-capable string processing
implementation (as fast as a UTF-16 one could be).

I like this 90/10 approach, especially as it actually matches what we
have in PHP.

Cheers,
--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php