On 2/4/06, Jessie Hernandez <[EMAIL PROTECTED]> wrote:
> Hi Andrei,
>
> Pardon me for my ignorance, as I have not even looked at the Unicode
> stuff, but based on what you wrote, what about always allocating two
> UChars per codepoint? It would take a bit more space, but then
> random-offset indexing is fast and easy (the codepoint would always
> start at "index << 1").

What you describe is a fixed-width encoding like UCS-2 (or UTF-32, if every code point takes two UChars), not UTF-16. I know little about ICU either, but for those who are not familiar: according to the ICU manual, UCS-2 is a subset of UTF-16 and is deprecated. UTF-32 (UCS-4?) takes more memory, which "kills performance" because of memory bandwidth. The ICU manual also lists advantages of UTF-16 over UTF-8. These reasons are why ICU uses UTF-16 throughout.
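
To put rough numbers on the memory argument, here is a small Python sketch (not ICU code; `encoded_sizes` is a made-up helper name) that counts the bytes the same text needs in each encoding form. It only illustrates storage size and says nothing about actual ICU performance.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# BMP-only text: UTF-16 uses exactly one 16-bit code unit per code point,
# so it is byte-for-byte what UCS-2 would store.
bmp_sizes = encoded_sizes("h\u00e9llo")

# 'a' followed by U+1D11E (outside the BMP): UTF-16 needs a surrogate pair
# for the second code point, UTF-32 still uses 4 bytes per code point.
astral_sizes = encoded_sizes("a\U0001D11E")

print(bmp_sizes)
print(astral_sizes)
```

For text that stays inside the BMP, UTF-32 is always twice the size of UTF-16, which is the bandwidth cost mentioned above.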

I see no advantage of UTF-16 over UTF-8 on the problem of random string offset access, because both take a variable number of code units per code point. Both UCS-2 and UCS-4 are good for random access. IMHO, while you guys are solving the problem in some way that is not zero cost, it would be nice to have a compile-time mode that uses UCS-2 instead of UTF-16 for those who care about performance, something like

    php-src/configure --with-icu-encoding=UCS-2

(defaulting to UTF-16). The code would have to be aware of whether UCS-2 is in use. Code points outside the BMP aren't useful in every case.
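
A minimal Python sketch of the difference, assuming well-formed input (this is not how ICU or PHP do it, and all function names are made up for illustration): with UTF-16, finding the n-th code point means walking the code units and skipping over surrogate pairs, while a BMP-only (UCS-2) buffer can be indexed directly.

```python
def to_utf16_units(text: str) -> list:
    """Encode `text` as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_lead_surrogate(unit: int) -> bool:
    """True if `unit` is the first half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF


def code_point_at_utf16(units: list, index: int) -> int:
    """Return the code point at code point offset `index`.

    O(n): every earlier code point has to be inspected, because each one
    occupies either one or two code units. Assumes well-formed UTF-16.
    """
    pos = 0
    for _ in range(index):
        pos += 2 if is_lead_surrogate(units[pos]) else 1
    unit = units[pos]
    if is_lead_surrogate(unit):
        trail = units[pos + 1]
        return 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00)
    return unit


def code_point_at_ucs2(units: list, index: int) -> int:
    """O(1): valid only if the buffer holds no surrogates (BMP-only text)."""
    return units[index]


units = to_utf16_units("a\U0001D11Eb")  # 'a', U+1D11E, 'b' -> 4 code units
print(hex(code_point_at_utf16(units, 1)))
print(hex(code_point_at_utf16(units, 2)))
```

The direct-index version is what a UCS-2 build would get for free, at the price of not representing anything outside the BMP.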