On 2/4/06, Jessie Hernandez <[EMAIL PROTECTED]> wrote:
> Hi Andrei,
>
> Pardon me for my ignorance, as I have not even looked at the Unicode
> stuff, but based on what you wrote, what about always allocating two
> UChars per codepoint? It would take a bit more space, but then
> random-offset indexing is fast and easy (the codepoint would always
> start at "index << 1").
What you describe is a fixed-width encoding like UCS-2, not UTF-16.
I know little about ICU either, but for those who aren't familiar:
according to the ICU manual, UCS-2 is a subset of UTF-16 and is
deprecated. UTF-32 (UCS-4?) takes more memory, which "kills
performance" because of memory bandwidth. There is also a list of
advantages of UTF-16 over UTF-8 (in the ICU manual). These reasons are
why ICU uses UTF-16 throughout.
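
To put rough numbers on the memory argument, here is a small Python
sketch (not ICU or PHP code; encoded_sizes is a made-up helper name)
that counts the bytes each encoding needs for the same text:

```python
def encoded_sizes(text):
    """Return the encoded size in bytes for each Unicode encoding form.

    The "-le" codecs are used so that no byte order mark is counted.
    """
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


if __name__ == "__main__":
    for sample in ("hello", "日本語"):
        print(sample, encoded_sizes(sample))
```

For ASCII text UTF-8 is the smallest, for CJK text UTF-16 is smaller
than UTF-8, and UTF-32 is always 4 bytes per code point.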

I see no advantage for UTF-16 over UTF-8 when it comes to random
access by string offset, because both use a variable number of code
units per code point.
Both UCS-2 and UCS-4 are good for random access.
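
A rough Python sketch of why that is (again not ICU or PHP code; the
helper names utf16_units and code_point_at are made up for
illustration). In UTF-16, a code point outside the BMP takes two code
units (a surrogate pair), so finding the n-th code point means
scanning from the start instead of indexing directly:

```python
import struct


def utf16_units(text):
    """Return the 16-bit UTF-16 code units of a Python string as a list."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_point_at(units, index):
    """Return the index-th code point by scanning from the start, O(n)."""
    i = 0
    seen = 0
    while i < len(units):
        unit = units[i]
        # A high surrogate followed by a low surrogate encodes one code point.
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            cp = unit
            width = 1
        if seen == index:
            return cp
        seen += 1
        i += width
    raise IndexError(index)


if __name__ == "__main__":
    units = utf16_units("a\U0001D11Eb")  # 'a', U+1D11E (outside the BMP), 'b'
    print([hex(u) for u in units])        # 4 code units for 3 code points
    print(hex(units[2]))                  # naive indexing hits a low surrogate
    print(hex(code_point_at(units, 2)))   # scanning finds 'b'
```

With a fixed-width form, units[index] would already be the answer,
which is the whole appeal of UCS-2/UCS-4 for random access.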

IMHO, while you guys are solving this problem in a way that is not
zero-cost, it would be nice to have a compile-time mode that uses
UCS-2 instead of UTF-16 for those who care about performance,
something like
php-src/configure --with-icu-encoding=UCS-2 (defaulting to UTF-16).
(The code would have to be aware of whether UCS-2 is in use.) Code
points outside the BMP aren't needed in every case.
