Re: [PHP-DEV] Unicode support

Aleksey Tulinov Tue, 14 Oct 2014 13:19:47 -0700

On 14/10/14 21:01, Rowan Collins wrote:

Rowan,

> As I've mentioned before, a lot of the time what people actually want to
> deal with is "grapheme clusters" - the kind of thing that you'd think of
> as a character if you were writing by hand. Most people, if asked the
> length of the string "noël", would answer 4, but there may be 5 code
> points. (That's not just a case of normalisation choices; most
> combinations of letter+diacritic have no single code point, that's why
> the combining forms exist.)

Very good point. I'll give another example: is there a substring "s" in
the string "Maße"? In a case-sensitive search there is no such
substring, but in a case-insensitive search "ß" folds into "ss" and the
substring "s" appears.
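
A minimal sketch of that difference, written in Python purely for illustration. The helper name `contains` is hypothetical, and `str.casefold()` stands in for full Unicode case folding:

```python
def contains(haystack, needle, ignore_case=False):
    """Substring test, optionally using Unicode case folding on both sides."""
    if ignore_case:
        # casefold() applies full case folding, so "ß" becomes "ss"
        haystack = haystack.casefold()
        needle = needle.casefold()
    return needle in haystack


word = "Maße"
print(contains(word, "s"))                    # False: no literal "s" in "Maße"
print(contains(word, "s", ignore_case=True))  # True: folded form is "masse"
```

The folded string is longer than the original (5 code points instead of 4), which is where the trouble with offsets starts.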

This works both ways. For instance, if someone wants to split the string
"MASSE" after "ß" in a case-insensitive manner, one approach might be:
1) find the position of "ß", which is +2; 2) split the string at +3. The
result would be two strings: "MAS" and "SE".
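
The two steps can be sketched as follows, in Python purely for illustration. Both function names are hypothetical. The first function reproduces the naive approach and its surprising result; the second is only a partial fix and assumes that folding does not change the length of the haystack:

```python
def naive_split_after(haystack, needle):
    """Find needle case-insensitively, then cut using the needle's own length."""
    pos = haystack.casefold().find(needle.casefold())
    if pos < 0:
        return None
    # "ß" is one code point, but it matched two ("ss"), so the cut is too early
    cut = pos + len(needle)
    return haystack[:cut], haystack[cut:]


def split_after_folded(haystack, needle):
    """Cut using the folded needle's length.

    Only valid when haystack.casefold() has the same length as haystack,
    otherwise offsets in the folded string do not map back to the original.
    """
    folded_needle = needle.casefold()
    pos = haystack.casefold().find(folded_needle)
    if pos < 0:
        return None
    cut = pos + len(folded_needle)
    return haystack[:cut], haystack[cut:]


print(naive_split_after("MASSE", "ß"))   # ('MAS', 'SE')
print(split_after_folded("MASSE", "ß"))  # ('MASS', 'E')
```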

Back to combining characters: I dig the idea of introducing graphemes,
but I think a French person would write the word "noël" using a
precomposed character. I'm using the French keyboard at
https://translate.google.com/#fr/. "ë" is Shift + "^", then "e", and it
produces the precomposed U+00EB.
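
To make the precomposed/decomposed distinction concrete, here is a small Python sketch (illustration only). The function `approx_grapheme_count` is a hypothetical helper that merely skips combining marks; it is a rough approximation, not full grapheme cluster segmentation:

```python
import unicodedata

precomposed = "no\u00ebl"   # "ë" as the single code point U+00EB
decomposed = "noe\u0308l"   # "e" followed by U+0308 COMBINING DIAERESIS


def approx_grapheme_count(text):
    """Count code points that are not combining marks (rough approximation)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(len(precomposed))                   # 4 code points
print(len(decomposed))                    # 5 code points
print(approx_grapheme_count(decomposed))  # 4

# NFC composes "e" + U+0308 back into U+00EB, NFD does the reverse
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```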

If a script doesn't have a precomposed equivalent, then the grapheme will
always be in the same decomposed form and collation will work. Substring
search will also work, because the needle will be decomposed in the same
way as the haystack. Some borderline cases are possible, but are they
really practical within the scope of Unicode support in a programming
language?
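
One such borderline case, sketched in Python for illustration: the haystack arrives decomposed while the needle is typed precomposed. The helper `contains_normalized` is hypothetical, and the choice of NFC is an assumption; any single normalization form applied to both sides would do:

```python
import unicodedata


def contains_normalized(haystack, needle, form="NFC"):
    """Substring test after bringing both strings to the same normalization form."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)


haystack = "noe\u0308l"  # decomposed: "e" + U+0308
needle = "\u00eb"        # precomposed "ë", as a French keyboard would produce

print(needle in haystack)                     # False: code points differ
print(contains_normalized(haystack, needle))  # True: both sides normalized to NFC
```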


Any ideas?

P.S. Point about documentation taken.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
