On 14/10/14 21:01, Rowan Collins wrote:

Rowan,

As I've mentioned before, a lot of the time what people actually want to
deal with is "grapheme clusters" - the kind of thing that you'd think of
as a character if you were writing by hand. Most people, if asked the
length of the string "noël", would answer 4, but there may be 5 code
points. (That's not just a case of normalisation choices; most
combinations of letter+diacritic have no single code point, that's why
the combining forms exist.)


Very good point. I'll give another example: is there a substring "s" in the string "Maße"? If it's a case-sensitive search, then there is no such substring, but if it's a case-insensitive search, then "ß" folds into "ss" and the substring "s" appears.
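To make the "Maße" example concrete, here is a minimal sketch in Python. Python is used only because its str.casefold() applies Unicode full case folding; the helper name contains() is made up for illustration and is not a proposal for a PHP API.

```python
def contains(haystack, needle, ignore_case=False):
    """Substring test; with ignore_case, both sides are case-folded first."""
    if ignore_case:
        # Full case folding: "ß" becomes "ss", so "Maße" becomes "masse".
        haystack, needle = haystack.casefold(), needle.casefold()
    return needle in haystack


print(contains("Maße", "s"))                     # case-sensitive search
print(contains("Maße", "s", ignore_case=True))   # case-insensitive search
print("Maße".casefold())                         # the folded haystack
```

The case-sensitive call finds nothing, while the folded haystack contains an "s" that was never typed.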

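This works both ways. For instance, if someone wants to split the string "MASSE" after "ß" in a case-insensitive manner, one approach might be: 1) find the position of "ß", which is +2; 2) split the string at +3, i.e. one character past the match start. The result would be two strings, "MAS" and "SE", with the cut falling in the middle of the "SS" that matched.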
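The two-step split can be sketched like this, again in Python and again with hypothetical helper names. naive_split_after() follows the steps literally; split_after() is one possible alternative that records, for every folded code point, where its source character ends in the original string. It relies on the assumption that folding a string one character at a time gives the same result as folding it whole.

```python
def naive_split_after(haystack, needle):
    """Find needle case-insensitively, then cut len(needle) characters later."""
    pos = haystack.casefold().find(needle.casefold())
    if pos < 0:
        return None
    cut = pos + len(needle)          # "ß" is 1 character, but it matched 2
    return haystack[:cut], haystack[cut:]


def split_after(haystack, needle):
    """Same search, but map the end of the match back to the original string."""
    folded_parts = []
    end_of = []                      # end_of[i]: original index just past the
                                     # character that produced folded[i]
    for i, ch in enumerate(haystack):
        f = ch.casefold()
        folded_parts.append(f)
        end_of.extend([i + 1] * len(f))
    folded = "".join(folded_parts)
    folded_needle = needle.casefold()
    pos = folded.find(folded_needle)
    if pos < 0 or not folded_needle:
        return None
    cut = end_of[pos + len(folded_needle) - 1]
    return haystack[:cut], haystack[cut:]


print(naive_split_after("MASSE", "ß"))   # the split described above
print(split_after("MASSE", "ß"))         # cut placed after the whole match
print(split_after("Maße", "s"))          # match ends inside the folded "ß"
```

In the last call the match covers only half of the folded "ß", so the sketch rounds the cut up to the end of that character; whether that is the right answer is exactly the kind of policy question a language-level API would have to settle.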

Back to combining characters: I like the idea of introducing graphemes, but I think a French person would write the word "noël" using the precomposed character. I'm using the French keyboard at https://translate.google.com/#fr/. To type "ë" you press Shift + "^", then "e", and it produces the precomposed U+00EB.
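For reference, the two spellings of "noël" can be compared with Python's unicodedata module. This is only a sketch of the code point arithmetic; code_points() is a made-up helper.

```python
import unicodedata

precomposed = "no\u00ebl"     # "ë" as the single code point U+00EB
decomposed = "noe\u0308l"     # "e" followed by U+0308 COMBINING DIAERESIS


def code_points(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


print(len(precomposed), code_points(precomposed))
print(len(decomposed), code_points(decomposed))

# Both strings display the same but compare unequal until normalised.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The keyboard input described above corresponds to the first form; the second is what NFD normalisation of it would produce.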

If a script doesn't have a precomposed equivalent, then the grapheme will always be in the same decomposed form and collation will work. Substring search will also work, because the needle will be decomposed in the same way as the haystack. Some borderline cases are possible, but are they really practical within the scope of Unicode support in a programming language?
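A small Python sketch of the search behaviour, with a hypothetical helper find_normalized() that brings needle and haystack to the same normalisation form before searching. The "q" plus combining tilde pair is used as an example of a combination that, as far as I know, has no precomposed code point.

```python
import unicodedata


def find_normalized(haystack, needle, form="NFC"):
    """Offset of needle in the normalised haystack, or -1 if absent."""
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return h.find(n)


decomposed = "noe\u0308l"
q_tilde = "q\u0303"          # stays two code points under NFC

# Mixed forms: a precomposed needle is not found in a decomposed haystack...
print(decomposed.find("\u00eb"))
# ...unless both sides are normalised the same way.
print(find_normalized(decomposed, "\u00eb"))

# No precomposed form exists, so NFC leaves the pair untouched.
print(unicodedata.normalize("NFC", q_tilde) == q_tilde)

# A borderline case: a bare "e" matches inside the decomposed cluster,
# but not once the haystack is composed.
print(find_normalized(decomposed, "e", form="NFD"))
print(find_normalized(decomposed, "e", form="NFC"))
```

The last two lines show one borderline case: whether "e" is a substring of "noël" depends on the normalisation form chosen.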

Any ideas?

P.S. Point about documentation taken.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
