>Good point. That's what I meant by border-line case. Could you possibly
>point me to a specific example of such a false positive? I'm interested
>in well-formed UTF-8 strings. I believe the "noël" test is ill-formed
>UTF-8 and doesn't conform to the shortest-form requirement.

You're confusing two concepts here: well-formed UTF-8 represents any single 
code point with the smallest number of bytes, but it places no requirements 
on which code points are represented. Representing "ë" as two code points 
(e followed by a combining diaeresis) is perfectly valid Unicode, and would 
in fact be required under NFD.
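
To make that concrete, here's a rough Python sketch (standard library only) 
showing that the NFC and NFD spellings of "noël" are both well-formed, 
shortest-form UTF-8, yet contain different numbers of code points:

    import unicodedata

    nfc = unicodedata.normalize("NFC", "noël")  # ë as one code point, U+00EB
    nfd = unicodedata.normalize("NFD", "noël")  # e (U+0065) followed by U+0308

    # Both survive a strict UTF-8 round trip -- neither is ill-formed.
    assert nfc.encode("utf-8").decode("utf-8") == nfc
    assert nfd.encode("utf-8").decode("utf-8") == nfd

    print(len(nfc))  # 4 code points
    print(len(nfd))  # 5 code points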

That "most" input sources would prefer the combined form seems like a weak 
assumption to base a library on; it only takes one popular third-party to 
routinely return data in NFD for the problems to start showing up.
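
For example (a hypothetical sketch, not anything from your code): an index 
built from NFC input silently stops matching as soon as one source hands 
back NFD:

    import unicodedata

    # Index built from input that happened to arrive in NFC.
    index = {unicodedata.normalize("NFC", "noël"): 42}

    # The same word coming back from a third-party source in NFD.
    from_elsewhere = unicodedata.normalize("NFD", "noël")

    print(from_elsewhere in index)                                # False
    print(unicodedata.normalize("NFC", from_elsewhere) in index)  # True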

>> It's pretty meaningless to say you support Unicode, but only the easy
>> bits. You might as well just tag each string with one of the pages of
>> ISO-8859.
>>
>
>As far as I'm concerned, the Unicode specification does not require you to 
>implement all annexes, or even support the entire character set, to be 
>conformant. I think there are always trade-offs involved, depending on 
>what is more important for you.

Sure, but there are certain user expectations of what "Unicode support" means. 
Handling Korean characters in a meaningful way would definitely be on 
that list.
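
Hangul is a good illustration of why: a syllable can legitimately arrive 
either precomposed (NFC) or as a sequence of conjoining jamo (NFD), and a 
code-point-based view treats those as different strings even though they 
render identically. A quick Python sketch:

    import unicodedata

    syllable = "한"                                # precomposed, U+D55C
    jamo = unicodedata.normalize("NFD", syllable)  # U+1112 U+1161 U+11AB

    print(len(syllable), len(jamo))  # 1 vs 3 code points
    print(syllable == jamo)          # False, despite identical rendering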

As I said at the top of my first post, the important thing is to capture what 
those requirements actually are. Just as you'd choose what array functions were 
needed if you were adding "array support" to a language.

To put it a different way, in what situation would you actively want to know 
the number of code points in a string, rather than either the number of bytes 
in its UTF-8 representation, or the number of graphemes?
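
For what it's worth, the three counts already diverge on a short string. A 
rough sketch of the comparison -- the grapheme count below leans on the 
third-party regex package (an assumption about tooling on my end, since the 
standard library has no grapheme-cluster iterator):

    import unicodedata
    import regex  # third-party; r"\X" matches one grapheme cluster

    s = unicodedata.normalize("NFD", "noël")

    print(len(s.encode("utf-8")))        # 6 bytes of UTF-8
    print(len(s))                        # 5 code points
    print(len(regex.findall(r"\X", s)))  # 4 graphemes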
