On Thu, Jan 21, 2010 at 8:40 AM, Corinna Vinschen wrote:
> would somebody with Japanese and/or Chinese language background be so
> kind to answer the below two questions?

I have some (outdated) background in I18N and Japanese L10N, though I'm not a native speaker of Japanese or of any Chinese language. So I can't offer native intuition, but I can relay some technical info that might be helpful:

> When comparing strings linguistically (strcoll/wcscoll),
> - are Hiragana and Katakana forms of the same character to be
>   treated as equal or as different?

(Nit: they are not "the same character" in either the technical or traditional sense of "character"; they're the same syllable, but represented by different characters.)

From the Unicode point of view, they are distinct; there is no defined equivalence, either canonical or compatibility, between corresponding Katakana and Hiragana syllables. The collation algorithm (which does take linguistic context into account) doesn't seem to say anything about such comparisons, though it's possible I missed something. But as a precedent which might be helpful, I note that with linguistic sensitivity active, Oracle 10g does compare Hiragana and Katakana forms of the same syllable as equal.

> - are half-width and full-width forms of the same CJK character
>   treated as equal or as different?

Under Unicode compatibility normalization (NFKC/NFKD), half-width and full-width forms normalize to the same character, so they should be treated as equivalent. Canonical normalization (NFC/NFD) leaves them distinct. From the point of view of Unicode, there is no semantic difference, and the width property is informative, not normative. It's primarily encoded in Unicode to preserve round-trip compatibility with other standards, though it's also helpful as a hint to rendering algorithms.

-- 
Mark J. Reed <markjr...@gmail.com>
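
P.S. For anyone who wants to check the normalization points above, here is a minimal sketch using Python's standard unicodedata module. The helper name equivalent() and the sample characters are my own choices for illustration. It only exercises normalization, not collation, so it says nothing about what strcoll/wcscoll should do in a given locale.

```python
import unicodedata

def equivalent(a: str, b: str, form: str) -> bool:
    """Return True if a and b are identical after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Sample characters (chosen for illustration only).
HIRAGANA_KA = "\u304b"    # HIRAGANA LETTER KA
KATAKANA_KA = "\u30ab"    # KATAKANA LETTER KA (full-width)
HALFWIDTH_KA = "\uff76"   # HALFWIDTH KATAKANA LETTER KA
FULLWIDTH_A = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A

# Columns: hiragana vs katakana, half-width vs full-width kana,
# full-width vs ASCII Latin letter.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form,
          equivalent(HIRAGANA_KA, KATAKANA_KA, form),
          equivalent(HALFWIDTH_KA, KATAKANA_KA, form),
          equivalent(FULLWIDTH_A, "A", form))
```

The first column should be False for every form (no normalization form maps Hiragana to Katakana), while the width columns should be True only for the two compatibility forms, NFKC and NFKD.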