On Sat, Nov 7, 2009 at 11:01 AM, Alastair Houghton <alast...@alastairs-place.net> wrote:
> On 7 Nov 2009, at 14:17, Ryan Homer wrote:
>> On 2009-11-06, at 12:42 PM, Clark Cox wrote:
>>> Is "ü" a single character, or two characters?
>>
>> When you define a string using ü, isn't it stored internally as one
>> UTF-16 code unit (not sure if I'm using the notation correctly),
>> represented as U+00FC (which is one code unit),
>
> No. It could be either U+00FC or the decomposed form U+0075 U+0308. It
> depends how it has been entered (wherever you enter it). This,
> incidentally, is one reason that it isn't trivial for the compiler to
> support character encodings; if your character encoding was ISO-8859-1
> (ISO Latin 1) and you entered L"ü" (or @"ü") or similar, should that be
> represented by the precomposed sequence, or the decomposed sequence? And
> how about if you convert your source code to some other form where the
> accent is necessarily represented by a combining character?

To be clear, your example isn't really a compiler issue; it's an instance of the general problem of transliteration between different character-set encodings. The compiler (read: C99 / gcc) splits the problem into two areas: the 'source character set' and the 'execution character set'. As a rough rule of thumb, gcc requires the source character set to be ASCII / UTF-8. When character-set conversions are required, gcc uses iconv, which performs conversions via Unicode. Though not a requirement by any means, most of these issues will in practice be dealt with using the Unicode standards. To that end, there are two Unicode standards that are particularly relevant:

http://www.unicode.org/reports/tr15/ Unicode Normalization Forms
http://www.unicode.org/reports/tr22/ Unicode Character Mapping Markup Language

In particular, http://unicode.org/reports/tr15/#Legacy_Encodings says: "If transcoders are implemented for legacy character sets, it is recommended that the result be in Normalization Form C where possible."
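The effect of Normalization Form C on the "ü" example above is easy to demonstrate. A minimal sketch, using Python's standard `unicodedata` module for illustration (the list's context is Objective-C, but the normalization algorithm is the same TR15 algorithm regardless of language):

```python
import unicodedata

# "ü" in its two canonically equivalent forms, as discussed above:
precomposed = "\u00fc"    # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"    # U+0075 + U+0308 COMBINING DIAERESIS

# A naive binary comparison sees two different strings of different lengths...
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 1 2

# ...but after normalizing both to Form C, they compare equal,
# because NFC recomposes U+0075 U+0308 into U+00FC.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)    # True
print(len(nfc_b))        # 1
```

In Cocoa terms, normalizing to NFC here corresponds to calling -precomposedStringWithCanonicalMapping on both strings before comparing them.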
Normalization Form C (or NFC) is defined as "Canonical Decomposition, followed by Canonical Composition". Although in no way guaranteed, it's a pretty safe bet that the end result of such transliterations will be the precomposed sequence.

From http://unicode.org/reports/tr15/#Norm_Forms: "Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence."

> You can only really guarantee that you have one or other form by asking
> for a particular canonical form; NSString has methods for that (e.g.
> -precomposedStringWithCanonicalMapping), but of course not all composed
> character sequences can be represented with precomposed characters in
> any case, and there's still the issue of surrogates, so this wouldn't
> really solve your problem.

From the -precomposedStringWithCanonicalMapping documentation: "A string made by normalizing the receiver's contents using the Unicode Normalization Form C."

This thread is deep enough at this point that the context isn't entirely clear, but it would seem that -precomposedStringWithCanonicalMapping should "solve [the] problem", since it is specifically designed, per the Unicode documentation, to produce strings for which "a binary comparison of the transformed strings will then determine equivalence." This, of course, assumes that both strings have been converted with -precomposedStringWithCanonicalMapping.

_______________________________________________
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)