On Sat, Nov 7, 2009 at 11:01 AM, Alastair Houghton <
alast...@alastairs-place.net> wrote:

> On 7 Nov 2009, at 14:17, Ryan Homer wrote:
>
>  On 2009-11-06, at 12:42 PM, Clark Cox wrote:
>>
>>  Is "ü" a single character, or two characters?
>>>
>>
>> When you define a string using ü, isn't it stored internally as one UTF-16
>> code unit (not sure if I'm using the notation correctly), represented as
>> U+00FC (which is one code unit),
>>
>
> No.  It could be either U+00FC or the decomposed form U+0075 U+0308.  It
> depends how it has been entered (wherever you enter it).  This,
> incidentally, is one reason that it isn't trivial for the compiler to
> support character encodings; if your character encoding was ISO-8859-1 (ISO
> Latin 1) and you entered L"ü" (or @"ü") or similar, should that be
> represented by the precomposed sequence, or the decomposed sequence?  And
> how about if you convert your source code to some other form where the
> accent is necessarily represented by a combining character?
>
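The difference between the two forms is easy to see if you poke at the code points directly. Here's a quick Python sketch (using the standard unicodedata module purely as an illustration; Cocoa's string APIs deal with the same two representations):

```python
import unicodedata

# The precomposed form: a single code point, U+00FC.
precomposed = "\u00fc"
# The decomposed form: 'u' followed by COMBINING DIAERESIS, U+0075 U+0308.
decomposed = "\u0075\u0308"

# Both render as "ü", but a naive binary comparison says they differ.
print(precomposed == decomposed)            # False
print([hex(ord(c)) for c in precomposed])   # ['0xfc']
print([hex(ord(c)) for c in decomposed])    # ['0x75', '0x308']

# The canonical equivalence only becomes visible after normalization.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```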

To be clear, your example isn't really a compiler-related issue; it's an
instance of the general problem of transliteration between different
character set encodings.  The compiler (read: C99 / gcc) splits the problem
into two areas: the 'source character set' and the 'execution character
set'.  As a rough rule of thumb, gcc requires the source character set to be
ASCII or UTF-8. When character set conversions are required, gcc uses
iconv, which uses Unicode to perform conversions.

Though by no means a requirement, most of these issues will be dealt with
using the Unicode standards.  To that end, there are two Unicode standards
that are particularly relevant:

http://www.unicode.org/reports/tr15/ Unicode Normalization Forms
http://www.unicode.org/reports/tr22/ Unicode Character Mapping Markup
Language

In particular, http://unicode.org/reports/tr15/#Legacy_Encodings says "If
transcoders are implemented for legacy character sets, it is recommended
that the result be in Normalization Form C where possible."  Normalization
Form C (or NFC) is defined as "Canonical Decomposition, followed by
Canonical Composition".  Although in no way guaranteed, it's a pretty safe
bet that the end result of such a transcoding will be the precomposed
sequence.
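As a small illustration of that definition (again in Python, with unicodedata standing in for whatever transcoder is in play), NFC really is canonical decomposition followed by canonical composition, and for "ü" it yields the precomposed sequence:

```python
import unicodedata

s = "\u00fc"  # precomposed "ü"

# NFD performs the canonical decomposition...
decomposed = unicodedata.normalize("NFD", s)   # -> U+0075 U+0308
# ...and NFC re-applies canonical composition on top of it.
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))   # 2: the 'u' plus the combining diaeresis
print(recomposed == s)   # True: NFC ends at the precomposed sequence
```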

From http://unicode.org/reports/tr15/#Norm_Forms - "Essentially, the Unicode
Normalization Algorithm puts all combining marks in a specified order, and
uses rules for decomposition and composition to transform each string into
one of the Unicode Normalization Forms. A binary comparison of the
transformed strings will then determine equivalence."
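That comparison strategy is easy to sketch. In Python terms (where unicodedata.normalize("NFC", ...) plays the role that -precomposedStringWithCanonicalMapping plays in Cocoa):

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    # Normalize both strings into the same Normalization Form (NFC here),
    # then do a plain binary comparison, as TR15 describes.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonically_equal("\u00fc", "u\u0308"))  # True: both render as "ü"
print("\u00fc" == "u\u0308")                   # False without normalization
```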


> You can only really guarantee that you have one or other form by asking for
> a particular canonical form; NSString has methods for that (e.g.
> -precomposedStringWithCanonicalMapping), but of course not all composed
> character sequences can be represented with precomposed characters in any
> case, and there's still the issue of surrogates, so this wouldn't really
> solve your problem.


From the -precomposedStringWithCanonicalMapping documentation: "A string
made by normalizing the receiver’s contents using the Unicode Normalization
Form C."

This thread is fairly deep at this point, so it's not entirely clear from
context, but it would seem that -precomposedStringWithCanonicalMapping
should "solve [the] problem", since it is specifically designed, per the
Unicode documentation, to ensure that "A binary comparison of the
transformed strings will then determine equivalence."  This, of course,
assumes that both strings have been converted with
-precomposedStringWithCanonicalMapping.
_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
