Re: encoding of file names

Quincey Morris Tue, 24 May 2011 21:11:27 -0700

On May 24, 2011, at 17:33, Ken Thomases wrote:

>> I am sure this becomes more difficult with Arabic, Hebrew and Thai and other 
>> writing systems that have highly composed forms. (not sure if that's the 
>> right term)
> 
> Not really.


There *is* another level, described briefly here:

        
http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html

As I understand things, there are at least 3 levels, informally at least:

1. Codepoints. Each Unicode codepoint is represented by one or more 8-, 16- or 
32-bit values, depending on the encoding (UTF-8, UTF-16, UTF-32). (Unicode's 
own term for these individual values is "code units"; I call them "components" 
here.)

2. Characters. Some Unicode characters consist of a base codepoint and one or 
more combining marks (accents). Some characters may be representable as either 
a single codepoint (precomposed) or multiple codepoints (decomposed), and there 
are various normalization rule sets that specify the order and composition for 
various contexts.

3. Grapheme clusters. Some written units in some languages (such as Arabic, 
Hebrew and Thai) are made up of multiple characters.

This means that, in general, a single grapheme cluster may consist of a 
variable number of characters, which may each consist of a variable number of 
Unicode codepoints, which may each consist of a variable number of components.
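
A rough sketch of levels 1 and 2, in Python rather than Cocoa only because it 
is easy to run. The sample string is an assumption chosen for illustration: a 
decomposed "é" followed by an emoji that lies outside the 16-bit range.

```python
import unicodedata

# Sample text (assumed for illustration): "e" + COMBINING ACUTE ACCENT,
# then U+1F600, a codepoint that does not fit in 16 bits.
text = "e\u0301\U0001F600"

# Level 1: the codepoints, and how many components each encoding needs.
codepoints = [ord(c) for c in text]
utf8_components = len(text.encode("utf-8"))
utf16_components = len(text.encode("utf-16-le")) // 2
utf32_components = len(text.encode("utf-32-le")) // 4

# Level 2: the first two codepoints form one character, which also has a
# precomposed single-codepoint form (U+00E9) under NFC normalization.
composed = unicodedata.normalize("NFC", text)

print(len(codepoints), utf8_components, utf16_components, utf32_components)
print(len(text), len(composed))
```

The same text is 3 codepoints, but 7, 4 or 3 components depending on the 
encoding, and 3 or 2 codepoints depending on normalization. Level 3 is not 
shown because Python's standard library has no grapheme cluster segmentation.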

Within Cocoa, the "native" string capabilities happen to be implemented in 
terms of 16-bit components whose type is 'unichar'. (Specifically, 'unichar' is 
*not* a Unicode character type, nor even a Unicode codepoint type. It's a raw 
component value. This is in spite of the fact that NSString methods that access 
these components refer to them, incorrectly, as "characters".)
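
To make the distinction concrete, here is a small Python sketch (not Cocoa 
code) that lists the 16-bit values a UTF-16 string is made of. The helper name 
'utf16_components' is hypothetical; it only mimics what a sequence of 'unichar' 
values would contain.

```python
def utf16_components(s):
    """Return the 16-bit UTF-16 values of s, analogous to unichar values."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

letter = "A"            # one codepoint inside the 16-bit range
clef = "\U0001D11E"     # one codepoint outside it (MUSICAL SYMBOL G CLEF)

print([hex(u) for u in utf16_components(letter)])
print([hex(u) for u in utf16_components(clef)])
```

The single codepoint U+1D11E comes out as two 16-bit values (a surrogate 
pair), neither of which is a character or a codepoint on its own.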

In class NSString, though, except when you specifically access individual 
components or use methods and options specifically relating to composition, 
strings are treated as *character* sequences, meaning that composition and 
normalization are generally handled transparently.

NSString only deals with grapheme clusters in a limited way 
('rangeOfComposedCharacterSequence...'). For more sophisticated capabilities, 
you need to move up to the Text system.

The document I linked to above also talks about a fourth level, which is 
related to text transformations such as upper- and lower-casing, which add 
another level of length variability in representation (the number of grapheme 
clusters in the upper- and lower-case representations of the same text may be 
different).
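
A quick illustration of that length change, sketched in Python (not Cocoa). 
The two sample words are assumptions picked because their upper-case forms are 
longer than the originals.

```python
# Samples (assumed for illustration): German sharp s, and the "fi" ligature.
samples = ["stra\u00dfe", "\ufb01le"]

# Pair each original length with the length of its upper-case form.
lengths = [(len(s), len(s.upper())) for s in samples]

for s, (before, after) in zip(samples, lengths):
    print(s, "->", s.upper(), before, after)
```

Here "straße" becomes "STRASSE" and "ﬁle" becomes "FILE", so each grows by one 
when upper-cased.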

AFAIK the file system operates at level 2, which means that composition and 
normalization are *not* significant in file name comparisons, though file 
names *are* stored with a canonical composition and normalization.
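
As a toy model of that behavior (Python, not the real file system), names can 
be normalized once on the way in and once on lookup. Using NFD as the 
canonical form is an assumption here, and 'canonical', 'create' and 'lookup' 
are hypothetical helpers; case-insensitivity is ignored.

```python
import unicodedata

def canonical(name):
    # Assumption: NFD stands in for the file system's own canonical form.
    return unicodedata.normalize("NFD", name)

# A toy "directory": names are stored in canonical form when created.
directory = {}

def create(name, contents):
    directory[canonical(name)] = contents

def lookup(name):
    return directory.get(canonical(name))

create("caf\u00e9.txt", "menu")      # created with precomposed é
found = lookup("cafe\u0301.txt")     # looked up with e + combining accent
print(found)
```

The lookup succeeds even though the two spellings differ codepoint by 
codepoint, and the stored name is the decomposed one.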

Ken, is that a correct statement of how it works?

> You just need to be aware of the semantics of the operations you're 
> performing so you can pick the right one -- i.e. isEqual: and 
> isEqualToString: perform literal comparison, while -compare: does not, and 
> the -compare:options:... methods let you choose that as well as 
> case-sensitivity, diacritic-sensitivity, and width-sensitivity.

And "literal" means component by component. The NSString class reference 
describes 'NSLiteralSearch' like this:

> Exact character-by-character equivalence.


I've always understood this to mean unichar by unichar, i.e. component by 
component, since the NSString documentation generally refers to components as 
"characters".

Here's what the NSString class reference says about 'isEqualToString:':

> The comparison uses the canonical representation of strings, which for a 
> particular string is the length of the string plus the Unicode characters 
> that make up the string. When this method compares two strings, if the 
> individual Unicodes are the same, then the strings are equal, regardless of 
> the backing store. “Literal” when applied to string comparison means that 
> various Unicode decomposition rules are not applied and Unicode characters 
> are individually compared. So, for instance, “Ö” represented as the composed 
> character sequence “O” and umlaut would not compare equal to “Ö” represented 
> as one Unicode character.

This makes absolutely no sense unless the word "character" is here understood to 
mean "component".
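
The "Ö" example from the documentation can be sketched in Python (not Cocoa). 
The functions 'literal_equal' and 'canonically_equal' are hypothetical 
stand-ins for a literal and a non-literal comparison; they are not NSString's 
implementation.

```python
import unicodedata

def components(s):
    """16-bit UTF-16 values of s, analogous to unichar values."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def literal_equal(a, b):
    # Component by component, with no decomposition rules applied.
    return components(a) == components(b)

def canonically_equal(a, b):
    # Compare after bringing both strings to the same normalization form.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

one_codepoint = "\u00d6"     # "Ö" as a single precomposed codepoint
two_codepoints = "O\u0308"   # "O" followed by COMBINING DIAERESIS

print(literal_equal(one_codepoint, two_codepoints))
print(canonically_equal(one_codepoint, two_codepoints))
```

The literal comparison reports the strings as different (one component versus 
two), while the normalized comparison reports them as equal.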

Under this interpretation, NSString has no real codepoint by codepoint 
comparison. However, I believe that each codepoint is represented by a 
*unique* UTF-16 component sequence, so a literal comparison amounts to the same 
thing as a codepoint by codepoint comparison.
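
A small Python check of that belief (not Cocoa, and only a spot check on a few 
assumed sample strings, not a proof): encoding to 16-bit components and back 
should round-trip, and different codepoint sequences should give different 
component sequences.

```python
def components(s):
    """16-bit UTF-16 values of s, analogous to unichar values."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def from_components(units):
    """Rebuild a string from its 16-bit UTF-16 values."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le")

# Samples (assumed): ASCII, precomposed, decomposed, and a surrogate pair.
samples = ["A", "\u00d6", "O\u0308", "\U0001D11E"]

round_trips = all(from_components(components(s)) == s for s in samples)
distinct = len({tuple(components(s)) for s in samples}) == len(samples)

print(round_trips, distinct)
```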

Am I still on track here?



