Re: NSString's handling of Unicode extension B (and C) characters

Douglas Davidson Thu, 05 Nov 2009 11:00:37 -0800


On Nov 5, 2009, at 10:42 AM, Clark Cox wrote:

You don't even have to involve characters outside of the basic
multilingual plane for this to be an issue. Take, for example, the
string "müssen" (i.e. the verb "must" in German). There are two ways
of representing this string, one of which will have a length of 6,
while the other has a length of 7.
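
For illustration, here is a minimal Swift sketch of those two representations. The variable names are mine, not from the thread, and it assumes only Foundation's NSString:

```swift
import Foundation

// Precomposed spelling: U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
let precomposed = NSString(string: "m\u{00FC}ssen")

// Decomposed spelling: "u" followed by U+0308 COMBINING DIAERESIS.
let decomposed = NSString(string: "mu\u{0308}ssen")

// NSString's length counts UTF-16 code units, so the two spellings differ.
print(precomposed.length)  // 6
print(decomposed.length)   // 7

// Canonical precomposition (NFC) folds the decomposed spelling back to 6 units.
let normalized = NSString(string: decomposed.precomposedStringWithCanonicalMapping)
print(normalized.length)   // 6
```

Both strings display as the same word; only the underlying code unit sequence differs.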

Surrogate pairs and combining character sequences are two simple examples of the general principle, which is that characters in a string from a programming perspective don't coincide with user-perceived characters. In most cases, the appropriate concept in Cocoa for dealing with this is the "composed character sequence", and NSString has methods for obtaining and iterating over composed character sequences. Using these methods will usually straighten out most of the issues developers have with this.
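
As a rough sketch of what iterating over composed character sequences can look like, the following Swift code walks a string with rangeOfComposedCharacterSequence(at:). The helper name composedCharacterSequences(in:) and the sample string are hypothetical, chosen only for illustration; the sample combines a combining sequence with a surrogate pair (U+20000, an Extension B ideograph).

```swift
import Foundation

// Hypothetical helper: splits a string into its composed character sequences.
func composedCharacterSequences(in string: NSString) -> [String] {
    var pieces: [String] = []
    var index = 0
    while index < string.length {
        // The range covers the whole sequence containing the code unit at `index`.
        let range = string.rangeOfComposedCharacterSequence(at: index)
        pieces.append(string.substring(with: range))
        // Jump past the sequence rather than advancing one code unit at a time.
        index = range.location + range.length
    }
    return pieces
}

// "m", then "u" + U+0308 COMBINING DIAERESIS, then U+20000 (a surrogate pair in UTF-16).
let sample = NSString(string: "mu\u{0308}\u{20000}")

print(sample.length)                                  // 5 UTF-16 code units
print(composedCharacterSequences(in: sample).count)   // 3 composed character sequences
```

Stepping by the returned range, instead of by a single index, is what keeps a surrogate pair or a base-plus-combining-mark sequence from being split in the middle.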

Here's something I wrote on this subject in a little more depth a while back:

"What you are describing is the notion that Unicode sometimes refers to as a "user-perceived character", which in general can be somewhat ambiguous, since different users may have different perceptions, and since there are writing systems in which character boundaries are not at all similar to those in English. To handle this sort of issue programmatically, Unicode defines what are known as "grapheme clusters", but there is not a single notion of grapheme cluster; there are several such notions, depending on precisely what it is you want.

These issues are covered in detail in Unicode Standard Annex #29, <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>, which gives a number of examples and some algorithms for determining grapheme cluster boundaries. Grapheme clusters are similar to but not quite identical to composed character sequences. For some purposes composed character sequences may be sufficient; NSString gives prominence to the notion of composed character sequence, because that is the most important concept for arbitrary text processing, but if you are really interested in user-perceived characters you may wish to use something else.

The most problematic scripts for this sort of determination include: handwriting-based scripts such as Arabic, in which (depending on the ligatures used in a particular font) character boundaries may not be readily perceptible; composed scripts such as Hangul, in which the script elements are in turn composed of smaller, individually meaningful graphic elements; and scripts involving reordering and combining, such as Devanagari and other Indic or Indic-influenced scripts.

There is still another similar but not quite identical notion, which is used for determining the number and position of insertion points during editing. In Leopard, NSLayoutManager has API support for determining insertion point positions within a line of text as it is laid out. Note that insertion point boundaries are not identical to glyph boundaries; a ligature glyph in some cases, such as an "fi" ligature in Latin script, may require an internal insertion point on a user-perceived character boundary."
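
To make the grapheme cluster notion mentioned above a little more concrete, here is a small sketch using Swift's String type, which postdates this message and is not part of the original discussion. Swift's Character models an extended grapheme cluster, so the same text can be counted at three levels; the variable names are illustrative only.

```swift
// Decomposed "müssen": "u" followed by U+0308 COMBINING DIAERESIS.
let word = "mu\u{0308}ssen"

// U+20000, an ideograph from CJK Unified Ideographs Extension B.
let extB = "\u{20000}"

// UTF-16 code units, Unicode scalar values, extended grapheme clusters.
print(word.utf16.count, word.unicodeScalars.count, word.count)  // 7 7 6
print(extB.utf16.count, extB.unicodeScalars.count, extB.count)  // 2 1 1
```

For these two simple cases the grapheme cluster count matches what a reader would call the number of characters; for the scripts listed above the answer can be less clear-cut.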


Douglas Davidson

_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

