On Nov 5, 2009, at 10:42 AM, Clark Cox wrote:

You don't even have to involve characters outside of the basic
multilingual plane for this to be an issue. Take, for example, the
string "müssen" (i.e. the verb "must" in German). There are two ways
of representing this string, one of which will have a length of 6,
while the other has a length of 7.

Surrogate pairs and combining character sequences are two simple examples of the general principle, which is that characters in a string from a programming perspective don't coincide with user-perceived characters. In most cases, the appropriate concept in Cocoa for dealing with this is the "composed character sequence", and NSString has methods for obtaining and iterating over composed character sequences. Using these methods will straighten out most of the issues developers have with this.
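
To make the two cases above concrete, here is a minimal sketch in Python rather than Cocoa, chosen only because it is easy to run anywhere. The helper name utf16_length is hypothetical; it counts UTF-16 code units, which is the unit NSString's length reports. The G clef character is just one arbitrary example from outside the basic multilingual plane.

```python
import unicodedata


def utf16_length(s):
    """Count UTF-16 code units (hypothetical helper, mirrors what NSString's length reports)."""
    return len(s.encode("utf-16-le")) // 2


# Two canonically equivalent spellings of the same word.
precomposed = unicodedata.normalize("NFC", "m\u00fcssen")  # ü as the single code point U+00FC
decomposed = unicodedata.normalize("NFD", precomposed)     # u followed by U+0308 COMBINING DIAERESIS

print(utf16_length(precomposed))  # 6
print(utf16_length(decomposed))   # 7

# A character outside the BMP needs a surrogate pair in UTF-16.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
print(len(clef))           # 1 code point
print(utf16_length(clef))  # 2 UTF-16 code units
```

Both spellings display as the same six user-perceived characters, so a length measured in code units is not a reliable count of what the user sees.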

Here's something I wrote on this subject in a little more depth a while back:

"What you are describing is the notion that Unicode sometimes refers to as a "user-perceived character", which in general can be somewhat ambiguous, since different users may have different perceptions, and since there are writing systems in which character boundaries are not at all similar to those in English. To handle this sort of issue programmatically, Unicode defines what are known as "grapheme clusters", but there is not a single notion of grapheme cluster; there are several such notions, depending on precisely what it is you want.

These issues are covered in detail in Unicode Standard Annex #29, <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>, which gives a number of examples and some algorithms for determining grapheme cluster boundaries. Grapheme clusters are similar to but not quite identical to composed character sequences. For some purposes composed character sequences may be sufficient; NSString gives prominence to the notion of composed character sequence, because that is the most important concept for arbitrary text processing, but if you are really interested in user-perceived characters you may wish to use something else.

The most problematic scripts for this sort of determination include: handwriting-based scripts such as Arabic, in which (depending on the ligatures used in a particular font) character boundaries may not be readily perceptible; composed scripts such as Hangul, in which the script elements are in turn composed of smaller, individually meaningful graphic elements; and scripts involving reordering and combining, such as Devanagari and other Indic or Indic-influenced scripts.

There is still another similar but not quite identical notion, which is used for determining the number and position of insertion points during editing. In Leopard, NSLayoutManager has API support for determining insertion point positions within a line of text as it is laid out. Note that insertion point boundaries are not identical to glyph boundaries; a ligature glyph in some cases, such as an "fi" ligature in Latin script, may require an internal insertion point on a user-perceived character boundary."
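
As a rough illustration of the idea of grouping code points into larger units, the following Python sketch attaches each combining mark to the base character before it. This is a deliberately simplified stand-in: it is neither NSString's composed character sequence logic nor the Annex #29 grapheme cluster algorithm, and it ignores Hangul jamo, joiners, and the other cases those definitions cover. The function name simple_clusters is hypothetical.

```python
import unicodedata


def simple_clusters(s):
    """Group each base character with the combining marks that follow it.

    Simplifying assumption: any code point whose general category starts
    with "M" (Mn, Mc, Me) is attached to the preceding cluster. Real
    grapheme cluster rules are considerably more involved.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


decomposed = "mu\u0308ssen"  # "müssen" with u + U+0308 COMBINING DIAERESIS
print(len(decomposed))                   # 7 code points
print(len(simple_clusters(decomposed)))  # 6 clusters
```

Even this crude grouping gives the same count for the precomposed and decomposed spellings of "müssen", which is the property that iterating by composed character sequence is meant to provide.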

Douglas Davidson
