Hi Douglas and Peter,

On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:
On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:

I need to be able to display the number of characters to the user in a way that makes sense to them. If they see 3, I should report 3. I also need to cut off certain input at a given number of "real" characters, and I should not generate results that only make sense for a language like English, where each 16 bits equals a single character.

What you are describing is the notion that Unicode sometimes refers to as a "user-perceived character", which in general can be somewhat ambiguous, since different users may have different perceptions, and since there are writing systems in which character boundaries are not at all similar to those in English. To handle this sort of issue programmatically, Unicode defines what are known as "grapheme clusters", but there is not a single notion of grapheme cluster; there are several such notions, depending on precisely what it is you want.

These issues are covered in detail in Unicode Standard Annex #29, <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>, which gives a number of examples and some algorithms for determining grapheme cluster boundaries. Grapheme clusters are similar to but not quite identical to composed character sequences. For some purposes composed character sequences may be sufficient; NSString gives prominence to the notion of composed character sequence, because that is the most important concept for arbitrary text processing, but if you are really interested in user-perceived characters you may wish to use something else.

Thanks for your clarification. It is indeed the "grapheme clusters" that I am after. I need to be able to do things such as capitalize the first letter of a string and, in statistical text analysis, determine the number of "characters" in a text string. This description from the URL you pointed at fits my use quite well: "Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text." Counting glyphs is not appropriate in this case, because in text analysis the text itself is not displayed, and neither is [aString length], because it just reports the number of UTF-16 code units. I realize there is no perfect approach, but I am just trying to do something that brings me closest to what a user would expect.
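To make the distinction concrete, here is a rough sketch in Python (deliberately not Cocoa code) of the three counts I am talking about: UTF-16 code units, code points, and an approximation of user-perceived characters. The helper names are made up for illustration, and the grouping rule (attach every combining mark to the preceding base character) is only a simplification of the UAX #29 rules, not the full algorithm; it ignores Hangul jamo, ZWJ sequences, regional indicators and so on.

```python
import unicodedata


def utf16_length(text):
    """Number of UTF-16 code units, i.e. what [aString length] reports."""
    return len(text.encode("utf-16-le")) // 2


def approximate_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: this is a simplification, not the full UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate_clusters(text, limit):
    """Cut text off after `limit` approximate clusters, never inside one."""
    return "".join(approximate_clusters(text)[:limit])


# "e" + combining acute accent, a musical G clef (outside the BMP), and "a".
sample = "e\u0301\U0001D11Ea"

print(utf16_length(sample))               # 5 code units
print(len(sample))                        # 4 code points
print(len(approximate_clusters(sample)))  # 3 user-perceived characters
```

A user looking at this sample sees three characters, so 3 is the number I want to report, and cutting the string off at two "characters" should keep the accent with its base letter and both halves of the surrogate pair together.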

Peter confirmed earlier that CFStringGetRangeOfComposedCharactersAtIndex would be the way to go for me. But having read Douglas' comment, I am beginning to wonder whether this is the equivalent of UCFindTextBreak's kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past I used UCFindTextBreak with kUCTextBreakClusterMask, but unlike NSString, UCFindTextBreak is not available on one of the platforms I need to support. So what would be the right way to get at the cluster breaks using the NSString API? (Please contact me off list if you need further clarification.)

Cheers,

david.

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
