Re: How to count composed characters in NSString?

David Niemeijer Mon, 29 Sep 2008 21:28:14 -0700

Hi Douglas and Peter,

On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:

On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:
I need to be able to display the number of characters to the user in a way that makes sense to them. If they see 3 I should report 3. I also need it to cut off certain input to the number of "real" characters and should not generate results that only make sense for a language like English where each 16 bits equals a single character.
What you are describing is the notion that Unicode sometimes refers to as a "user-perceived character", which in general can be somewhat ambiguous, since different users may have different perceptions, and since there are writing systems in which character boundaries are not at all similar to those in English. To handle this sort of issue programmatically, Unicode defines what are known as "grapheme clusters", but there is not a single notion of grapheme cluster; there are several such notions, depending on precisely what it is you want.
These issues are covered in detail in Unicode Standard Annex #29, <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>, which gives a number of examples and some algorithms for determining grapheme cluster boundaries. Grapheme clusters are similar to but not quite identical to composed character sequences. For some purposes composed character sequences may be sufficient; NSString gives prominence to the notion of composed character sequence, because that is the most important concept for arbitrary text processing, but if you are really interested in user-perceived characters you may wish to use something else.

Thanks for your clarification. It is indeed the "grapheme clusters" that I am after. I need to be able to do things such as capitalize the first letter of a string and, when doing statistical text analysis, determine the number of "characters" in a text string. This description from the URL you pointed at fits my use quite well: "Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text." Using glyphs is not appropriate in this case, because in text analysis the text itself is not displayed. Nor is using [aString length], because it just reports the number of UTF-16 code units. I realize there is no perfect approach, but I am just trying to do something that brings me closest to what a user would expect.
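To make the distinction concrete, here is a minimal sketch in Python rather than Cocoa, so it does not use the NSString API at all. The function names (utf16_length, approx_clusters, truncate_clusters) are hypothetical, and the clustering rule is a deliberate simplification, not the UAX #29 algorithm: it only attaches combining marks to the preceding code point and ignores Hangul jamo, joiners, and other cases the annex covers.

```python
import unicodedata

# Unicode general categories for combining marks (nonspacing, spacing, enclosing).
MARK_CATEGORIES = ("Mn", "Mc", "Me")


def utf16_length(text):
    """Number of UTF-16 code units, i.e. what [aString length] reports."""
    return len(text.encode("utf-16-le")) // 2


def approx_clusters(text):
    """Split text into rough user-perceived characters.

    Simplifying assumption: a cluster is a base code point followed by
    any number of combining marks. This is NOT full UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in MARK_CATEGORIES:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def truncate_clusters(text, limit):
    """Cut text off after at most `limit` approximate clusters."""
    return "".join(approx_clusters(text)[:limit])
```

For the string "cafe" followed by U+0301 (combining acute accent), utf16_length gives 5 while approx_clusters yields 4 items, and a single character outside the Basic Multilingual Plane such as U+1D11E counts as 2 code units but 1 cluster.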

Peter confirmed earlier that CFStringGetRangeOfComposedCharactersAtIndex would be the way to go for me. But reading Douglas' comment, I am beginning to wonder whether this is the equivalent of UCFindTextBreak's kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past I used UCFindTextBreak with kUCTextBreakClusterMask, but unlike NSString, UCFindTextBreak is not available on one of the platforms I need to support. So what would be the right way to get at the cluster breaks using the NSString API? (Please contact me off list if you need further clarification.)


Cheers,

david.