Re: How to count composed characters in NSString?

Subject: Re: How to count composed characters in NSString?
From: Peter Edberg <email@hidden>
Date: Mon, 29 Sep 2008 22:58:25 -0700


On Sep 29, 2008, at 9:27 PM, David Niemeijer wrote:

> Hi Douglas and Peter,
> On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:
>> On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:
>>> I need to be able to display the number of characters to the user in a way that makes sense to them. If they see 3 I should report 3. I also need it to cut-off certain input to the number of "real" characters and should not generate results that only make sense for a language like English where each 16 bits equals a single character.
>> What you are describing is the notion that Unicode sometimes refers to as a "user-perceived character", which in general can be somewhat ambiguous, since different users may have different perceptions, and since there are writing systems in which character boundaries are not at all similar to those in English. To handle this sort of issue programmatically, Unicode defines what are known as "grapheme clusters", but there is not a single notion of grapheme cluster; there are several such notions, depending on precisely what it is you want.
>>
>> These issues are covered in detail in Unicode Standard Annex #29, <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>, which gives a number of examples and some algorithms for determining grapheme cluster boundaries. Grapheme clusters are similar to but not quite identical to composed character sequences. For some purposes composed character sequences may be sufficient; NSString gives prominence to the notion of composed character sequence, because that is the most important concept for arbitrary text processing, but if you are really interested in user-perceived characters you may wish to use something else.
> Thanks for your clarification. It is indeed the "grapheme clusters" that I am after. I need to be able to do things such as capitalize the first letter of a string and in doing statistical text analysis determine the number of "characters" of a text string. This description from the URL you pointed at fits my use quite well: "Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text." Using glyphs in this case is not appropriate as in text analysis the text itself is not displayed, nor is using [aString length] because it just reports the number of UTF-16 code units. I realize there is no perfect approach, but I am just trying to do something that brings me closest to what a user would expect.
>
> Peter confirmed earlier that CFStringGetRangeOfComposedCharactersAtIndex would be the way to go for me. But, if I read Douglas' comment then I am beginning to wonder whether this is the equivalent of UCFindTextBreak's kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past I used to use UCFindTextBreak with kUCTextBreakClusterMask, but unlike NSString, UCFindTextBreak is not available on one of the platforms I need to support, so what would be the right way to get at the cluster breaks using the NSString API? (Please contact me off list if you need further clarification.)
> Cheers,
> david.

David, CFStringGetRangeOfComposedCharactersAtIndex and -[NSString rangeOfComposedCharacterSequenceAtIndex:] are the modern replacements for UCFindTextBreak with kUCTextBreakClusterMask. Indeed, they are now closer to the original intent of kUCTextBreakClusterMask than the current implementation of kUCTextBreakClusterMask is (since UCFindTextBreak was converted to follow the Unicode/ICU default text segmentation rules).

The modern functions treat all of the following as a cluster:
- A surrogate pair (of course, since it is a single character);
- A base character followed by a sequence of combining marks (whether or not this is something that would be composed under NFC);
- A Hangul syllable expressed as a sequence of conjoining jamo;
- An Indic consonant cluster such as consonant + virama + consonant + vowel matra.

It is this last kind of cluster that is no longer treated as a single entity by UCFindTextBreak with kUCTextBreakClusterMask.
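For the two uses mentioned earlier in the thread (counting "characters" and cutting input off at a given number of them), a minimal sketch built on -[NSString rangeOfComposedCharacterSequenceAtIndex:] might look like the following. The helper names ComposedCharacterCount and PrefixOfComposedCharacters are hypothetical and chosen only for illustration; the sketch assumes that the composed character sequences described above are an acceptable approximation of user-perceived characters for your text.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): counts composed character
// sequences by hopping from the start of one sequence to the start of the next.
static NSUInteger ComposedCharacterCount(NSString *string) {
    NSUInteger count = 0;
    NSUInteger index = 0;
    NSUInteger length = [string length]; // length is in UTF-16 code units

    while (index < length) {
        // Range of the whole sequence that contains the code unit at index.
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(range); // first code unit after this sequence
        count++;
    }
    return count;
}

// Hypothetical helper: returns at most maxClusters composed character
// sequences from the start of the string, never splitting a sequence.
static NSString *PrefixOfComposedCharacters(NSString *string, NSUInteger maxClusters) {
    NSUInteger index = 0;
    NSUInteger taken = 0;
    NSUInteger length = [string length];

    while (index < length && taken < maxClusters) {
        index = NSMaxRange([string rangeOfComposedCharacterSequenceAtIndex:index]);
        taken++;
    }
    return [string substringToIndex:index];
}
```

With these helpers, a string made of "e" followed by U+0301 (combining acute accent) has a length of 2 but counts as one sequence, and truncating it to one sequence keeps the accent attached to its base letter.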

-Peter


