Re: How to count composed characters in NSString?
- Subject: Re: How to count composed characters in NSString?
- From: Peter Edberg <email@hidden>
- Date: Mon, 29 Sep 2008 22:58:25 -0700
On Sep 29, 2008, at 9:27 PM, David Niemeijer wrote:
Hi Douglas and Peter,
On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:
On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:
I need to be able to display the number of characters to the user
in a way that makes sense to them. If they see 3 I should report
3. I also need it to cut off certain input to the number of "real"
characters and should not generate results that only make sense
for a language like English where each 16 bits equals a single
character.
What you are describing is the notion that Unicode sometimes refers
to as a "user-perceived character", which in general can be
somewhat ambiguous, since different users may have different
perceptions, and since there are writing systems in which character
boundaries are not at all similar to those in English. To handle
this sort of issue programmatically, Unicode defines what are known
as "grapheme clusters", but there is not a single notion of
grapheme cluster; there are several such notions, depending on
precisely what it is you want.
These issues are covered in detail in Unicode Standard Annex #29,
<http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
which gives a number of examples and some algorithms for
determining grapheme cluster boundaries. Grapheme clusters are
similar to but not quite identical to composed character
sequences. For some purposes composed character sequences may be
sufficient; NSString gives prominence to the notion of composed
character sequence, because that is the most important concept for
arbitrary text processing, but if you are really interested in
user-perceived characters you may wish to use something else.
Thanks for your clarification. It is indeed the "grapheme clusters"
that I am after. I need to be able to do things such as capitalize
the first letter of a string and, in doing statistical text analysis,
determine the number of "characters" of a text string. This
description from the URL you pointed at fits my use quite well:
"Grapheme cluster boundaries are important for collation, regular
expressions, UI interactions (such as mouse selection, arrow key
movement, backspacing), segmentation for vertical text,
identification of boundaries for first-letter styling, and counting
“character” positions within text." Using glyphs in this case is not
appropriate as in text analysis the text itself is not displayed,
nor is using [aString length] because it just reports the number of
UTF-16 code units. I realize there is no perfect approach, but I am
just trying to do something that brings me closest to what a user
would expect.
Peter confirmed earlier that
CFStringGetRangeOfComposedCharactersAtIndex would be the way to go
for me. But, if I read Douglas' comment then I am beginning to
wonder whether this is the equivalent of UCFindTextBreak's
kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past
I used to use UCFindTextBreak with kUCTextBreakClusterMask, but
unlike NSString, UCFindTextBreak is not available on one of the
platforms I need to support, so what would be the right way to get
at the cluster breaks using the NSString API? (Please contact me off
list if you need further clarification.)
Cheers,
david.
David,
CFStringGetRangeOfComposedCharactersAtIndex and -[NSString
rangeOfComposedCharacterSequenceAtIndex:] are the modern replacements
for UCFindTextBreak with kUCTextBreakClusterMask and indeed they now
are closer to the original intent of kUCTextBreakClusterMask than the
current implementation of kUCTextBreakClusterMask is (since
UCFindTextBreak was converted to follow Unicode/ICU default text
segmentation rules).
The modern functions treat all of the following as a cluster:
- A surrogate pair (of course, since it is a single character);
- A base character followed by a sequence of combining marks (whether
or not this is something that would be composed under NFC);
- A Hangul syllable expressed as a sequence of conjoining jamo;
- An Indic consonant cluster such as consonant + virama + consonant +
vowel matra. It is this last cluster that is no longer treated as a
single entity by UCFindTextBreak with kUCTextBreakClusterMask.
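
As a rough illustration of the first two cases in the list above, here is a minimal sketch in portable C that counts clusters in a UTF-16 buffer. It is not how CFString or NSString implement this. The function name count_clusters_simplified and its helpers are made up for the example, and it assumes that the only combining marks are those in the Combining Diacritical Marks block (U+0300..U+036F). It does not handle conjoining jamo or Indic clusters at all. In real code you would instead step through the string with CFStringGetRangeOfComposedCharactersAtIndex or -[NSString rangeOfComposedCharacterSequenceAtIndex:], advancing by the length of each returned range and counting the ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* True if u is the first (high) half of a UTF-16 surrogate pair. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* True if u is the second (low) half of a UTF-16 surrogate pair. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* ASSUMPTION: only U+0300..U+036F (Combining Diacritical Marks) is treated
 * as combining. Real Unicode data covers many more ranges. */
static int is_combining_mark(uint16_t u) { return u >= 0x0300 && u <= 0x036F; }

/* Hypothetical helper: counts simplified clusters in s[0..len).
 * A cluster here is one code point (a surrogate pair counts as one)
 * followed by zero or more combining marks from the range above. */
size_t count_clusters_simplified(const uint16_t *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    while (i < len) {
        /* Consume one code point: two units for a well-formed surrogate
         * pair, otherwise one unit (including an unpaired surrogate). */
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1]))
            i += 2;
        else
            i += 1;

        /* Absorb any combining marks that follow the base. */
        while (i < len && is_combining_mark(s[i]))
            i += 1;

        count += 1;
    }
    return count;
}
```

For the five code units 0x0065 0x0301 0xD834 0xDD1E 0x0061 ("e" plus a combining acute accent, one surrogate pair, then "a"), this returns 3, whereas the UTF-16 length is 5.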
-Peter
_______________________________________________
Cocoa-dev mailing list (email@hidden)