Re: How to count composed characters in NSString?
- Subject: Re: How to count composed characters in NSString?
- From: David Niemeijer <email@hidden>
- Date: Tue, 30 Sep 2008 06:27:55 +0200
Hi Douglas and Peter,
On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:
> On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:
>> I need to be able to display the number of characters to the user
>> in a way that makes sense to them. If they see 3 I should report 3.
>> I also need it to cut-off certain input to the number of "real"
>> characters and should not generate results that only make sense for
>> a language like English where each 16 bits equals a single character.
>
> What you are describing is the notion that Unicode sometimes refers
> to as a "user-perceived character", which in general can be somewhat
> ambiguous, since different users may have different perceptions, and
> since there are writing systems in which character boundaries are
> not at all similar to those in English. To handle this sort of
> issue programmatically, Unicode defines what are known as "grapheme
> clusters", but there is not a single notion of grapheme cluster;
> there are several such notions, depending on precisely what it is
> you want.
>
> These issues are covered in detail in Unicode Standard Annex #29,
> <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
> which gives a number of examples and some algorithms for
> determining grapheme cluster boundaries. Grapheme clusters are
> similar to but not quite identical to composed character sequences.
> For some purposes composed character sequences may be sufficient;
> NSString gives prominence to the notion of composed character
> sequence, because that is the most important concept for arbitrary
> text processing, but if you are really interested in user-perceived
> characters you may wish to use something else.
Thanks for your clarification. It is indeed the "grapheme clusters"
that I am after. I need to be able to do things such as capitalize the
first letter of a string and, when doing statistical text analysis,
determine the number of "characters" in a text string. This
description from the URL you pointed to fits my use quite well:
"Grapheme cluster boundaries are important for collation, regular
expressions, UI interactions (such as mouse selection, arrow key
movement, backspacing), segmentation for vertical text, identification
of boundaries for first-letter styling, and counting “character”
positions within text." Using glyphs is not appropriate in this case,
because in text analysis the text itself is not displayed. Nor is
[aString length] appropriate, because it only reports the number of
UTF-16 code units. I realize there is no perfect approach; I am just
trying to do something that comes closest to what a user would expect.
Peter confirmed earlier that
CFStringGetRangeOfComposedCharactersAtIndex would be the way to go for
me. But if I read Douglas's comment correctly, I am beginning to wonder
whether this is the equivalent of UCFindTextBreak's
kUCTextBreakCharMask rather than of kUCTextBreakClusterMask. In the
past I used UCFindTextBreak with kUCTextBreakClusterMask, but unlike
NSString, UCFindTextBreak is not available on one of the platforms I
need to support. So what would be the right way to get at the cluster
breaks using the NSString API? (Please contact me off list if you need
further clarification.)
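For reference, a minimal sketch of the composed-character approach
mentioned above. It assumes -[NSString rangeOfComposedCharacterSequenceAtIndex:]
as the NSString counterpart of CFStringGetRangeOfComposedCharactersAtIndex.
The function name ComposedCharacterCount is made up for illustration, and
whether composed character sequences line up with the grapheme clusters of
UAX #29 for every script is exactly the open question here:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of any Apple API): counts composed
// character sequences instead of UTF-16 code units.
static NSUInteger ComposedCharacterCount(NSString *string)
{
    NSUInteger count = 0;
    NSUInteger index = 0;
    NSUInteger length = [string length];   // length in UTF-16 code units

    while (index < length) {
        // Range of the composed character sequence that contains `index`.
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        // Jump past the whole sequence (base character plus any
        // combining marks, or both halves of a surrogate pair).
        index = NSMaxRange(range);
        count++;
    }
    return count;
}
```

With this, a string consisting of "e" followed by U+0301 (combining acute
accent) should count as one composed character, whereas [aString length]
reports 2 for the same string.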
Cheers,
david.