Re: How to count composed characters in NSString?
- Subject: Re: How to count composed characters in NSString?
- From: Douglas Davidson <email@hidden>
- Date: Mon, 29 Sep 2008 09:39:14 -0700
On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:
> I need to be able to display the number of characters to the user in
> a way that makes sense to them. If they see 3 I should report 3. I
> also need it to cut-off certain input to the number of "real"
> characters and should not generate results that only make sense for
> a language like English where each 16 bits equals a single character.
What you are describing is the notion that Unicode sometimes refers to
as a "user-perceived character", which in general can be somewhat
ambiguous, since different users may have different perceptions, and
since there are writing systems in which character boundaries are not
at all similar to those in English. To handle this sort of issue
programmatically, Unicode defines what are known as "grapheme
clusters", but there is not a single notion of grapheme cluster; there
are several such notions, depending on precisely what it is you want.
These issues are covered in detail in Unicode Standard Annex #29,
<http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
which gives a number of examples and some algorithms for
determining grapheme cluster boundaries. Grapheme clusters are
similar to but not quite identical to composed character sequences.
For some purposes composed character sequences may be sufficient;
NSString gives prominence to the notion of composed character
sequence, because that is the most important concept for arbitrary
text processing, but if you are really interested in user-perceived
characters you may wish to use something else.
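
For the cases where composed character sequences are sufficient, the
following is a minimal sketch of counting them and of cutting a string
off after a given number of them. It relies on NSString's
-rangeOfComposedCharacterSequenceAtIndex:. The helper names
ComposedCharacterCount and PrefixOfComposedCharacters are hypothetical,
chosen only for this illustration. As noted above, the result is a count
of composed character sequences, which may differ from what a user
perceives as characters in some scripts.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts composed character sequences rather than
// UTF-16 code units (which is what -length reports).
static NSUInteger ComposedCharacterCount(NSString *string) {
    NSUInteger count = 0;
    NSUInteger index = 0;
    NSUInteger length = [string length];
    while (index < length) {
        // Range of the whole sequence (base character plus combining
        // marks, or a surrogate pair) that contains this index.
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(range);
        count++;
    }
    return count;
}

// Hypothetical helper: returns the longest prefix containing at most
// maxCount composed character sequences, never splitting a sequence.
static NSString *PrefixOfComposedCharacters(NSString *string, NSUInteger maxCount) {
    NSUInteger count = 0;
    NSUInteger index = 0;
    NSUInteger length = [string length];
    while (index < length && count < maxCount) {
        index = NSMaxRange([string rangeOfComposedCharacterSequenceAtIndex:index]);
        count++;
    }
    return [string substringToIndex:index];
}
```

For example, for the string @"e\u0301" (an "e" followed by a combining
acute accent), -length is 2, while ComposedCharacterCount gives 1.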
The most problematic scripts for this sort of determination include:
handwriting-based scripts such as Arabic, in which (depending on the
ligatures used in a particular font) character boundaries may not be
readily perceptible; composed scripts such as Hangul, in which the
script elements are in turn composed of smaller, individually
meaningful graphic elements; and scripts involving reordering and
combining, such as Devanagari and other Indic or Indic-influenced
scripts.
There is still another similar but not quite identical notion, which
is used for determining the number and position of insertion points
during editing. In Leopard, NSLayoutManager has API support for
determining insertion point positions within a line of text as it is
laid out. Note that insertion point boundaries are not identical to
glyph boundaries; a ligature glyph in some cases, such as an "fi"
ligature in Latin script, may require an internal insertion point on a
user-perceived character boundary.
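
As a rough sketch of the insertion point approach, the Leopard method in
question is, as far as I can tell,
-getLineFragmentInsertionPointsForCharacterAtIndex:alternatePositions:inDisplayOrder:positions:characterIndexes:.
The helper name InsertionPointCharacterIndexes is hypothetical, and the
code assumes a layout manager that already has a text storage and a text
container attached, and a charIndex that lies within the text. The
buffer is sized on the assumption that a line fragment cannot have more
insertion points than the text has characters, plus one.

```objc
#import <Cocoa/Cocoa.h>
#include <stdlib.h>

// Hypothetical helper: returns the character indexes (as NSNumbers) of
// the insertion points in the line fragment containing charIndex.
static NSArray *InsertionPointCharacterIndexes(NSLayoutManager *layoutManager,
                                               NSUInteger charIndex) {
    // Assumed upper bound: number of characters in the text, plus one.
    NSUInteger capacity = [[layoutManager textStorage] length] + 1;
    NSUInteger *indexes = malloc(capacity * sizeof(NSUInteger));
    if (indexes == NULL) {
        return [NSArray array];
    }

    // Only the character indexes are requested here, so positions is NULL.
    NSUInteger count =
        [layoutManager getLineFragmentInsertionPointsForCharacterAtIndex:charIndex
                                                      alternatePositions:NO
                                                          inDisplayOrder:NO
                                                               positions:NULL
                                                        characterIndexes:indexes];

    NSMutableArray *result = [NSMutableArray arrayWithCapacity:count];
    for (NSUInteger i = 0; i < count; i++) {
        [result addObject:[NSNumber numberWithUnsignedInteger:indexes[i]]];
    }
    free(indexes);
    return result;
}
```

The number of entries returned depends on the font and on how the line
is laid out, so it is best treated as a property of a particular layout
rather than of the string itself.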
Douglas Davidson