Re: NSString's handling of Unicode extension B (and C) characters
- Subject: Re: NSString's handling of Unicode extension B (and C) characters
- From: Douglas Davidson <email@hidden>
- Date: Thu, 5 Nov 2009 10:59:03 -0800
On Nov 5, 2009, at 10:42 AM, Clark Cox wrote:
> You don't even have to involve characters outside of the basic
> multilingual plane for this to be an issue. Take, for example, the
> string "müssen" (i.e. the verb "must" in German). There are two ways
> of representing this string, one of which will have a length of 6,
> while the other has a length of 7.
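
The two representations can be checked directly. This is a minimal sketch in Swift, assuming the two forms meant are the precomposed one (U+00FC) and the decomposed one ("u" followed by U+0308 COMBINING DIAERESIS); the variable names are illustrative only.

```swift
import Foundation

// Precomposed form: "ü" is the single code point U+00FC.
let precomposed = NSString(string: "m\u{00FC}ssen")

// Decomposed form: "u" followed by U+0308 COMBINING DIAERESIS.
let decomposed = NSString(string: "mu\u{0308}ssen")

// NSString.length counts UTF-16 code units, not user-perceived characters.
print(precomposed.length)  // 6
print(decomposed.length)   // 7
```

Both strings display identically, which is why comparing lengths or indexing by "character" can surprise.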
Surrogate pairs and combining character sequences are two simple
examples of the general principle, which is that characters in a
string from a programming perspective (for NSString, UTF-16 code
units) don't coincide with user-perceived characters. In most cases,
the appropriate concept in Cocoa for dealing with this is the
"composed character sequence", and NSString has methods for obtaining
and iterating over composed character sequences. Using these methods
will straighten out most of the issues developers have with this.
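
As a rough illustration of iterating over composed character sequences, here is a sketch in Swift built on NSString's rangeOfComposedCharacterSequence(at:). The helper name composedCharacterSequences(in:) and the sample string are made up for this example; the sample combines an Extension B character (U+20000, a surrogate pair in UTF-16), a decomposed "ü", and a plain "x".

```swift
import Foundation

// Hypothetical helper (not an NSString method): splits a string into its
// composed character sequences by walking it with
// rangeOfComposedCharacterSequence(at:).
func composedCharacterSequences(in string: NSString) -> [String] {
    var pieces: [String] = []
    var index = 0
    while index < string.length {
        // Range of the whole sequence containing the code unit at `index`.
        let range = string.rangeOfComposedCharacterSequence(at: index)
        pieces.append(string.substring(with: range))
        // Jump past the whole sequence, never into the middle of one.
        index = range.location + range.length
    }
    return pieces
}

// U+20000 (2 code units) + "u" + U+0308 (2 code units) + "x" (1 code unit).
let sample = NSString(string: "\u{20000}u\u{0308}x")
let pieces = composedCharacterSequences(in: sample)

print(sample.length)  // 5
print(pieces.count)   // 3
```

Advancing by the returned range rather than by one code unit is the point of the sketch: it keeps a surrogate pair or a base letter plus its combining mark together as one unit.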
Here's something I wrote on this subject in a little more depth a
while back:
"What you are describing is the notion that Unicode sometimes refers
to as a "user-perceived character", which in general can be somewhat
ambiguous, since different users may have different perceptions, and
since there are writing systems in which character boundaries are not
at all similar to those in English. To handle this sort of issue
programmatically, Unicode defines what are known as "grapheme
clusters", but there is not a single notion of grapheme cluster; there
are several such notions, depending on precisely what it is you want.
These issues are covered in detail in Unicode Standard Annex #29,
<http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
which gives a number of examples and some algorithms for
determining grapheme cluster boundaries. Grapheme clusters are
similar to but not quite identical to composed character sequences.
For some purposes composed character sequences may be sufficient;
NSString gives prominence to the notion of composed character
sequence, because that is the most important concept for arbitrary
text processing, but if you are really interested in user-perceived
characters you may wish to use something else.
The most problematic scripts for this sort of determination include:
handwriting-based scripts such as Arabic, in which (depending on the
ligatures used in a particular font) character boundaries may not be
readily perceptible; composed scripts such as Hangul, in which the
script elements are in turn composed of smaller, individually
meaningful graphic elements; and scripts involving reordering and
combining, such as Devanagari and other Indic or Indic-influenced
scripts.
There is still another similar but not quite identical notion, which
is used for determining the number and position of insertion points
during editing. In Leopard, NSLayoutManager has API support for
determining insertion point positions within a line of text as it is
laid out. Note that insertion point boundaries are not identical to
glyph boundaries; a ligature glyph in some cases, such as an "fi"
ligature in Latin script, may require an internal insertion point on a
user-perceived character boundary."
Douglas Davidson