Re: NSString's handling of Unicode extension B (and C) characters
- Subject: Re: NSString's handling of Unicode extension B (and C) characters
- From: Alastair Houghton <email@hidden>
- Date: Fri, 6 Nov 2009 14:27:53 +0000
On 6 Nov 2009, at 13:38, Ryan Homer wrote:
These issues are covered in detail in Unicode Standard Annex #29, <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
which gives a number of examples and some algorithms for
determining grapheme cluster boundaries. Grapheme clusters are
similar to but not quite identical to composed character
sequences. For some purposes composed character sequences may be
sufficient; NSString gives prominence to the notion of composed
character sequence, because that is the most important concept for
arbitrary text processing, but if you are really interested in user-
perceived characters you may wish to use something else.
NSString already treats many composed characters as a grapheme
cluster, such as accented characters.
No, it doesn't. NSString is a container for UTF-16 *code units*, not
code points, and not grapheme clusters, and its methods reflect that
design choice (so -length is the length in UTF-16 code units, and the
indices are indices in UTF-16 code units too). If you want to count
grapheme clusters or composed character sequences, -length will not
give you that number; you will need code that walks the string, for
example with -rangeOfComposedCharacterSequenceAtIndex:.
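
To make the difference between code units and code points concrete, here is a minimal sketch in Python rather than Cocoa. The sample string and the helper name utf16_length are assumptions chosen for illustration; U+20000 is simply the first code point of CJK Extension B, not necessarily the character under discussion.

```python
# Illustration only: compare UTF-16 code units with Unicode code points.
# The sample string is an assumption: 'e' + COMBINING ACUTE ACCENT + U+20000.
sample = "e\u0301\U00020000"


def utf16_length(text):
    """Return the number of UTF-16 code units needed to store text.

    This is the quantity a UTF-16 code unit container reports as its length.
    """
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


code_points = len(sample)          # Python strings count code points
code_units = utf16_length(sample)  # the supplementary character needs two units

print(code_points)  # 3
print(code_units)   # 4
```

A reader would see two characters here, so neither count matches the user-perceived one.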
While Chinese characters exhibit similar properties, being composed
of "parts" from the language perspective, parts which are usually
characters themselves, Unicode has never treated a Chinese character
as a composition of such parts, but rather as a separate character in
its own right. In other words, there is no decomposition of a Chinese
character such as you can do with é, for example. I believe this is
the same for Japanese kanji as well.
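
The decomposition point can be checked with a short sketch, here in Python using the standard unicodedata module. The two sample characters are assumptions picked for illustration; the result shown holds for unified ideographs such as U+4E2D and is not a statement about every CJK-related code point.

```python
import unicodedata

# Sample characters (assumptions for illustration).
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, precomposed
han = "\u4e2d"      # a CJK unified ideograph

# Canonical decomposition (NFD) splits the accented letter into a base
# letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", e_acute)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']

# The ideograph has no canonical decomposition, so NFD leaves it unchanged.
han_unchanged = unicodedata.normalize("NFD", han) == han
print(han_unchanged)  # True
```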
Since NSString contains UTF-16 code units, a character outside the BMP
will be represented by a surrogate pair. The Chinese character you
mention is indeed outside of the BMP.
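
For reference, the surrogate pair arithmetic can be sketched as follows, in Python for brevity. The function name to_surrogate_pair is hypothetical, and U+20000 is used only as an example of a code point outside the BMP.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# Example code point (an assumption): U+20000, the start of CJK Extension B.
high, low = to_surrogate_pair(0x20000)
print(f"{high:04X} {low:04X}")  # D840 DC00
```

These two values are the code units a UTF-16 string stores for that character, which is why it counts as two toward the length.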
Kind regards,
Alastair.
--
http://alastairs-place.net