Re: characters in cocoa
- Subject: Re: characters in cocoa
- From: Douglas Davidson <email@hidden>
- Date: Mon, 10 Sep 2007 10:19:00 -0700
On Sep 10, 2007, at 8:21 AM, Clark Cox wrote:
> Ah, but UTF-16 code units are not characters; the term "UTF-16
> character" is meaningless. For the BMP, there *is* a one-to-one
> correspondence between UTF-16 code units and Unicode code points, but
> this is not true in the general case. Outside of the BMP, it takes two
> UTF-16 code units to represent a single Unicode code point.
We have this terminology problem for historical reasons;
characterAtIndex: antedates the introduction of surrogate pairs.
Whatever the terminology, NSStrings are conceptually UTF-16, and
-length and the related methods reflect that: they count and index
UTF-16 code units. (This is common practice in other frameworks as
well.)
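
To make the code unit versus code point distinction concrete, here is a minimal sketch. It is written in Python rather than Objective-C purely for illustration; the helper name utf16_code_units is made up for this example and is not a Cocoa API. Python's len() counts code points, so the helper encodes to UTF-16 and counts 16-bit units, which is the quantity a UTF-16-based length reports.

```python
def utf16_code_units(s):
    """Count the UTF-16 code units needed to represent s.

    Hypothetical helper for illustration: this is the number a
    UTF-16-based string length would report.
    """
    return len(s.encode("utf-16-le")) // 2


bmp_char = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral_char = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP

# One code point each, but the non-BMP character needs two code units.
bmp_counts = (len(bmp_char), utf16_code_units(bmp_char))
astral_counts = (len(astral_char), utf16_code_units(astral_char))

# Pull out the two halves of the surrogate pair for the non-BMP character.
encoded = astral_char.encode("utf-16-be")
high_surrogate = int.from_bytes(encoded[0:2], "big")
low_surrogate = int.from_bytes(encoded[2:4], "big")

print(bmp_counts)     # (1, 1)
print(astral_counts)  # (1, 2)
print(hex(high_surrogate), hex(low_surrogate))  # 0xd834 0xdd1e
```

Neither half of the pair is meaningful on its own, which is why indexing a UTF-16 string one unit at a time can split a character in two.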
Fortunately, as I mentioned, most developers should not have to worry
about this. If you work with ranges and substrings rather than with
individual characters, and use the NSString methods that deal with
ranges, they should automatically handle not only most issues with
surrogate pairs, but also the more common cases of combining
characters, Hangul, etc.
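
The idea of asking for a range instead of a single character can be sketched as follows. This is again Python for illustration only, and cluster_range is a hypothetical helper, not how NSString implements it. It is a deliberate simplification: it only groups a base character with the combining marks that follow it (using the canonical combining class), so it does not cover conjoining Hangul jamo or the full cluster rules. In Cocoa, the method to reach for is rangeOfComposedCharacterSequenceAtIndex:.

```python
import unicodedata


def cluster_range(s, index):
    """Return (start, end) of the base character plus trailing combining
    marks that contains s[index].

    Simplified illustration: only characters with a nonzero canonical
    combining class are treated as attaching to the preceding base.
    """
    start = index
    # If index landed on a combining mark, walk back to its base character.
    while start > 0 and unicodedata.combining(s[start]) != 0:
        start -= 1
    end = start + 1
    # Extend forward over any combining marks attached to the base.
    while end < len(s) and unicodedata.combining(s[end]) != 0:
        end += 1
    return start, end


text = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT, then 'x'

print(cluster_range(text, 0))  # (0, 2)
print(cluster_range(text, 1))  # (0, 2)
print(cluster_range(text, 2))  # (2, 3)
```

Whichever index inside the accented "e" you start from, you get back the same range, so slicing with that range never separates the accent from its base.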
Chapter 2 of the Unicode 5 book has a very good discussion of "text
elements", which explains in great detail why it is that the elements
that are important for most text processes are in general sequences
of characters rather than single characters. Single characters are
important for the fundamental definitional purposes of the standard,
but in practice what one wishes to deal with for text processing is a
sequence of characters constituting a cluster or larger unit.
Douglas Davidson