Re: characters in cocoa
- Subject: Re: characters in cocoa
- From: "Gerriet M. Denkmann" <email@hidden>
- Date: Mon, 10 Sep 2007 13:38:01 +0200
On 7 Sep 2007, at 21:02, email@hidden wrote:
I would be obliged to hear from the experts what is considered the
most appropriate way to handle characters in Cocoa programming.
Thanks in advance.
We try to discourage developers from working at the level of
individual characters wherever possible, primarily because in Unicode
the individual character is usually not the appropriate level at
which to operate. This is something that's difficult for those of us
who were raised on char *'s to get used to, but it's important to get
right. In Unicode the appropriate object on which to operate for
most semantic purposes is (at least) a character cluster, such as a
base character and its combining marks, or a block of Hangul jamo.
In Cocoa terms this is a range of characters in an NSString; suitable
ranges can be obtained using such methods as
rangeOfComposedCharacterSequenceAtIndex:. This will also cover the
case of surrogate pairs that arises from NSString's use of UTF-16.
NSString/CFString supply a great variety of methods/functions that
operate on character ranges in a Unicode-conformant fashion: the
rangeOfCharacterFromSet:... methods, the rangeOfString: methods, the
compare:... methods, and so forth. They also provide a long list of
Unicode operations, such as casing, normalization, and other
transformations.
Even in apparently simple operations such as casing, the need for
operating on more than a single character is apparent. For example,
in German we have ß->SS on uppercasing, going from one character to
two; when we get to Greek, the complications increase significantly,
and there are many other examples from less prominent languages.
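A minimal sketch of the ß case, assuming Foundation's uppercaseString
applies the full Unicode case mapping. The helper name UppercaseGrowth is
hypothetical and exists only for illustration:

```objectivec
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): reports how many UTF-16
// code units a string gains (or loses, if negative) when uppercased.
static NSInteger UppercaseGrowth(NSString *string) {
    NSString *upper = [string uppercaseString];
    return (NSInteger)[upper length] - (NSInteger)[string length];
}
```

With Apple's Foundation, UppercaseGrowth(@"\u00DF") should come out as 1,
since the single ß becomes the two-character string "SS". Code that
uppercases one unichar at a time cannot express that result.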
The basic recommendation for dealing with characters is to work with
strings, and ranges in strings, and substrings, and as much as
possible to use the NSString methods that deal with these; that lets
the kit handle all of the difficult Unicode issues. For those who
need to do their own low-level processing, and who are willing to
handle Unicode complications themselves, we provide access to UTF-16
directly via characterAtIndex: et al., and to other representations
with getBytes:... and related methods.
This is an excellent summary.
One might add that -[NSString length], which the documentation says
"Returns the number of Unicode characters in the receiver.", does
nothing of the kind: it returns the number of 16-bit code units
(unichars) in the string's UTF-16 representation
(NSUnicodeStringEncoding).
For example: [[NSString stringWithUTF8String: "𐐀"] length] returns 2,
although the string clearly contains one character. (For anyone who
cannot handle Unicode, like the mail digest software at Apple: this is
U+10400, DESERET CAPITAL LETTER LONG I.)
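To make the difference concrete, here is a rough sketch that counts
composed character sequences instead of UTF-16 code units. The function
name ComposedCharacterCount is my own invention, not a Foundation API; it
relies only on rangeOfComposedCharacterSequenceAtIndex:, which was
mentioned above:

```objectivec
#import <Foundation/Foundation.h>

// Hypothetical helper: counts composed character sequences (surrogate
// pairs, base character plus combining marks, ...) instead of the
// UTF-16 code units that -length reports.
static NSUInteger ComposedCharacterCount(NSString *string) {
    NSUInteger count = 0;
    NSUInteger index = 0;
    NSUInteger length = [string length]; // number of UTF-16 code units
    while (index < length) {
        // The returned range covers the whole sequence containing index.
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(range); // jump to the start of the next sequence
        count++;
    }
    return count;
}
```

For the Deseret letter, written as @"\U00010400" to keep the source file
ASCII-only, -length gives 2 while ComposedCharacterCount gives 1. The
same holds for @"e\u0301" (e followed by a combining acute accent).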
And one should also note that characterAtIndex: does not do what its
name indicates: it returns the 16-bit UTF-16 code unit at the index.
getCharacters: "Returns by reference the characters from the
receiver." - the documentation really should mention in which
encoding these characters will be copied.
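A small sketch of what actually comes back, assuming the buffer is
filled with UTF-16 code units as observed above. UTF16CodeUnits is a
hypothetical helper name; getCharacters:range: is the ranged variant of
getCharacters:

```objectivec
#import <Foundation/Foundation.h>
#include <stdlib.h>

// Hypothetical helper: copies what getCharacters:range: delivers into
// an array of NSNumbers so the individual values can be inspected.
static NSArray *UTF16CodeUnits(NSString *string) {
    NSUInteger length = [string length];
    NSMutableArray *units = [NSMutableArray arrayWithCapacity:length];
    if (length == 0) {
        return units;
    }
    unichar *buffer = malloc(length * sizeof(unichar));
    if (buffer == NULL) {
        return nil; // allocation failed
    }
    [string getCharacters:buffer range:NSMakeRange(0, length)];
    for (NSUInteger i = 0; i < length; i++) {
        [units addObject:[NSNumber numberWithUnsignedShort:buffer[i]]];
    }
    free(buffer);
    return units;
}
```

For @"\U00010400" the result holds two values, 0xD801 and 0xDC00: the
high and low surrogate of the pair, neither of which is a character on
its own. [string characterAtIndex:0] returns the same 0xD801.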
Maybe the documentation could be slightly improved: it is confusing
if it says "character" when it means "unsigned short int in a
specific (but unspecified) encoding".
Kind regards,
Gerriet.