Re: characterAtIndex: method and composite characters
- Subject: Re: characterAtIndex: method and composite characters
- From: Deborah Goldsmith <email@hidden>
- Date: Fri, 6 Apr 2007 18:30:46 -0700
On Apr 4, 2007, at 9:42 AM, Douglas Davidson wrote:
On Apr 4, 2007, at 8:05 AM, Ewan Delanoy wrote:
- When an NSString or NSAttributedString (let's call it s) appears
on-screen as, say, "(a with tilde)(other characters ...)", is it
guaranteed that [s characterAtIndex: 0] will be "a with tilde", and
not "a" (with "tilde" as a second character)?
- If this is not the case, I need a more accurate version of
"characterAtIndex:". Is this already built in?
Yes. The characterAtIndex: method should be avoided wherever
possible; with Unicode strings, examining a single character
usually is not sufficient. Instead, use methods like
compare:options:range:, rangeOfString:options:range:, and
rangeOfCharacterFromSet:options:range:, which will give you the
Unicode-conformant operations you are looking for, with a wide
variety of options.
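To illustrate why inspecting a single character is not enough, here is a minimal sketch in Python rather than Cocoa. It uses only Python's standard unicodedata module, and canonically_equal is a hypothetical helper, not an NSString method. It shows two strings that render identically but differ character by character, and compare equal only once canonical equivalence is taken into account.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFD (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e3bc"   # U+00E3 LATIN SMALL LETTER A WITH TILDE, then "bc"
decomposed = "a\u0303bc"   # "a" + U+0303 COMBINING TILDE, then "bc"

print(precomposed == decomposed)                    # False: code points differ
print(precomposed[0] == decomposed[0])              # False: "a with tilde" vs plain "a"
print(canonically_equal(precomposed, decomposed))   # True
```

Looking only at the first element of each string gives different answers for text that looks the same on screen, which is the situation the question describes.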
If you need to extract substrings, be sure to use
rangeOfComposedCharacterSequenceAtIndex: to make sure that you are
not dividing a composed character sequence. If you wish to replace
substrings in a mutable string, try
replaceOccurrencesOfString:withString:options:range:.
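The idea of widening an index to a whole composed character sequence before cutting a substring can be sketched as follows. This is Python, not the NSString API: composed_range_at is a hypothetical function that only treats a base character followed by combining marks as one unit, which is a rough approximation and not full grapheme cluster segmentation. It also indexes by code point, whereas NSString indexes by UTF-16 code unit.

```python
import unicodedata
from typing import Tuple

def _is_mark(ch: str) -> bool:
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")

def composed_range_at(s: str, index: int) -> Tuple[int, int]:
    """Return (start, length) of the base character plus trailing marks
    that contains s[index]. Hypothetical helper, approximation only."""
    start = index
    while start > 0 and _is_mark(s[start]):
        start -= 1                      # back up to the base character
    end = start + 1
    while end < len(s) and _is_mark(s[end]):
        end += 1                        # absorb following combining marks
    return start, end - start

s = "a\u0303bc"                         # "a" + COMBINING TILDE, then "bc"
start, length = composed_range_at(s, 0)
first = s[start:start + length]         # keeps the tilde with its base
naive = s[:1]                           # splits the sequence, losing the tilde
print((start, length), len(first), len(naive))
```

Here the naive one-element slice returns a bare "a", while the widened range keeps the base letter and its combining tilde together.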
NSString does have methods to precompose or decompose an entire
string, but these methods are really useful only in special
circumstances--for example, when you are dealing with existing code
that for some reason requires one form or the other. Bear in mind
that most combinations of base characters and combining marks do
not have precomposed forms. In general, you are better off using
the methods mentioned above for Unicode-conformant comparisons.
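The remark that most base-plus-mark combinations have no precomposed form can be checked with a short sketch. Again this is Python's standard unicodedata module standing in for the concept, not NSString's own precompose and decompose methods. The example strings are chosen for illustration.

```python
import unicodedata

a_tilde = "a\u0303"   # "a" + U+0303 COMBINING TILDE
q_tilde = "q\u0303"   # "q" + U+0303 COMBINING TILDE

nfc_a = unicodedata.normalize("NFC", a_tilde)
nfc_q = unicodedata.normalize("NFC", q_tilde)

print(len(nfc_a))     # 1: composes to U+00E3
print(len(nfc_q))     # 2: Unicode has no precomposed "q with tilde"

# Decomposition goes the other way for the precomposed character.
nfd_a = unicodedata.normalize("NFD", "\u00e3")
print(len(nfd_a))     # 2: back to "a" + combining tilde
```

Precomposing a string therefore does not guarantee one element per visible character, so code that assumes it does will still break on input like the second string.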
In addition to what Doug says, bear in mind that even precomposed
Unicode cannot be accessed one "unichar" at a time. First, there may
still be surrogate pairs (two consecutive UTF-16 code units used to
represent a character outside the Basic Multilingual Plane, that is,
a code point above U+FFFF), and
second, there are some characters that cannot be represented by a
single Unicode code point, even in the canonical precomposed form of
Unicode (NFC == Normalization Form C). This is because Unicode does
not contain a precomposed version of the character in question.
Finally, even if there are no individual characters that require
multiple unichars, some languages have linguistic units consisting
of multiple characters that shouldn't be broken apart.
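The surrogate pair case can be made concrete with a small sketch. This is Python, with a hypothetical helper utf16_code_units that lists the 16-bit values a UTF-16 string would hold. It is an illustration of the encoding, not of any Cocoa call, and the example character is an arbitrary choice.

```python
def utf16_code_units(s: str) -> list:
    """Return the UTF-16 code units of s as integers (hypothetical helper)."""
    data = s.encode("utf-16-be")        # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

clef = "\U0001D11E"                     # MUSICAL SYMBOL G CLEF, above U+FFFF
units = utf16_code_units(clef)

print(len(clef))                        # 1 code point
print([hex(u) for u in units])          # ['0xd834', '0xdd1e']: a surrogate pair
```

A single character here occupies two 16-bit units, so reading just one of them yields half of a pair rather than a character.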
Deborah Goldsmith
Internationalization, Unicode liaison
Apple Inc.
email@hidden