Re: NSString's handling of Unicode extension B (and C) characters
- Subject: Re: NSString's handling of Unicode extension B (and C) characters
- From: Alastair Houghton <email@hidden>
- Date: Sat, 7 Nov 2009 16:01:50 +0000
On 7 Nov 2009, at 14:17, Ryan Homer wrote:
> On 2009-11-06, at 12:42 PM, Clark Cox wrote:
>> Is "ü" a single character, or two characters?
> When you define a string using ü, isn't it stored internally as one
> UTF-16 code unit (not sure if I'm using the notation correctly),
> represented as U+00FC (which is one code unit),
No. It could be either U+00FC or the decomposed form U+0075 U+0308.
It depends how it has been entered (wherever you enter it). This,
incidentally, is one reason that it isn't trivial for the compiler to
support character encodings; if your character encoding was ISO-8859-1
(ISO Latin 1) and you entered L"ü" (or @"ü") or similar, should that
be represented by the precomposed sequence, or the decomposed
sequence? And how about if you convert your source code to some other
form where the accent is necessarily represented by a combining
character?
You can only really guarantee that you have one or other form by
asking for a particular canonical form; NSString has methods for that
(e.g. -precomposedStringWithCanonicalMapping), but of course not all
composed character sequences can be represented with precomposed
characters in any case, and there's still the issue of surrogates, so
this wouldn't really solve your problem.
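
To illustrate the point about canonical forms, here is a minimal sketch (not part of the original message). It assumes a Foundation command-line tool built with a compiler that accepts \u escapes in @"" literals; the variable names are made up for the example.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // Two spellings of "ü": precomposed U+00FC, and "u" followed by
        // the combining diaeresis U+0308.
        NSString *precomposed = @"\u00FC";
        NSString *decomposed  = @"u\u0308";

        // -length counts UTF-16 code units, so these report 1 and 2.
        NSLog(@"lengths: %lu and %lu",
              (unsigned long)[precomposed length],
              (unsigned long)[decomposed length]);

        // -isEqualToString: is a literal comparison, so the two spellings
        // are not equal as they stand.
        NSLog(@"literally equal: %d",
              [precomposed isEqualToString:decomposed]);

        // Normalising both to the same canonical form makes them match.
        NSString *a = [precomposed precomposedStringWithCanonicalMapping];
        NSString *b = [decomposed precomposedStringWithCanonicalMapping];
        NSLog(@"equal after normalising: %d", [a isEqualToString:b]);
    }
    return 0;
}
```

Normalising only helps for sequences that have a precomposed equivalent; as noted above, it does nothing for surrogate pairs or for combinations with no precomposed form.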
>> ... then you can use
>> -rangeOfComposedCharacterSequenceAtIndex: to find the range of
>> indices (representing a single "character") that contain the given
>> index.
> THANKS! This solves my problem.
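
For anyone finding this thread later, a rough sketch of that approach (not code from the original posts) might look like the following. The test string is an assumption chosen for illustration: a decomposed "ü" followed by U+20000, a CJK Extension B ideograph that needs a surrogate pair in UTF-16.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // "u" + U+0308 (two code units), then U+20000 (a surrogate pair,
        // also two code units): four code units, two user-visible characters.
        NSString *s = @"u\u0308\U00020000";

        NSUInteger i = 0;
        while (i < [s length]) {
            // Ask for the whole composed sequence containing index i.
            NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:i];
            NSLog(@"\"%@\" starts at code unit %lu and spans %lu code unit(s)",
                  [s substringWithRange:r],
                  (unsigned long)r.location,
                  (unsigned long)r.length);

            // Step past the whole sequence rather than by one code unit,
            // so a base+accent pair or a surrogate pair is never split.
            i = NSMaxRange(r);
        }
    }
    return 0;
}
```

Advancing by NSMaxRange(r) instead of i + 1 is the important part: indexing by single code units is what splits Extension B characters in half.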
If you don't already have it, it's a good idea if you're going to get
into text processing with Cocoa to grab yourself a copy of the Unicode
book, and maybe (since the Unicode book itself is pretty dry) a
companion such as Richard Gillam's Unicode Demystified:
<http://www.amazon.com/Unicode-Standard-Version-5-0-5th/dp/0321480910>
<http://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522>
(Of course, you can download chapters from the Unicode book from
unicode.org. Personally I like having a hard copy, though it *is* a
huge tome...)
Kind regards,
Alastair.
--
http://alastairs-place.net