Re: NSString's handling of Unicode extension B (and C) characters
- Subject: Re: NSString's handling of Unicode extension B (and C) characters
- From: Clark Cox <email@hidden>
- Date: Sat, 7 Nov 2009 10:59:56 -0800
On Sat, Nov 7, 2009 at 6:17 AM, Ryan Homer <email@hidden> wrote:
> [SOLVED]
>
> On 2009-11-06, at 12:42 PM, Clark Cox wrote:
>
>> On Fri, Nov 6, 2009 at 5:22 AM, Ryan Homer <email@hidden> wrote:
>>>
>>> On 2009-11-05, at 1:42 PM, Clark Cox wrote:
>>>
>>> Yes. I am importing characters from a text file and need to process
>>> them in a certain way. A word may have an alternate form which is
>>> denoted after the word in square brackets. When the alternate form
>>> contains some of the same characters in the same position, they are
>>> represented with a dash. It's more complicated than that in that
>>> there are alternate words and characters that are separated by / and
>>> //.
>>>
>>> Anyway, without getting into more details, the way I'm currently
>>> processing the data depends on the number of characters.
>>
>> Then you need to be very careful how you define "character".
>>
>> Is "ü" a single character, or two characters?
>
> When you define a string using ü, isn't it stored internally as one UTF-16
> code unit (not sure if I'm using the notation correctly), represented as
> U+00FC (which is one code unit), and then only if you decompose the string,
You have no guarantee that it is stored in a precomposed form to begin with.
> you'll have the base character 'u' stored as one unit and the umlaut stored
> as another?
Yes; however, both the decomposed and precomposed forms are equally
valid. You must be prepared to accept both in your data, and you must
be prepared to treat them as the same "character" as far as the user
is concerned.
> In my experience, this is how it has seemed to me.
>
>> Is "킴" a single character, or three characters?
>
> I don't know about Korean characters, but when dealing with Chinese
> characters, we have, for example, 中, which I consider a character, and 心,
> another character and then a combination of the two, 忠. Now, you can't use
> decomposedStringWithCanonicalMapping on 忠 to get 中 and 心. So, when I say
> character in this context, 中, 心 and 忠 are each a single character.
Those hanzi are indeed each represented by a single code point in
Unicode. However, the Korean character "킴" ("Kim") can be stored
either as a single code point or as three of them (in much the same
way as "ü" can be stored as one code point or two). Your code must be
prepared to accept either.
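
To make the one-versus-three point concrete, here is a minimal sketch in
plain C (it also compiles as Objective-C) of the arithmetic Hangul
syllable decomposition defined by the Unicode Standard. The function name
decompose_hangul and the constant names are my own choices for
illustration; in real Cocoa code you would normally let
decomposedStringWithCanonicalMapping do this for you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants of the Unicode Hangul syllable decomposition arithmetic. */
enum {
    S_BASE = 0xAC00, /* first precomposed syllable */
    L_BASE = 0x1100, /* first leading consonant jamo */
    V_BASE = 0x1161, /* first vowel jamo */
    T_BASE = 0x11A7, /* one before the first trailing consonant jamo */
    V_COUNT = 21,
    T_COUNT = 28,
    S_COUNT = 11172  /* number of precomposed syllables */
};

/* Hypothetical helper: writes the 2 or 3 jamo for a precomposed syllable
   into out and returns how many were written. Returns 0 if cp is not a
   precomposed Hangul syllable. */
static int decompose_hangul(uint32_t cp, uint32_t out[3]) {
    assert(out != NULL);
    if (cp < S_BASE || cp >= S_BASE + S_COUNT)
        return 0;
    uint32_t s = cp - S_BASE;
    out[0] = L_BASE + s / (V_COUNT * T_COUNT);             /* leading consonant */
    out[1] = V_BASE + (s % (V_COUNT * T_COUNT)) / T_COUNT; /* vowel */
    uint32_t t = s % T_COUNT;
    if (t == 0)
        return 2;                                          /* no trailing consonant */
    out[2] = T_BASE + t;                                   /* trailing consonant */
    return 3;
}
```

For U+D0B4 ("킴") this yields the three jamo U+110F, U+1175 and U+11B7,
which is the decomposed spelling of the same character.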
> It's just that I've never come across surrogate pairs in dealing with
> Chinese characters until now, now that I have to deal with extension B & C
> characters.
That's the thing about Unicode. None of these things are
script-specific; any code that processes UTF-16 text must be prepared
to encounter combining characters, surrogate pairs, and multiple
different code unit sequences that represent the same canonical
character.
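
As a rough illustration of the surrogate pair part only, here is a small
sketch in plain C (it also compiles as Objective-C) that counts code
points in a UTF-16 buffer. The function names are made up for this
example, and counting an unpaired surrogate as one unit is an assumption
of the sketch. Note that it does not merge combining characters, so it is
not a substitute for rangeOfComposedCharacterSequenceAtIndex:.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if u is a high (leading) surrogate, U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* True if u is a low (trailing) surrogate, U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid surrogate pair into the code point it encodes. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo) {
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Count code points in n UTF-16 code units. A well-formed surrogate pair
   counts once; an unpaired surrogate counts as one (sketch assumption). */
static size_t count_code_points(const uint16_t *units, size_t n) {
    assert(units != NULL || n == 0);
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        if (is_high_surrogate(units[i]) && i + 1 < n &&
            is_low_surrogate(units[i + 1]))
            i += 2; /* one code point stored as two code units */
        else
            i += 1;
        count++;
    }
    return count;
}
```

The string U+7075 followed by U+247E5 is stored as the three code units
0x7075, 0xD851, 0xDFE5, and the sketch counts two code points in it.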
>> ... then you can use
>> -rangeOfComposedCharacterSequenceAtIndex: to find the range of indices
>> (representing a single "character") that contain the given index.
>
> THANKS! This solves my problem.
>
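> NSString *s = @"\u7075\xf0\xa4\x9f\xa5"; // U+7075, then U+247E5 as UTF-8 bytes
> NSUInteger length = 0;
> // Step through the string one composed character sequence at a time.
> for (NSUInteger i = 0; i < s.length; ) {
>     NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:i];
>     length++;
>     i = NSMaxRange(r); // jump past the whole sequence; no extra i++
> }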
>
> The length is the 2 that I need!
Glad to have helped.
--
Clark S. Cox III
email@hidden