Re: NSString's handling of Unicode extension B (and C) characters
- Subject: Re: NSString's handling of Unicode extension B (and C) characters
- From: Clark Cox <email@hidden>
- Date: Fri, 6 Nov 2009 09:42:41 -0800
On Fri, Nov 6, 2009 at 5:22 AM, Ryan Homer <email@hidden> wrote:
> On 2009-11-05, at 1:42 PM, Clark Cox wrote:
>
>> On Thu, Nov 5, 2009 at 8:04 AM, Ryan Homer <email@hidden> wrote:
>>>
>>> Actually,
>>>
>>> That was a bad example since \u only allows up to 4 digits, so the string
>>> was in fact a length of 3 characters, the character '5' being the 3rd.
>>> However, the issue still seems to exist.
>>>
>>> I have the actual characters in a text file and an application that
>>> imports
>>> the data. When the application imports the string with those two
>>> characters,
>>> it returns a length of 3. I will paste the characters directly into the
>>> string constant, though some people might not be able to see them.
>>>
>>> NSString *s = @"灵𤟥";
>>> NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);
>>>
>>> OR
>>>
>>> NSString *s = @"\u7075\xf0\xa4\x9f\xa5";
>>> NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);
>>>
>>> still returns a length of 3.
>>
>> NSString uses UTF-16, so your U+247E5 character is represented by a
>> pair of surrogate code units. In general, you should never expect the
>> length of a string in code units, as a programmer would see it, to match
>> the number of characters in the string as users would see it.
>>
>> You don't even have to involve characters outside of the basic
>> multilingual plane for this to be an issue. Take, for example, the
>> string "müssen" (i.e. the verb "must" in German). There are two ways
>> of representing this string, one of which will have a length of 6,
>> while the other has a length of 7.
>
> Are you referring to the alternate form muessen or the decomposed form?
I'm referring to decomposition.
> While the decomposed form would have a length > 6, the string length of
> @"müssen" is correctly 6 because the umlaut is considered part of the 'u',
> unless decomposed.
And any code that is indexing a Unicode string *must* be prepared to
accept both decomposed and precomposed forms.
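
To make the two forms concrete, here is a minimal sketch. The helper names
PrecomposedLength and DecomposedLength are made up for illustration and are
not part of NSString; the normalization methods they call are.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: UTF-16 length of the precomposed (NFC) form.
static NSUInteger PrecomposedLength(NSString *s) {
    return [[s precomposedStringWithCanonicalMapping] length];
}

// Hypothetical helper: UTF-16 length of the decomposed (NFD) form.
static NSUInteger DecomposedLength(NSString *s) {
    return [[s decomposedStringWithCanonicalMapping] length];
}
```

For "müssen" the first helper gives 6 and the second gives 7, because the
decomposed form stores 'u' and the combining diaeresis (U+0308) as separate
code units.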
> I was hoping for the same logic with U+247e5.
>> Is there any particular problem that this is causing in your code?
>
> Yes. I am importing characters from a text file and need to process them in
> a certain way. A word may have an alternate form which is denoted after the
> word in square brackets. When the alternate form contains some of the same
> characters in the same position, they are represented with a dash. It's more
> complicated than that in that there are alternate words and characters that
> are separated by / and //.
>
> Anyway, without getting into more details, the way I'm currently processing
> the data depends on the number of characters.
Then you need to be very careful how you define "character".
Is "ü" a single character, or two characters?
Is "킴" a single character, or three characters?
What about when they are decomposed?
If you want each of those to count as a single character, in either form,
then you can use -rangeOfComposedCharacterSequenceAtIndex: to find the range
of indices (representing a single "character") that contains the given index.
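
As a rough sketch of how that could be used to count "characters", assuming
a composed character sequence is the definition you want
(ComposedCharacterCount is a made-up name, not an NSString method):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts composed character sequences by jumping from
// the start of one sequence to the start of the next.
static NSUInteger ComposedCharacterCount(NSString *s) {
    NSUInteger count = 0;
    NSUInteger index = 0;
    NSUInteger length = [s length];
    while (index < length) {
        // Range of the whole sequence (surrogate pair, base + combining
        // marks, ...) that contains the code unit at `index`.
        NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(r);
        count++;
    }
    return count;
}
```

With this, a string holding U+7075 followed by U+247E5 counts as 2 even
though -length reports 3, and "müssen" counts as 6 in either form.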
> Is there a way to count the surrogate pair as one character?
Not using the -length method. Conceptually, NSString deals in UTF-16
code units and only UTF-16 code units.
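
If all you need is for a surrogate pair to count once (that is, to count
Unicode code points rather than UTF-16 code units), you can walk the code
units yourself. A minimal sketch; CodePointCount is a made-up name, and note
that it still counts a combining mark as its own "character":

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts code points by treating a high surrogate
// followed by a low surrogate as a single unit.
static NSUInteger CodePointCount(NSString *s) {
    NSUInteger count = 0;
    NSUInteger length = [s length];
    for (NSUInteger i = 0; i < length; i++) {
        unichar c = [s characterAtIndex:i];
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < length) {
            unichar next = [s characterAtIndex:i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++; // skip the low half of the pair
            }
        }
        count++;
    }
    return count;
}
```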
--
Clark S. Cox III
email@hidden