Re: NSString's handling of Unicode extension B (and C) characters

Subject: Re: NSString's handling of Unicode extension B (and C) characters
From: Clark Cox <email@hidden>
Date: Fri, 6 Nov 2009 09:42:41 -0800

On Fri, Nov 6, 2009 at 5:22 AM, Ryan Homer <email@hidden> wrote:
> On 2009-11-05, at 1:42 PM, Clark Cox wrote:
>
>> On Thu, Nov 5, 2009 at 8:04 AM, Ryan Homer <email@hidden> wrote:
>>>
>>> Actually,
>>>
>>> That was a bad example since \u only allows up to 4 digits, so the string
>>> was in fact a length of 3 characters, the character '5' being the 3rd.
>>> However, the issue still seems to exist.
>>>
>>> I have the actual characters in a text file and an application that
>>> imports
>>> the data. When the application imports the string with those two
>>> characters,
>>> it returns a length of 3. I will paste the characters directly into the
>>> string constant, though some people might not be able to see them.
>>>
>>> NSString *s = @"灵𤟥";
>>> NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);
>>>
>>> OR
>>>
>>> NSString *s = @"\u7075\xf0\xa4\x9f\xa5";
>>> NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);
>>>
>>> still returns a length of 3.
>>
>> NSString uses UTF-16, so your U+247E5 character is represented by two
>> surrogate code units (a surrogate pair). In general, you should never
>> expect the length of a string in code units, as a programmer would see
>> it, to match the number of characters in the string as users would see it.
>>
>> You don't even have to involve characters outside of the basic
>> multilingual plane for this to be an issue. Take, for example, the
>> string "müssen" (i.e. the verb "must" in German). There are two ways
>> of representing this string, one of which will have a length of 6,
>> while the other has a length of 7.
>
> Are you referring to the alternate form muessen or the decomposed form?

I'm referring to decomposition.
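
To make the two representations concrete, here is a minimal sketch that spells out both forms with escapes. The function names are made up for illustration; only -length is actual NSString API.

```objc
#import <Foundation/Foundation.h>

// Precomposed form: "ü" is the single code point U+00FC.
static NSString *PrecomposedMuessen(void) {
    return @"m\u00FCssen";
}

// Decomposed form: "u" followed by U+0308 COMBINING DIAERESIS.
static NSString *DecomposedMuessen(void) {
    return @"mu\u0308ssen";
}

int main(void) {
    @autoreleasepool {
        // Both strings render identically, but -length counts UTF-16 code units.
        NSLog(@"precomposed length = %lu",
              (unsigned long)[PrecomposedMuessen() length]);
        NSLog(@"decomposed length  = %lu",
              (unsigned long)[DecomposedMuessen() length]);
    }
    return 0;
}
```

The first string reports a length of 6 and the second a length of 7, even though a user would see the same word in both cases.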

> While the decomposed form would have a length > 6, the string length of
> @"müssen" is correctly 6 because the umlaut is considered part of the 'u',
> unless decomposed.

And any code that is indexing a Unicode string *must* be prepared to
accept both decomposed and precomposed forms.
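
One way to cope with both forms, sketched here under the assumption that canonical equivalence is all you care about, is to normalize the input before comparing or indexing it. EqualAfterNormalizing is a hypothetical helper; -precomposedStringWithCanonicalMapping and -isEqualToString: are the real NSString methods.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: compares two strings after normalizing both to
// precomposed form (NFC), so canonically equivalent input compares equal.
static BOOL EqualAfterNormalizing(NSString *a, NSString *b) {
    return [[a precomposedStringWithCanonicalMapping]
            isEqualToString:[b precomposedStringWithCanonicalMapping]];
}

int main(void) {
    @autoreleasepool {
        NSString *precomposed = @"m\u00FCssen";   // "ü" as U+00FC
        NSString *decomposed  = @"mu\u0308ssen";  // "u" + U+0308

        // -isEqualToString: compares code units literally, so the raw
        // strings are reported as different.
        NSLog(@"literal equal:    %d",
              [precomposed isEqualToString:decomposed]);

        // After normalization both consist of the same six code units.
        NSLog(@"normalized equal: %d",
              EqualAfterNormalizing(precomposed, decomposed));
        NSLog(@"normalized length: %lu",
              (unsigned long)[[decomposed precomposedStringWithCanonicalMapping] length]);
    }
    return 0;
}
```

Normalizing to the precomposed form does not help with sequences that have no precomposed equivalent, and it does nothing for surrogate pairs, so it is only part of the answer.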

> I was hoping for the same logic with U+247e5.

>> Is there any particular problem that this is causing in your code?
>
> Yes. I am importing characters from a text file and need to process them in
> a certain way. A word may have an alternate form which is denoted after the
> word in square brackets. When the alternate form contains some of the same
> characters in the same position, they are represented with a dash. It's more
> complicated than that in that there are alternate words and characters that
> are separated by / and //.
>
> Anyway, without getting into more details, the way I'm currently processing
> the data depends on the number of characters.

Then you need to be very careful how you define "character".

Is "ü" a single character, or two characters?
Is "킴" a single character, or three characters?
What about when they are decomposed?

If you want each of those to count as a single character, in either
form, then you can use -rangeOfComposedCharacterSequenceAtIndex: to find
the range of indices (representing a single "character") that contains
the given index.
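
As a rough sketch of how that method can be used to count "characters", the loop below hops from one composed sequence to the next rather than stepping one code unit at a time. ComposedCharacterCount is a name invented for this example, not an existing API.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts composed character sequences in a string.
static NSUInteger ComposedCharacterCount(NSString *s) {
    NSUInteger count = 0;
    NSUInteger index = 0;
    while (index < [s length]) {
        // The returned range covers the whole sequence containing `index`,
        // e.g. both halves of a surrogate pair or a base plus combining marks.
        NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(r);
        count++;
    }
    return count;
}

int main(void) {
    @autoreleasepool {
        // U+7075 followed by U+247E5, the two characters from the question.
        NSString *s = @"\u7075\U000247E5";
        NSLog(@"length = %lu, composed sequences = %lu",
              (unsigned long)[s length],
              (unsigned long)ComposedCharacterCount(s));
    }
    return 0;
}
```

For the two-character string from the original question this reports a length of 3 but 2 composed sequences. The same helper returns 6 for "müssen" in either its precomposed or decomposed form.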

> Is there a way to count the surrogate pair as one character?

Not using the -length method. Conceptually, NSString deals in UTF-16
code units and only UTF-16 code units, so -length always reports the
number of code units.
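
To see the code units directly, a small sketch like the following dumps them with -characterAtIndex:. CodeUnitDescription is a hypothetical helper written for this illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: lists the UTF-16 code units of a string as hex.
static NSString *CodeUnitDescription(NSString *s) {
    NSMutableArray *units = [NSMutableArray array];
    for (NSUInteger i = 0; i < [s length]; i++) {
        // -characterAtIndex: returns one unichar, i.e. one UTF-16 code unit.
        unichar unit = [s characterAtIndex:i];
        [units addObject:[NSString stringWithFormat:@"0x%04X", (unsigned)unit]];
    }
    return [units componentsJoinedByString:@" "];
}

int main(void) {
    @autoreleasepool {
        // U+7075 followed by U+247E5 (outside the basic multilingual plane).
        NSString *s = @"\u7075\U000247E5";
        NSLog(@"%@", CodeUnitDescription(s));
    }
    return 0;
}
```

For that string the helper yields 0x7075 0xD851 0xDFE5: one code unit for U+7075, then the high and low surrogates that together encode U+247E5. That is where the length of 3 comes from.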



--
Clark S. Cox III
email@hidden