Re: NSString's handling of Unicode extension B (and C) characters

Subject: Re: NSString's handling of Unicode extension B (and C) characters
From: Thomas Wetmore <email@hidden>
Date: Thu, 05 Nov 2009 11:31:56 -0500

Ryan,

Note this quote from Apple's "String Programming Guide for Cocoa":

"NSString objects are conceptually UTF-16 with platform endianness. That doesn't necessarily imply anything about their internal storage mechanism; what it means is that NSString lengths, character indexes, and ranges are expressed in terms of UTF-16 units, and that the term “character” in NSString method names refers to 16-bit platform-endian UTF-16 units. This is a common convention for string objects. In most cases, clients don't need to be overly concerned with this; as long as you are dealing with substrings, the precise interpretation of the range indexes is not necessarily significant.

The vast majority of Unicode code points used for writing living languages are represented by single UTF-16 units. However, some less common Unicode code points are represented in UTF-16 by surrogate pairs. A surrogate pair is a sequence of two UTF-16 units, taken from specific reserved ranges, that together represent a single Unicode code point. CFString has functions for converting between surrogate pairs and the UTF-32 representation of the corresponding Unicode code point. When dealing with NSString objects, one constraint is that substring boundaries usually should not separate the two halves of a surrogate pair. This is generally automatic for ranges returned from most Cocoa methods, but if you are constructing substring ranges yourself you should keep this in mind. However, this is not the only constraint you should consider."
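
To make the surrogate-pair conversion concrete, here is a rough sketch of the arithmetic in plain C (it also compiles as Objective-C). The function names are made up for illustration and are not Apple API; I believe CFString's own helpers, `CFStringGetSurrogatePairForLongCharacter` and `CFStringGetLongCharacterForSurrogatePair`, do the equivalent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a supplementary-plane code point
   (U+10000..U+10FFFF) into its UTF-16 high and low surrogates. */
void encode_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                 /* 20-bit offset */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Hypothetical helper: rebuild the code point from a surrogate pair. */
uint32_t decode_surrogate_pair(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10)
                      | (uint32_t)(low - 0xDC00));
}
```

For the Extension B character in this thread, U+247E5, the encoder should yield the pair 0xD851 0xDFE5, and decoding that pair gives back 0x247E5.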

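Thus each of the extended (Extension B or C) characters is being encoded in the NSString as one surrogate pair, that is, two UTF-16 units. Since length counts UTF-16 units rather than code points, your two-character string reports a length of 3.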
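
If you need the number of code points rather than the number of UTF-16 units, one option is to walk the units and count each surrogate pair once. The following is only a sketch in plain C over a raw UTF-16 buffer; `count_code_points` is a name I made up, and in Cocoa you would more likely step through the string with `rangeOfComposedCharacterSequenceAtIndex:`, which also keeps combining sequences together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count Unicode code points in a UTF-16 buffer.
   A high surrogate followed by a low surrogate counts as one code point;
   an unpaired surrogate is counted as one unit on its own. */
size_t count_code_points(const uint16_t *units, size_t n_units)
{
    assert(units != NULL || n_units == 0);
    size_t count = 0;
    for (size_t i = 0; i < n_units; i++) {
        int is_high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        int next_is_low = (i + 1 < n_units)
                          && units[i + 1] >= 0xDC00
                          && units[i + 1] <= 0xDFFF;
        if (is_high && next_is_low) {
            i++;  /* skip the low half of the pair */
        }
        count++;
    }
    return count;
}
```

For the string in question, the units are 0x7075, 0xD851, 0xDFE5: three UTF-16 units (what length reports) but two code points.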

Tom Wetmore


On Nov 5, 2009, at 11:04 AM, Ryan Homer wrote:

Actually,
That was a bad example since \u takes exactly 4 hex digits, so the string in fact had a length of 3 characters, the third being the character '5'. However, the issue still seems to exist.

I have the actual characters in a text file and an application that imports the data. When the application imports the string with those two characters, it returns a length of 3. I will paste the characters directly into the string constant, though some people might not be able to see them.
	NSString *s = @"灵𤟥";
	NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);
or
	NSString *s = @"\u7075\xf0\xa4\x9f\xa5";
	NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);
still returns a length of 3.

On 2009-11-05, at 10:52 AM, Ryan Homer wrote:
Actually,
That was a bad example since \u takes exactly 4 hex digits, so the string in fact had a length of 3 characters, the third being the character '5'. I'm not sure how to escape the character with Unicode code point U+247E5, but the issue still seems to exist.

I have the actual characters in a text file and an application that imports the data. When the application imports the string with those two characters, it returns a length of 3. I will paste the characters directly into the string constant, though some people might not be able to see them.
	NSString *s = @"灵𤟥";
	NSLog(@"length=%lu", (unsigned long)s.length);
This still returns a length of 3.

On 2009-11-05, at 10:39 AM, Ryan Homer wrote:
Unicode 3.1 (2001) brought us Extension B (AFAIK) and the recent Unicode 5.2 (2009-10-01) brings us Extension C. It seems to me that NSString's length method/property does not return the proper length for these characters.
Starting with a small example,
	NSString *s = @"\u7075\u247e5";
	NSLog(@"length=%lu", (unsigned long)s.length);
you'd think that the result would be 2. It is, however, 3. The first character is a Chinese character from the CJK Unified Ideographs range in the Han category. The second one is from the Han Extension B range. In my very limited testing, this only seems to occur for extension B & C characters, not ext. A. I'm wondering if this is a bug in the way NSString handles ext. B and C characters.

There are many characters that require more than one byte for their internal Unicode representation. However, NSString still counts a character as ONE character, regardless of the number of bytes. So, it was surprising for me to get a length of 3 in the above example.

Can someone provide any insight on this? I am thinking of filing a bug with Apple but would like to hear what other people think about this situation first, as I'm not very well versed in the intricacies of Unicode.
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden




