Re: NSString's handling of Unicode extension B (and C) characters

  • Subject: Re: NSString's handling of Unicode extension B (and C) characters
  • From: Ryan Homer <email@hidden>
  • Date: Thu, 5 Nov 2009 11:04:49 -0500

Actually,

That was a bad example: \u takes exactly 4 hex digits, so \u247e5 was read as U+247E followed by a literal '5', and the string really was 3 characters long. However, the issue still seems to exist.

I have the actual characters in a text file and an application that imports the data. When the application imports the string with those two characters, it returns a length of 3. I will paste the characters directly into the string constant, though some people might not be able to see them.

NSString *s = @"灵𤟥";
NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);

OR

NSString *s = @"\u7075\xf0\xa4\x9f\xa5";
NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);

still returns a length of 3.
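
For what it's worth, a result of 3 is consistent with NSString's length being documented as the number of UTF-16 code units in the string, not the number of Unicode code points. The sketch below is plain C (it should also compile as Objective-C) and only illustrates that counting rule; it is not how NSString is implemented. The function names are hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of Unicode code points in a NUL-terminated
   string, assuming the bytes are valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        /* Every byte that is not a continuation byte (10xxxxxx) starts
           a new code point. */
        if ((*p & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Hypothetical helper: number of UTF-16 code units the same text needs,
   again assuming valid UTF-8 input. */
size_t utf16_units_in_utf8(const char *s)
{
    assert(s != NULL);
    size_t units = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if ((*p & 0xC0) == 0x80) {
            continue; /* continuation byte, already accounted for */
        }
        /* A 4-byte sequence (lead byte 11110xxx) encodes a code point above
           U+FFFF, which takes a surrogate pair, i.e. two UTF-16 code units. */
        units += ((*p & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```

Applied to the bytes e7 81 b5 f0 a4 9f a5 (U+7075 followed by U+247E5 in UTF-8), the first function gives 2 and the second gives 3.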

On 2009-11-05, at 10:52 AM, Ryan Homer wrote:

Actually,

That was a bad example: \u takes exactly 4 hex digits, so \u247e5 was read as U+247E followed by a literal '5', and the string really was 3 characters long. I'm not sure how to escape the character at code point U+247E5, but the issue still seems to exist.

I have the actual characters in a text file and an application that imports the data. When the application imports the string with those two characters, it returns a length of 3. I will paste the characters directly into the string constant, though some people might not be able to see them.

NSString *s = @"灵𤟥";
NSLog(@"length=%lu", (unsigned long)s.length);

This still returns a length of 3.

On 2009-11-05, at 10:39 AM, Ryan Homer wrote:

Unicode 3.1 (2001) brought us Extension B (AFAIK) and the recent Unicode 5.2 (2009-10-01) brings us Extension C. It seems to me that NSString's length method/property does not return the proper length for these characters.

Starting with a small example,

	NSString *s = @"\u7075\u247e5";
	NSLog(@"length=%lu", (unsigned long)s.length);

you'd think that the result would be 2. It is, however, 3. The first character is a Chinese character from the CJK Unified Ideographs range in the Han category. The second one is from the Han Extension B range. In my very limited testing, this only seems to occur for extension B & C characters, not ext. A. I'm wondering if this is a bug in the way NSString handles ext. B and C characters.

There are many characters that require more than one byte for their internal Unicode representation. However, NSString still counts a character as ONE character, regardless of the number of bytes. So, it was surprising for me to get a length of 3 in the above example.
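
One possible explanation, offered as a sketch rather than a description of NSString's internals: in UTF-16, code points up to U+FFFF take one 16-bit unit, while code points from U+10000 upward are split into a surrogate pair of two units. Extension A sits below U+10000 and Extensions B and C sit above it, which would match the pattern described above. The C code below (it should also compile as Objective-C) works through the standard arithmetic; all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True for valid Unicode scalar values: at most U+10FFFF and not in the
   surrogate range U+D800..U+DFFF. */
static int is_unicode_scalar(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Number of UTF-16 code units needed for one code point. */
size_t utf16_unit_count(uint32_t cp)
{
    assert(is_unicode_scalar(cp));
    return cp >= 0x10000 ? 2 : 1;
}

/* First unit of the surrogate pair: 0xD800 plus the top 10 bits of
   (cp - 0x10000). Only meaningful for code points above U+FFFF. */
uint16_t high_surrogate(uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* Second unit of the surrogate pair: 0xDC00 plus the low 10 bits of
   (cp - 0x10000). */
uint16_t low_surrogate(uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For the example string, U+7075 needs one unit and U+247E5 becomes the pair 0xD851 0xDFE5, for three units in total.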

Can someone provide any insight on this? I am thinking of filing a bug with Apple, but would like to hear what other people think about this situation first, as I'm not very well versed in the intricacies of Unicode.

