NSString's handling of Unicode extension B (and C) characters
- Subject: NSString's handling of Unicode extension B (and C) characters
- From: Ryan Homer <email@hidden>
- Date: Thu, 5 Nov 2009 10:39:19 -0500
Unicode 3.1 (2001) brought us Extension B (AFAIK) and the recent
Unicode 5.2 (2009-10-01) brings us Extension C. It seems to me that
NSString's length method/property does not return the proper length
for these characters.
Starting with a small example,
    // U+247E5 needs the 8-digit \U form; \u takes exactly four hex digits.
    NSString *s = @"\u7075\U000247e5";
    NSLog(@"length=%lu", (unsigned long)s.length);
you'd think that the result would be 2. It is, however, 3. The first
character is a Chinese character from the CJK Unified Ideographs range
in the Han category. The second one is from the Han Extension B range.
In my very limited testing, this only seems to occur for extension B &
C characters, not ext. A. I'm wondering if this is a bug in the way
NSString handles ext. B and C characters.
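To make the comparison between the ranges reproducible, here is a
minimal sketch that logs the reported length for one sample code point
from each range. The helper names LogLength and LogSampleLengths are
made up for illustration, and the sample code points are just the ones
I picked (U+3400 for ext. A, U+247E5 for ext. B, U+2A700 for ext. C).

```objc
#import <Foundation/Foundation.h>

// Logs a label next to the length NSString reports for the sample and
// returns that length so the values can be compared.
static NSUInteger LogLength(NSString *label, NSString *sample) {
    NSUInteger n = [sample length];
    NSLog(@"%@: length=%lu", label, (unsigned long)n);
    return n;
}

// Hypothetical driver: one sample code point from each range.
// Code points above U+FFFF are written with the 8-digit \U escape.
static void LogSampleLengths(void) {
    LogLength(@"CJK Unified U+7075", @"\u7075");
    LogLength(@"Ext. A U+3400", @"\u3400");
    LogLength(@"Ext. B U+247E5", @"\U000247E5");
    LogLength(@"Ext. C U+2A700", @"\U0002A700");
}
```

Calling LogSampleLengths() should report 1 for the first two samples
and 2 for the ext. B and C samples, which matches what I am seeing.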
There are many characters that require more than one byte for their
internal Unicode representation. However, NSString still counts a
character as ONE character, regardless of the number of bytes. So, it
was surprising for me to get a length of 3 in the above example.
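For comparison, here is a rough sketch of counting what I would call
"characters" by walking the string one composed character sequence at
a time with rangeOfComposedCharacterSequenceAtIndex:. It assumes that
the length method is counting 16-bit units rather than whole
characters, which is only my guess at an explanation. The function
name ComposedCharacterCount is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Counts composed character sequences instead of relying on -length.
// Each step jumps past the whole sequence that starts at index i.
static NSUInteger ComposedCharacterCount(NSString *s) {
    NSUInteger count = 0;
    NSUInteger i = 0;
    NSUInteger len = [s length];
    while (i < len) {
        NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:i];
        i = NSMaxRange(r);  // first index after this sequence
        count++;
    }
    return count;
}
```

For the two-character string in the example above, this returns 2
while length returns 3. Note that it also merges a base character with
any combining marks that follow it, so it is not a pure code point
count.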
Can someone provide any insight on this? I am thinking of filing a bug
with Apple but would like to hear what other people think about this
situation first, as I'm not very well versed in the intricacies of
Unicode.