Re: NSString's handling of Unicode extension B (and C) characters
- Subject: Re: NSString's handling of Unicode extension B (and C) characters
- From: Ryan Homer <email@hidden>
- Date: Thu, 5 Nov 2009 11:04:49 -0500
Actually, that was a bad example: \u takes exactly four hex digits, so
the string was in fact 3 characters long, with the character '5' being
the 3rd. However, the issue still seems to exist.
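For reference, C99 also defines an eight-digit form, \U, for code points that do not fit in four hex digits, and as far as I know clang accepts the same escape inside @"..." literals. Below is a minimal sketch in plain C (which also compiles as Objective-C). The names kSample and utf8_code_points are made up for illustration, and the byte values assume the compiler's execution character set is UTF-8, which is the gcc/clang default.

```c
#include <stddef.h>

/* \u takes exactly four hex digits; \U takes exactly eight.
 * Assumption: the execution character set is UTF-8, so this literal
 * is stored as the bytes E7 81 B5 F0 A4 9F A5 plus a terminating NUL. */
static const char kSample[] = "\u7075\U000247E5";

/* Count code points in a NUL-terminated UTF-8 string by skipping
 * continuation bytes, which always look like 10xxxxxx. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Under that assumption, utf8_code_points(kSample) should come out as 2, so the escaped literal really does hold just the two intended characters.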
I have the actual characters in a text file and an application that
imports the data. When the application imports the string with those
two characters, it returns a length of 3. I will paste the characters
directly into the string constant, though some people might not be
able to see them.
NSString *s = @"灵𤟥";
NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);
OR
NSString *s = @"\u7075\xf0\xa4\x9f\xa5";
NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);
still returns a length of 3.
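A possible explanation, offered as a sketch and not as a statement about NSString's internals: if length counts 16-bit UTF-16 units (unichar) and not whole code points, then any character above U+FFFF is stored as a surrogate pair and counts twice. The plain C below (also valid Objective-C) encodes a code point as UTF-16 and then counts code points in the result. The function names utf16_encode and utf16_code_points are hypothetical helpers, not Foundation API.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Returns the number of 16-bit units
 * written to out (1 or 2), or 0 if cp is not a valid scalar value. */
static size_t utf16_encode(uint32_t cp, uint16_t *out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;              /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                          /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}

/* Count code points in a UTF-16 buffer: a high surrogate immediately
 * followed by a low surrogate is counted as one character. */
static size_t utf16_code_points(const uint16_t *s, size_t units)
{
    size_t n = 0;
    for (size_t i = 0; i < units; ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < units &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;                            /* skip the low surrogate */
        }
        ++n;
    }
    return n;
}
```

Encoding U+7075 and then U+247E5 this way gives the three units 7075 D851 DFE5, which would match a reported length of 3, while utf16_code_points over those three units gives 2.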
On 2009-11-05, at 10:52 AM, Ryan Homer wrote:
Actually,
That was a bad example since \u only allows up to 4 digits, so the
string was in fact a length of 3 characters, the character '5' being
the 3rd. I'm not sure how to escape this character represented by
unicode id 247e5, but the issue still seems to exist.
I have the actual characters in a text file and an application that
imports the data. When the application imports the string with those
two characters, it returns a length of 3. I will paste the
characters directly into the string constant, though some people
might not be able to see them.
NSString *s = @"灵𤟥";
NSLog(@"length=%lu", (unsigned long)s.length);
This still returns a length of 3.
On 2009-11-05, at 10:39 AM, Ryan Homer wrote:
Unicode 3.1 (2001) brought us Extension B (AFAIK) and the recent
Unicode 5.2 (2009-10-01) brings us Extension C. It seems to me that
NSString's length method/property does not return the proper length
for these characters.
Starting with a small example,
NSString *s = @"\u7075\u247e5";
NSLog(@"length=%lu", (unsigned long)s.length);
you'd think that the result would be 2. It is, however, 3. The
first character is a Chinese character from the CJK Unified
Ideographs range in the Han category. The second one is from the
Han Extension B range. In my very limited testing, this only seems
to occur for extension B & C characters, not ext. A. I'm wondering
if this is a bug in the way NSString handles ext. B and C characters.
There are many characters that require more than one byte for their
internal Unicode representation. However, NSString still counts a
character as ONE character, regardless of the number of bytes. So,
it was surprising for me to get a length of 3 in the above example.
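One way to check the byte-count reasoning above is to compare how many UTF-8 bytes and how many UTF-16 units a given code point needs. The sketch below is plain C (also valid Objective-C) with hypothetical helper names utf8_bytes and utf16_units. It assumes valid code points and does no error checking.

```c
#include <stdint.h>

/* Number of bytes needed to encode cp in UTF-8 (cp assumed valid). */
static int utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 16-bit units needed to encode cp in UTF-16 (cp assumed valid). */
static int utf16_units(uint32_t cp)
{
    return (cp < 0x10000) ? 1 : 2;
}
```

For U+7075 this gives 3 bytes and 1 unit, and for U+3400 (the first Extension A character) also 3 bytes and 1 unit. For U+247E5 it gives 4 bytes and 2 units. Extension A sits below U+10000 while Extensions B and C sit above it, which may be why only B and C show the extra count.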
Can someone provide any insight on this? I am thinking of filing a
bug with Apple but would like to hear what other people think about
this situation first as I'm not very well versed on the intricacies
of Unicode.