Re: NSString's handling of Unicode extension B (and C) characters
- Subject: Re: NSString's handling of Unicode extension B (and C) characters
- From: Thomas Wetmore <email@hidden>
- Date: Thu, 05 Nov 2009 11:31:56 -0500
Ryan,
Note this quote from Apple's "String Programming Guide for Cocoa":
"NSString objects are conceptually UTF-16 with platform endianness.
That doesn't necessarily imply anything about their internal storage
mechanism; what it means is that NSString lengths, character indexes,
and ranges are expressed in terms of UTF-16 units, and that the term
“character” in NSString method names refers to 16-bit platform-
endian UTF-16 units. This is a common convention for string objects.
In most cases, clients don't need to be overly concerned with this; as
long as you are dealing with substrings, the precise interpretation of
the range indexes is not necessarily significant.
The vast majority of Unicode code points used for writing living
languages are represented by single UTF-16 units. However, some less
common Unicode code points are represented in UTF-16 by surrogate
pairs. A surrogate pair is a sequence of two UTF-16 units, taken from
specific reserved ranges, that together represent a single Unicode
code point. CFString has functions for converting between surrogate
pairs and the UTF-32 representation of the corresponding Unicode code
point. When dealing with NSString objects, one constraint is that
substring boundaries usually should not separate the two halves of a
surrogate pair. This is generally automatic for ranges returned from
most Cocoa methods, but if you are constructing substring ranges
yourself you should keep this in mind. However, this is not the only
constraint you should consider."
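Thus each Extension B (or C) character is being encoded in the NSString
as one surrogate pair, that is, two UTF-16 units. Your string holds one
ordinary unit plus one surrogate pair, so its length is 3, not 2.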
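To make the arithmetic concrete, here is a minimal sketch in plain C (which also compiles as Objective-C) of how a code point above U+FFFF splits into a surrogate pair. The helper name to_surrogates is made up for illustration and is not a Cocoa or CoreFoundation API.

```c
#include <stdint.h>

/* Hypothetical helper, not part of any Apple framework.
   Splits a supplementary code point (U+10000..U+10FFFF) into
   its UTF-16 high and low surrogate units. The caller must
   pass a code point in that range; no checking is done. */
void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low) {
    uint32_t v = cp - 0x10000;                 /* 20-bit offset */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}
```

For U+247E5 this gives the units 0xD851 and 0xDFE5, which together with the single unit 0x7075 accounts for the reported length of 3.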
Tom Wetmore
On Nov 5, 2009, at 11:04 AM, Ryan Homer wrote:
Actually,
That was a bad example since \u only allows up to 4 digits, so the
string was in fact a length of 3 characters, the character '5' being
the 3rd. However, the issue still seems to exist.
I have the actual characters in a text file and an application that
imports the data. When the application imports the string with those
two characters, it returns a length of 3. I will paste the
characters directly into the string constant, though some people
might not be able to see them.
NSString *s = @"灵𤟥";
NSLog(@"%@ (length=%d)",s,s.length);
OR
NSString *s = @"\u7075\xf0\xa4\x9f\xa5";
NSLog(@"%@ (length=%d)",s,s.length);
still returns a length of 3.
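As a rough cross-check of that number, the following plain C sketch counts how many UTF-16 units a UTF-8 byte sequence needs. The function name utf16_units is hypothetical, and the code assumes well-formed, NUL-terminated UTF-8 input with no validation.

```c
#include <stddef.h>

/* Hypothetical helper: counts the UTF-16 units needed to hold
   a well-formed, NUL-terminated UTF-8 string. Input is assumed
   valid; malformed sequences are not detected. */
size_t utf16_units(const char *utf8) {
    size_t units = 0;
    const unsigned char *p;
    for (p = (const unsigned char *)utf8; *p; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                     /* continuation byte */
        units += (*p >= 0xF0) ? 2 : 1;    /* 4-byte lead needs a surrogate pair */
    }
    return units;
}
```

Called on the bytes "\xe7\x81\xb5\xf0\xa4\x9f\xa5" (the two characters above in UTF-8), it returns 3: one unit for the first character and two for the second.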
On 2009-11-05, at 10:52 AM, Ryan Homer wrote:
Actually,
That was a bad example since \u only allows up to 4 digits, so the
string was in fact a length of 3 characters, the character '5'
being the 3rd. I'm not sure how to escape this character, Unicode
code point U+247E5, but the issue still seems to exist.
I have the actual characters in a text file and an application that
imports the data. When the application imports the string with
those two characters, it returns a length of 3. I will paste the
characters directly into the string constant, though some people
might not be able to see them.
NSString *s = @"灵𤟥";
NSLog(@"length=%d",s.length);
This still returns a length of 3.
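On the escaping question: C99 has an 8-digit form, \U followed by exactly eight hex digits, for code points that do not fit in the 4-digit \u form. The sketch below is plain C and assumes the compiler's execution character set is UTF-8 (the default for gcc and clang); the function name is made up for illustration. The same escape should be usable inside an @"..." literal, though that is not shown here.

```c
#include <string.h>

/* Hypothetical check: returns 1 if the 8-digit escape for U+247E5
   produces the same bytes as the hand-written UTF-8 sequence.
   Assumes a UTF-8 execution character set. */
int escape_matches_utf8(void) {
    return strcmp("\U000247E5", "\xf0\xa4\x9f\xa5") == 0;
}
```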
On 2009-11-05, at 10:39 AM, Ryan Homer wrote:
Unicode 3.1 (2001) brought us Extension B (AFAIK) and the recent
Unicode 5.2 (2009-10-01) brings us Extension C. It seems to me
that NSString's length method/property does not return the proper
length for these characters.
Starting with a small example,
NSString *s = @"\u7075\u247e5";
NSLog(@"length=%d",s.length);
you'd think that the result would be 2. It is, however, 3. The
first character is a Chinese character from the CJK Unified
Ideographs range in the Han category. The second one is from the
Han Extension B range. In my very limited testing, this only seems
to occur for extension B & C characters, not ext. A. I'm wondering
if this is a bug in the way NSString handles ext. B and C
characters.
There are many characters that require more than one byte for
their internal Unicode representation. However, NSString still
counts a character as ONE character, regardless of the number of
bytes. So, it was surprising for me to get a length of 3 in the
above example.
Can someone provide any insight on this? I am thinking of filing a
bug with Apple but would like to hear what other people think
about this situation first, as I'm not very well versed in the
intricacies of Unicode.
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden