Re: NSString's handling of Unicode extension B (and C) characters
  • Subject: Re: NSString's handling of Unicode extension B (and C) characters
  • From: Ryan Homer <email@hidden>
  • Date: Fri, 6 Nov 2009 08:22:25 -0500

On 2009-11-05, at 1:42 PM, Clark Cox wrote:

On Thu, Nov 5, 2009 at 8:04 AM, Ryan Homer <email@hidden> wrote:
Actually,

That was a bad example since \u only allows up to 4 digits, so the string
was in fact a length of 3 characters, the character '5' being the 3rd.
However, the issue still seems to exist.


I have the actual characters in a text file and an application that imports
the data. When the application imports the string with those two characters,
it returns a length of 3. I will paste the characters directly into the
string constant, though some people might not be able to see them.


NSString *s = @"灵𤟥";
NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);

OR

NSString *s = @"\u7075\xf0\xa4\x9f\xa5";
NSLog(@"%@ (length=%lu)", s, (unsigned long)s.length);

still returns a length of 3.

NSString uses UTF-16, so your U+247E5 character is represented by two surrogate code units (a surrogate pair). In general, you should never expect the length of a string in code units, as a programmer would see it, to match the number of characters in the string as users would see it.
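
To make the surrogate-pair point concrete, here is a minimal sketch that lists the UTF-16 code units behind a string. The helper name CodeUnitDump is hypothetical (it is not part of Foundation), and the example assumes a compiler that accepts \u and \U escapes in NSString literals, as recent versions of clang do.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not a Foundation API): returns the UTF-16 code
// units of a string as space-separated hex. These units are exactly
// what -length counts.
static NSString *CodeUnitDump(NSString *s) {
    NSMutableArray *units = [NSMutableArray array];
    for (NSUInteger i = 0; i < [s length]; i++) {
        unichar c = [s characterAtIndex:i];
        [units addObject:[NSString stringWithFormat:@"%04X", (unsigned int)c]];
    }
    return [units componentsJoinedByString:@" "];
}

// For @"\u7075\U000247E5" this yields "7075 D851 DFE5":
// one code unit for U+7075, then a high and a low surrogate for U+247E5.
```

The \U form with eight hex digits is the standard way to write a code point above U+FFFF in a literal, which \u with four digits cannot express.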

You don't even have to involve characters outside of the basic
multilingual plane for this to be an issue. Take, for example, the
string "müssen" (i.e. the verb "must" in German). There are two ways
of representing this string, one of which will have a length of 6,
while the other has a length of 7.

Are you referring to the alternate spelling "muessen" or to the decomposed form? While the decomposed form would have a length greater than 6, the length of @"müssen" is correctly 6, because in the precomposed form the umlaut is part of the 'ü'. I was hoping for the same logic with U+247E5.
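
The two representations of "müssen" can be compared directly. This is only a sketch: the helper names PrecomposedLength and DecomposedLength are hypothetical, while precomposedStringWithCanonicalMapping and decomposedStringWithCanonicalMapping are the NSString methods that normalize to the composed and decomposed forms.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: UTF-16 length after canonical composition,
// where 'ü' is the single code unit U+00FC.
static NSUInteger PrecomposedLength(NSString *s) {
    return [[s precomposedStringWithCanonicalMapping] length];
}

// Hypothetical helper: UTF-16 length after canonical decomposition,
// where 'ü' becomes 'u' followed by the combining diaeresis U+0308.
static NSUInteger DecomposedLength(NSString *s) {
    return [[s decomposedStringWithCanonicalMapping] length];
}

// For @"m\u00FCssen":
//   PrecomposedLength gives 6
//   DecomposedLength  gives 7
```

Both forms display identically, so the length reported depends on how the text happened to be encoded when it was imported.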



Is there any particular problem that this is causing in your code?

Yes. I am importing characters from a text file and need to process them in a certain way. A word may have an alternate form, which is given after the word in square brackets. When the alternate form contains some of the same characters in the same position, those characters are represented with a dash. It is actually more complicated than that, since there are also alternate words and characters that are separated by / and //.


Anyway, without getting into more details, the way I'm currently processing the data depends on the number of characters. Is there a way to count the surrogate pair as one character?
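
One possible approach, offered as a sketch rather than a definitive answer: walk the string with rangeOfComposedCharacterSequenceAtIndex:, which treats a surrogate pair (and a base character plus its combining marks) as a single sequence. The helper names ComposedCharacterCount and ComposedCharacters are hypothetical.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: splits a string into composed character
// sequences, so each element is one user-visible character.
static NSArray *ComposedCharacters(NSString *s) {
    NSMutableArray *result = [NSMutableArray array];
    NSUInteger i = 0;
    NSUInteger len = [s length];
    while (i < len) {
        // The returned range covers the whole sequence containing
        // index i: both halves of a surrogate pair, or a base
        // character together with its combining marks.
        NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:i];
        [result addObject:[s substringWithRange:r]];
        i = NSMaxRange(r);
    }
    return result;
}

// Hypothetical helper: number of composed character sequences.
static NSUInteger ComposedCharacterCount(NSString *s) {
    return [ComposedCharacters(s) count];
}

// ComposedCharacterCount(@"\u7075\U000247E5") gives 2,
// while [@"\u7075\U000247E5" length] gives 3.
```

Working on the array returned by ComposedCharacters lets position-based processing index whole characters instead of UTF-16 code units. Note that this counts composed sequences rather than code points, so a decomposed "ü" also counts as one.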
