Re: NSString's handling of Unicode extension B (and C) characters


  • Subject: Re: NSString's handling of Unicode extension B (and C) characters
  • From: Alastair Houghton <email@hidden>
  • Date: Sat, 7 Nov 2009 16:01:50 +0000

On 7 Nov 2009, at 14:17, Ryan Homer wrote:

> On 2009-11-06, at 12:42 PM, Clark Cox wrote:
>
>> Is "ü" a single character, or two characters?
>
> When you define a string using ü, isn't it stored internally as one UTF-16 code unit (not sure if I'm using the notation correctly), represented as U+00FC (which is one code unit),

No. It could be either the precomposed character U+00FC or the decomposed sequence U+0075 U+0308. It depends on how it was entered (wherever you entered it). This, incidentally, is one reason why it isn't trivial for the compiler to support character encodings: if your source encoding was ISO-8859-1 (ISO Latin 1) and you entered L"ü" (or @"ü") or similar, should that be represented by the precomposed character or by the decomposed sequence? And what if you convert your source code to some other encoding in which the accent can only be represented by a combining character?
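
To make the two forms concrete, here is a small sketch in Python rather than Cocoa (the helper name utf16_units is made up for illustration). It assumes only that NSString's -length counts UTF-16 code units, and shows that the two spellings of "ü" are different strings of different lengths:

```python
# Two ways of writing "ü"; they look identical but are different strings.
precomposed = "\u00fc"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # U+0075 followed by U+0308 COMBINING DIAERESIS


def utf16_units(s: str) -> int:
    """Count UTF-16 code units (hypothetical helper, for illustration only)."""
    return len(s.encode("utf-16-le")) // 2


def fits_latin1(s: str) -> bool:
    """True if every character of s exists in ISO-8859-1."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(precomposed == decomposed)   # the two forms do not compare equal
print(utf16_units(precomposed), utf16_units(decomposed))
print(fits_latin1(precomposed), fits_latin1(decomposed))
```

ISO-8859-1 has no combining diaeresis, so only the precomposed form survives a round trip through that encoding.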


You can only really guarantee that you have one form or the other by asking for a particular normalization form; NSString has methods for that (e.g. -precomposedStringWithCanonicalMapping). But not every composed character sequence has a precomposed equivalent, and there's still the issue of surrogates, so this wouldn't really solve your problem.
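
Both caveats can be seen in a short sketch, again in Python using the standard unicodedata module, on the assumption that its NFC normalization behaves like the precomposed canonical mapping in NSString. The variable and helper names are illustrative only:

```python
import unicodedata


def nfc(s: str) -> str:
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", s)


def utf16_units(s: str) -> int:
    """Count UTF-16 code units (hypothetical helper, for illustration only)."""
    return len(s.encode("utf-16-le")) // 2


# "u" + combining diaeresis has a precomposed equivalent, U+00FC.
u_form = nfc("u\u0308")

# "q" + combining diaeresis has none, so NFC leaves two code points.
q_form = nfc("q\u0308")

# U+20000 is the first CJK Extension B ideograph. Normalization does not
# change it, and in UTF-16 it always needs a surrogate pair.
ext_b = nfc("\U00020000")

print(len(u_form), len(q_form))   # code points after NFC
print(utf16_units(ext_b))         # UTF-16 code units for one character
print(ext_b.encode("utf-16-be").hex())
```

So even after normalizing, one user-visible character can still occupy more than one UTF-16 code unit.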

>> ... then you can use
>> -rangeOfComposedCharacterSequenceAtIndex: to find the range of indices
>> (representing a single "character") that contain the given index.
>
> THANKS! This solves my problem.
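
For anyone wondering what such a lookup involves, here is a rough Python approximation. It is not how NSString implements -rangeOfComposedCharacterSequenceAtIndex:, and the function name composed_range_at is invented for this sketch. It treats a base character plus any following combining marks as one sequence and counts positions in UTF-16 code units, so surrogate pairs are never split. The real method handles more cases than this.

```python
import unicodedata


def composed_range_at(s: str, index: int) -> tuple:
    """Return (location, length), in UTF-16 code units, of the sequence
    containing the code unit at `index`. Approximation: a sequence is one
    code point plus any following combining marks."""
    # Record (start, length, is_combining) for every code point.
    spans = []
    pos = 0
    for ch in s:
        n = 2 if ord(ch) > 0xFFFF else 1   # surrogate pair outside the BMP
        spans.append((pos, n, unicodedata.combining(ch) != 0))
        pos += n
    if not 0 <= index < pos:
        raise IndexError("index out of range")

    # Find the code point whose code units include `index`.
    i = next(k for k, (start, n, _) in enumerate(spans)
             if start <= index < start + n)

    # Walk back from a combining mark to its base character.
    first = i
    while first > 0 and spans[first][2]:
        first -= 1
    # Walk forward over any combining marks that follow.
    last = i
    while last + 1 < len(spans) and spans[last + 1][2]:
        last += 1

    start = spans[first][0]
    end = spans[last][0] + spans[last][1]
    return (start, end - start)


# "a", decomposed "ü", the Extension B ideograph U+20000, "b"
sample = "a" + "u\u0308" + "\U00020000" + "b"
for idx in range(6):
    print(idx, composed_range_at(sample, idx))
```

Indices 1 and 2 both map to the decomposed "ü", and indices 3 and 4 both map to the surrogate pair.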

If you're going to get into text processing with Cocoa, it's a good idea to grab yourself a copy of the Unicode book if you don't already have one, and maybe (since the Unicode book itself is pretty dry) a companion such as Richard Gillam's Unicode Demystified:


<http://www.amazon.com/Unicode-Standard-Version-5-0-5th/dp/0321480910>
<http://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522>


(Of course, you can download chapters of the Unicode book from unicode.org. Personally I like having a hard copy---though it *is* a huge tome...)

Kind regards,

Alastair.

--
http://alastairs-place.net


