Re: NSString's handling of Unicode extension B (and C) characters
- Subject: Re: NSString's handling of Unicode extension B (and C) characters
- From: John Engelhart <email@hidden>
- Date: Sat, 7 Nov 2009 17:44:51 -0500
On Sat, Nov 7, 2009 at 11:01 AM, Alastair Houghton <email@hidden> wrote:
> On 7 Nov 2009, at 14:17, Ryan Homer wrote:
>
> On 2009-11-06, at 12:42 PM, Clark Cox wrote:
>>
>> Is "ü" a single character, or two characters?
>>>
>>
>> When you define a string using ü, isn't it stored internally as one UTF-16
>> code unit (not sure if I'm using the notation correctly), represented as
>> U+00FC (which is one code unit),
>>
>
> No. It could be either U+00FC or the decomposed form U+0075 U+0308. It
> depends how it has been entered (wherever you enter it). This,
> incidentally, is one reason that it isn't trivial for the compiler to
> support character encodings; if your character encoding was ISO-8859-1 (ISO
> Latin 1) and you entered L"ü" (or @"ü") or similar, should that be
> represented by the precomposed sequence, or the decomposed sequence? And
> how about if you convert your source code to some other form where the
> accent is necessarily represented by a combining character?
>
To be clear, your example isn't really a compiler-related issue; it's more
an example of the general problem of transcoding between different
character set encodings. The compiler (read: C99 / gcc) splits the problem
into two areas: the 'source character set' and the 'execution character
set'. As a rough rule of thumb, gcc requires the source character set to be
in ASCII / UTF-8. When character set conversions are required, gcc uses
iconv, which uses Unicode to perform conversions.
Though obviously not a requirement by any means, most of these issues will
be dealt with using the Unicode standards. To that end, there are two Unicode
standards that are particularly relevant:
http://www.unicode.org/reports/tr15/ Unicode Normalization Forms
http://www.unicode.org/reports/tr22/ Unicode Character Mapping Markup Language
In particular, http://unicode.org/reports/tr15/#Legacy_Encodings says "If
transcoders are implemented for legacy character sets, it is recommended
that the result be in Normalization Form C where possible." Normalization
Form C (or NFC) is defined as "Canonical Decomposition, followed by
Canonical Composition". Although in no way guaranteed, it's a pretty safe
bet that the end result of such conversions will be the precomposed
sequence.
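As a rough illustration of that recommendation, here is a small sketch. It is in Python rather than Cocoa, only because the standard unicodedata module makes the code points easy to inspect; the variable names are mine. It decodes the ISO-8859-1 byte 0xFC, normalizes both that result and the decomposed sequence to NFC, and prints the resulting code points.

```python
import unicodedata

# ISO-8859-1 encodes "ü" as the single byte 0xFC.
latin1_bytes = b"\xfc"
decoded = latin1_bytes.decode("iso-8859-1")

# The decomposed form: LATIN SMALL LETTER U + COMBINING DIAERESIS.
decomposed = "\u0075\u0308"

# NFC = canonical decomposition followed by canonical composition.
nfc_of_decoded = unicodedata.normalize("NFC", decoded)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

# Both lines print ['U+00FC'].
print([f"U+{ord(c):04X}" for c in nfc_of_decoded])
print([f"U+{ord(c):04X}" for c in nfc_of_decomposed])
```

The NSString counterpart of the normalize("NFC", ...) call would be -precomposedStringWithCanonicalMapping.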
From http://unicode.org/reports/tr15/#Norm_Forms - "Essentially, the Unicode
Normalization Algorithm puts all combining marks in a specified order, and
uses rules for decomposition and composition to transform each string into
one of the Unicode Normalization Forms. A binary comparison of the
transformed strings will then determine equivalence."
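The "specified order" part is easy to overlook, so here is a minimal sketch of it, again in Python's unicodedata as a stand-in and with names of my own choosing. The two strings carry the same marks on the same base letter but in different orders; they differ in a raw comparison and match once both are normalized.

```python
import unicodedata

# Same base letter, same two marks, entered in different orders.
# U+0307 COMBINING DOT ABOVE (canonical combining class 230)
# U+0323 COMBINING DOT BELOW (canonical combining class 220)
above_then_below = "q\u0307\u0323"
below_then_above = "q\u0323\u0307"

# A raw comparison sees different code point sequences.
raw_equal = above_then_below == below_then_above

# Normalization reorders the marks by combining class before comparing.
nfc_a = unicodedata.normalize("NFC", above_then_below)
nfc_b = unicodedata.normalize("NFC", below_then_above)
normalized_equal = nfc_a == nfc_b

print(raw_equal, normalized_equal)  # False True
```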
> You can only really guarantee that you have one or other form by asking for
> a particular canonical form; NSString has methods for that (e.g.
> -precomposedStringWithCanonicalMapping), but of course not all composed
> character sequences can be represented with precomposed characters in any
> case, and there's still the issue of surrogates, so this wouldn't really
> solve your problem.
From the -precomposedStringWithCanonicalMapping documentation: "A string
made by normalizing the receiver’s contents using the Unicode Normalization
Form C."
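To make the two caveats in the quoted text concrete, here is a rough sketch in Python as a stand-in for NSString; utf16_units is a hypothetical helper of mine, not an existing API. It shows a composed character sequence that has no precomposed form, and an Extension B character that still needs a surrogate pair in UTF-16 after normalization. Keep in mind that NSString's -length counts UTF-16 code units, whereas Python's len() counts code points.

```python
import unicodedata

def utf16_units(s):
    """Hypothetical helper: number of UTF-16 code units needed to encode s."""
    return len(s.encode("utf-16-le")) // 2

# "q" + COMBINING DIAERESIS has no precomposed code point,
# so NFC still leaves two code points.
q_diaeresis = unicodedata.normalize("NFC", "q\u0308")

# U+20000 is the first CJK Unified Ideographs Extension B character.
# Normalization does not change it, and UTF-16 needs a surrogate pair for it.
ext_b = unicodedata.normalize("NFC", "\U00020000")

print(len(q_diaeresis), utf16_units(q_diaeresis))  # 2 2
print(len(ext_b), utf16_units(ext_b))              # 1 2
```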
This thread is a bit deep at this point, so it's not entirely clear from
context, but it would seem that -precomposedStringWithCanonicalMapping
should "solve [the] problem", since NFC is specifically designed, per the
Unicode documentation, so that "A binary comparison of the transformed
strings will then determine equivalence." This, of course, assumes that
both strings have been converted with
-precomposedStringWithCanonicalMapping.
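A minimal sketch of that last point, with Python's unicodedata standing in for the NSString method and canonically_equal being a hypothetical helper name of mine: normalizing only one side is not enough, while normalizing both makes a plain binary comparison work.

```python
import unicodedata

def canonically_equal(a, b):
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00fc"   # ü as a single code point
decomposed = "u\u0308"   # u + COMBINING DIAERESIS

# Normalizing only one side still compares U+00FC against U+0075 U+0308.
one_side_only = unicodedata.normalize("NFC", precomposed) == decomposed

# Normalizing both sides yields U+00FC on each side.
both_sides = canonically_equal(precomposed, decomposed)

print(one_side_only, both_sides)  # False True
```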