Re: using UTF-32 in NSString.
- Subject: Re: using UTF-32 in NSString.
- From: John Engelhart <email@hidden>
- Date: Tue, 29 Jun 2010 19:11:55 -0400
On Sun, Jun 27, 2010 at 5:18 PM, Georg Seifert <email@hidden> wrote:
> Hi,
>
> Does anyone have information on how to use Unicode code points higher than
> 0xFFFF?
> I need to add some supplementary multilingual plane code points to an
> NSString.
>
> I can use something like this:
> NSString *aString = @"\U0001ABCD"; // this prints fine but
> [aString length] is 2
>
> But if I have the Unicode value as an int (unichar is too small):
> int Char = 0x1ABCD;
> NSString *aString = [NSString stringWithFormat:@"%C", Char]; // The
> resulting string contains one character with a Unicode value of "ABCD".
>
> What is the recommended way to use/create UTF-32 strings in Cocoa?
>
Others have pointed out some other solutions to this problem, but I thought
I'd toss in my $0.14143^2.
Taken from
http://regexkit.sourceforge.net/RegexKitLite/index.html#RegexKitLiteCookbook,
specifically the "Enhanced Copy To Clipboard Functionality" section,
which
is only visible when using Safari, so be sure that's the browser you're
using.
Use C99 \u character escapes
Normally, Unicode characters are embedded in string literals as the
character's UTF-8 byte sequence using \ddd octal escapes. When this option is
enabled, the C99 \u and \U character escape sequences are used instead. gcc
will issue a warning if \u character escape sequences are present and the
compiler is not configured to use the C99 (or later) standard (i.e., gcc
-std=(c|gnu)99).
Under the C99 standard, \u and \U are used to specify a universal character
name, which is a character encoded in the ISO/IEC 10646 character set
(essentially identical to Unicode in this context). Ultimately, a universal
character name is translated into the sequence of bytes needed to represent
the designated character in the C environment's execution character set.
Usually, although certainly not always, a string literal should be encoded
as UTF-8, which happens to be the default execution character set for gcc.
This is an important point to remember because the more convenient and
easier-to-use \u escape sequences are not guaranteed to convert into a
specific sequence of bytes, unlike an octal \ddd or hex \xhh escape
sequence. There is currently no way to specify that a particular string
literal should always be translated using a specific character set encoding.
This may result in undefined behavior if the \u universal character name is
not translated into the expected character set, which in this case must be
UTF-8.
Escaped Unicode in NSString literals
Prior to Xcode 3.0, gcc only supported the use of ASCII characters (i.e.,
characters ≤ 127) in constant NSString literals. If one needed to include
Unicode characters in an NSString, one would typically convert the string
into UTF-8, and then create an NSString at run time using the
stringWithUTF8String: method, with the UTF-8 encoded C string passed as the
argument. For example, "€1.99", which contains the € euro symbol, would be
created using the following:
NSString *euroString = [NSString stringWithUTF8String:"\342\202\2541.99"];
// or with C99 \u character escapes:
NSString *euroString = [NSString stringWithUTF8String:"\u20ac1.99"];
One of the obvious disadvantages of this approach is that it instantiates a
new, autoreleased NSString each time it's used, unlike a constant NSString
literal like @"$1.99". Beginning with Xcode 3.0 and gcc 4.0, constant
NSString literals that contain Unicode characters can be specified directly
in source-code using the standard @"" syntax. For example:
NSString *euroString = @"\342\202\2541.99";
// or with C99 \u character escapes:
NSString *euroString = @"\u20ac1.99";
The compiler converts these strings to UTF-16 using the endianness of the
target architecture. Since the Mach-O object file format allows for multiple
architectures, each architecture can encode the string in its native UTF-16
byte ordering, so there are no issues with proper byte ordering. Within the
object file itself, these strings are essentially identical to their
ASCII-only counterparts: effectively they are pre-instantiated objects. The
only real difference is that the compiler sets some internal CFString bits
differently so that the CFString object knows that the string's data is
encoded as UTF-16 and not simple 8-bit data.
Although this functionality has been present since the release of 10.5, it
has only recently been documented in The Objective-C 2.0 Programming
Language - Compiler Directives, under the @"string" entry. A copy of the
relevant text is provided below:
On Mac OS X v10.4 and earlier, the string must be 7-bit ASCII-encoded. On
Mac OS X v10.5 and later (with Xcode 3.0 and later), you can also use UTF-16
encoded strings. (The runtime from Mac OS X v10.2 and later supports UTF-16
encoded strings, so if you use Mac OS X v10.5 to compile an application for
Mac OS X v10.2 and later, you can use UTF-16 encoded strings.)
--------
Some other points:
To encode a Unicode code point that is > 0xFFFF, you have two options if you
are using Xcode >= 3.0:
1) The easy way: use \U (note the uppercase U), which takes the form
"\UHHHHHHHH", where H is [0-9a-fA-F].
2) The hard way: use \u (note the lowercase u), which requires you to
manually convert the code point into UTF-16 surrogate pairs.
An example of the character "𝄞", or U+1D11E, MUSICAL SYMBOL G CLEF:
1) @"\U0001D11E"
2) @"\uD834\uDD1E"
Note: there is only a single \ (backslash) in the above. Some people
have said you should use @"\\U0001D11E", but that would give you the literal
string "\U0001D11E", which is not what you want.
If you're stuck using a toolchain where Xcode < 3.0, the "preferred" way
(for some value of preferred) is to use the UTF-8 encoding of the character,
as follows:
3) [NSString stringWithUTF8String:"\360\235\204\236"]
Of course, if you're using Xcode >= 3.0 -AND- your source code is kept in
UTF-8 (it might work for some other encodings, but UTF-8 is the preferred and
recommended source code encoding anyway), the absolute easiest way by far is:
4) @"𝄞"
In other words, you can just paste text that can be represented in Unicode
directly into your source code. The compiler will convert it to UTF-16
encoded constant NSStrings and store them in your object file.
_______________________________________________
Cocoa-dev mailing list (email@hidden)