Re: status of cString (was const char* to char*)
- Subject: Re: status of cString (was const char* to char*)
- From: "Louis C. Sacha" <email@hidden>
- Date: Fri, 28 Nov 2003 00:06:43 -0800
Hello...
Just to clarify my previous post (since I've received more off-list
replies to that message than spam in the last few hours, which is a
significant achievement, and the responses were pretty much split
between cString and UTF8 advocates):
I am aware that the Apple documentation for NSString specifically
says to use UTF8String instead of cString (which is deprecated and
will be removed in the future), but that makes specific (and often
incorrect) assumptions about what the string is being used for.
UTF8 encodes every unicode character as a sequence of one to four
bytes. Only the 7 bit ASCII characters are stored as single bytes;
the byte values 0x80 through 0xFF, which would normally be valid 8
bit characters, are reserved for the lead and continuation bytes of
multibyte sequences, so the 8 bit characters above ASCII are remapped
to two byte sequences. The UTF8 encoding can therefore cause errors in
code that expects a C style string if there are any multibyte
characters in the original NSString _or_ if there are any of the
standard 8 bit characters that UTF8 remaps. So there are some
NSStrings that are valid as C style strings of length x where the
UTF8 encoding has a length of x+n, because of those remapped
characters that would normally be valid 8 bit characters. In addition
to the operator error involved (from using the wrong length for the
resulting string), there are times (including pre-existing code and
platforms) when the correct representation of the string requires the
cString encoding with the full 8 bit range for characters, and UTF8
is the wrong encoding to choose.
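To make the "length x versus x+n" point concrete, here is a rough
sketch in plain C. It assumes the 8 bit encoding is ISO Latin-1 (the
default C string encoding on a particular system may well be
something else), and latin1_to_utf8 is a made-up helper name, not
anything from Foundation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strlen, for comparing lengths in the caller */

/* Hypothetical helper: convert a NUL-terminated ISO Latin-1 string to
   UTF-8. Returns the number of bytes written (not counting the
   terminator), or 0 if dst is too small. Note that 0 is also what an
   empty input returns. */
size_t latin1_to_utf8(const unsigned char *src,
                      unsigned char *dst, size_t dst_size)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        if (*src < 0x80) {
            /* 7 bit ASCII: copied through as a single byte */
            if (out + 1 >= dst_size) return 0;
            dst[out++] = *src;
        } else {
            /* 0x80..0xFF: becomes a two byte sequence in UTF-8 */
            if (out + 2 >= dst_size) return 0;
            dst[out++] = (unsigned char)(0xC0 | (*src >> 6));
            dst[out++] = (unsigned char)(0x80 | (*src & 0x3F));
        }
    }
    if (out >= dst_size) return 0;
    dst[out] = '\0';
    return out;
}
```

With the input "caf\xE9" (4 bytes as an 8 bit string), the UTF-8
result is 5 bytes long, ending in 0xC3 0xA9. Code that sized its
buffer from the 8 bit length would be one byte short.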
I think that Apple knows what they are doing, but the documentation
is wrong/misleading.
Here's my take on what Apple is trying to accomplish:
1) cString is being deprecated because Apple wants developers to
learn to distinguish between when a C style string is appropriate and
when a UTF8 string is appropriate (or at least the issues involved in
making that choice), especially since issues with multibyte
characters from international languages are becoming more common.
Apple wants developers to be aware that NSStrings (especially user
entered strings) may contain multibyte characters allowed by unicode
that would cause the cString method to raise an exception since the
string can't be converted accurately.
2) The UTF8 encoding (often accidentally referred to as the UTF*
encoding in emails when someone doesn't let go of the shift key fast
enough) preserves all of the characters in the string, using 8 bits
for characters in the standard 7 bit ASCII set, and more than 8 bits
for everything else: the multibyte unicode characters and the
extended (high bit) 8 bit characters that were bumped. So it doesn't
cause the unnecessary memory/file bloat that any 16 bit format does
for mostly-ASCII Western text, and still preserves the multibyte
characters.
3) NSString had two separate simple methods for getting C style
strings out of an NSString, - (const char *)cString and - (const char
*) lossyCString. As far as I know, Apple has only deprecated the
cString method, not the lossyCString method. The difference is that
the cString method caused an exception if the NSString contained any
unicode characters that couldn't be translated directly to the ASCII
set, but the lossyCString method just did the best job it could and
returned a usable result (which is the behavior that most people
expected from cString in the first place). As long as the NSString
contains characters that are valid for the extended ASCII set, both
methods would return the same result, the only difference is what
happens for error conditions. The reason for choosing to keep
lossyCString (instead of changing cString to not cause an exception)
is that it explicitly indicates that information in the string may be
lost as a result if the NSString contains unicode characters that
don't map to the 8 bit C style string encoding.
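The strict versus lossy distinction can be sketched in plain C. This
is only an illustration of the idea, not Apple's implementation: the
function names are made up, the 8 bit encoding is assumed to be ISO
Latin-1, the -1 return stands in for the exception, and substituting
'?' is my assumption about what "the best job it could" might look
like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strict conversion of 16 bit characters to an 8 bit
   C string. Returns 0 on success, or -1 if the buffer is too small or
   any character does not fit in 8 bits (stand-in for the exception). */
int to_8bit_strict(const unsigned short *src, size_t len,
                   unsigned char *dst, size_t dst_size)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    if (len + 1 > dst_size) return -1;

    for (i = 0; i < len; i++) {
        if (src[i] > 0xFF) return -1;   /* cannot be represented */
        dst[i] = (unsigned char)src[i];
    }
    dst[len] = '\0';
    return 0;
}

/* Hypothetical lossy conversion: never fails on content, replaces
   anything that does not fit with '?'. Returns the number of
   characters replaced, or -1 if the buffer is too small. */
int to_8bit_lossy(const unsigned short *src, size_t len,
                  unsigned char *dst, size_t dst_size)
{
    size_t i;
    int replaced = 0;

    assert(src != NULL && dst != NULL);
    if (len + 1 > dst_size) return -1;

    for (i = 0; i < len; i++) {
        if (src[i] > 0xFF) {
            dst[i] = '?';               /* information is lost here */
            replaced++;
        } else {
            dst[i] = (unsigned char)src[i];
        }
    }
    dst[len] = '\0';
    return replaced;
}
```

For input that fits entirely in 8 bits, both functions produce the
same bytes; they only differ in what happens when a character doesn't
fit.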
My interpretation of what Apple should have said in the NSString documentation:
If you want a C style string, use - (const char*)lossyCString and be
aware that you might lose information about multibyte characters. If
you are working with NSStrings that might contain multibyte unicode
characters (which is especially important for localization, and user
entered strings) make sure to use - (const char*)UTF8String so that
no information is lost.
If your code works correctly now using the cString method (without
causing exceptions), and doesn't involve interaction with user entered
strings (or multibyte characters are not valid input anyway), you
should be able to safely change your code to call the lossyCString
method instead and everything should continue to work just fine
(assuming that Apple doesn't decide to kill it in the future).
If you do need to use UTF8String, just remember that the length of
the UTF8 string != [NSString length], even if you are only using 8
bit characters, since the 8 bit characters outside 7 bit ASCII are
remapped to two bytes each.
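A small C sketch of the mismatch: [NSString length] counts 16 bit
(UTF-16) units, while strlen() on the UTF8 string counts bytes. The
helper below (a hypothetical name, and it assumes its input is valid
UTF-8) computes how many 16 bit units a UTF-8 buffer corresponds to,
so the two numbers can be compared side by side:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strlen, for comparing lengths in the caller */

/* Hypothetical helper: number of UTF-16 units needed for a valid,
   NUL-terminated UTF-8 string. */
size_t utf16_units_in_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t units = 0;

    assert(s != NULL);

    for (; *p != '\0'; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                    /* continuation byte: skip */
        /* Lead bytes 0xF0 and up start four byte sequences, which
           take two 16 bit units (a surrogate pair); all others take
           one. */
        units += (*p >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For "caf\xC3\xA9" (the UTF-8 bytes for a four character string ending
in e-acute), strlen() gives 5 while this helper gives 4. Always size
buffers from the byte count of the UTF8 string itself, never from the
NSString length.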
My question:
Does anyone know what the official Apple policy is on the continued
existence of lossyCString? Based on the documentation I've read, it
is not being deprecated, and is the proper way to get a C style string
out of an NSString. It would be nice to have an official answer,
though, especially since there seems to be a great deal of confusion
regarding which cString methods are on the chopping block. What is
the official Apple policy for getting and using cStrings from
NSStrings (and where is it documented, if it is)?
Sorry for the long post, but I think this is an important issue...
Louis