Re: status of cString (was const char* to char*)

Subject: Re: status of cString (was const char* to char*)
From: "Louis C. Sacha" <email@hidden>
Date: Fri, 28 Nov 2003 00:06:43 -0800

Hello...

Just to clarify my previous post (since I've received more off-list replies to that message than spam in the last few hours, which is a significant achievement, and the responses were pretty much split between cString and UTF8 advocates):

I am aware that the Apple documentation for NSString specifically says to use UTF8String instead of cString (which is deprecated and will be removed in the future), but that makes specific (and often incorrect) assumptions about what the string is being used for. Because UTF8 is designed to represent every unicode character as a sequence of 8 bit bytes, only the 7 bit ASCII characters keep their single byte values. Byte values of 0x80 and above are reserved for the lead and continuation bytes of multibyte sequences, so every character that a single byte encoding puts in the upper half of the 8 bit range (the extended ASCII characters) is encoded as two or more bytes. The UTF8 encoding can therefore cause errors in code that expects a C style string with one byte per character if there are any non-ASCII characters in the original NSString, whether they are characters outside the 8 bit range _or_ ordinary extended ASCII characters. So there are some NSStrings that are valid as C style strings of length x where the UTF8 encoding has a length of x+n, because of those extended characters that would normally be valid 8 bit characters. In addition to the operator error involved (from using the wrong length for the resulting string), there are times (including pre-existing code and platforms) when the correct representation of the string requires cString encoding with the full 8bit range for characters, and UTF8 is the wrong encoding to choose.
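
To make the length difference concrete, here is a minimal sketch in plain C of how a single 16 bit character value turns into UTF8 bytes. It is not anything from Foundation: the function name utf8_encode_bmp is made up for illustration, and it only handles single 16 bit values (it ignores surrogate pairs).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Encodes one 16 bit character value as UTF8.
   Writes 1 to 3 bytes into out and returns how many were written.
   Surrogate values are not treated specially in this sketch. */
size_t utf8_encode_bmp(uint16_t cp, unsigned char out[3])
{
    assert(out != NULL);

    if (cp < 0x80) {
        /* 7 bit ASCII: stored unchanged in one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 0x80..0x7FF (includes the upper half of the 8 bit range):
           one lead byte 110xxxxx plus one continuation byte 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 0x800..0xFFFF: lead byte 1110xxxx plus two continuation bytes */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

With this, 'A' (0x41) stays one byte, but an e with acute accent (0xE9), which fits in one byte in an 8 bit encoding, comes out as the two bytes 0xC3 0xA9.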

I think that Apple knows what they are doing, but the documentation is wrong/misleading.

Here's my take on what Apple is trying to accomplish:

1) cString is being deprecated because Apple wants developers to learn to distinguish between when a C style string is appropriate and when a UTF8 string is appropriate (or at least the issues involved in making that choice), especially since issues with multibyte characters from international languages are becoming more common. Apple wants developers to be aware that NSStrings (especially user entered strings) may contain multibyte characters allowed by unicode that would cause the cString method to raise an exception since the string can't be converted accurately.

2) The UTF8 encoding (often accidentally referred to as the UTF* encoding in emails when someone doesn't let go of the shift key fast enough) preserves all of the characters in the string, using 8 bits for characters in the standard 7bit ASCII set, and only uses more than 8 bits for everything else (the extended ASCII chars take 16 bits, and other unicode characters take 16 bits or more). So it doesn't cause the unnecessary memory/file bloat that any 16 bit format does for text that is mostly ASCII, and still preserves the multibyte characters.

3) NSString had two separate simple methods for getting C style strings out of an NSString, - (const char *)cString and - (const char *)lossyCString. As far as I know, Apple has only deprecated the cString method, not the lossyCString method. The difference is that the cString method caused an exception if the NSString contained any unicode characters that couldn't be translated directly to the ASCII set, but the lossyCString method just did the best job it could and returned a usable result (which is the behavior that most people expected from cString in the first place). As long as the NSString contains characters that are valid for the extended ASCII set, both methods would return the same result; the only difference is what happens for error conditions. The reason for choosing to keep lossyCString (instead of changing cString to not cause an exception) is that it explicitly indicates that information in the string may be lost as a result if the NSString contains unicode characters that don't map to the 8 bit C style string encoding.
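
The strict versus lossy distinction in 3) can be sketched in plain C. This is not Apple's implementation: both function names are made up, the 8 bit target is assumed to be ISO Latin-1 (where values 0x00 to 0xFF map straight across) purely to keep the sketch short, and the '?' substitution is my assumption about what "doing the best job it could" might look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict conversion, loosely mirroring cString:
   fails if any 16 bit unit has no 8 bit equivalent.
   dst must have room for n + 1 bytes.
   Returns 0 on success, -1 on failure (standing in for the exception). */
int to_cstring_strict(const uint16_t *src, size_t n, char *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] > 0xFF)
            return -1;              /* cannot be represented in 8 bits */
        dst[i] = (char)(unsigned char)src[i];
    }
    dst[n] = '\0';
    return 0;
}

/* Hypothetical lossy conversion, loosely mirroring lossyCString:
   always produces a usable string, substituting '?' (an assumption)
   for anything that does not fit in 8 bits.
   Returns the number of characters that were replaced. */
size_t to_cstring_lossy(const uint16_t *src, size_t n, char *dst)
{
    size_t replaced = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] > 0xFF) {
            dst[i] = '?';
            replaced++;
        } else {
            dst[i] = (char)(unsigned char)src[i];
        }
    }
    dst[n] = '\0';
    return replaced;
}
```

For input that fits in 8 bits the two functions fill dst identically; they only differ when something can't be converted, which is the point of keeping the "lossy" name.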

My interpretation of what Apple should have said in the NSString documentation:

If you want a C style string, use - (const char*)lossyCString and be aware that you might lose information about multibyte characters. If you are working with NSStrings that might contain multibyte unicode characters (which is especially important for localization, and user entered strings) make sure to use - (const char*)UTF8String so that no information is lost.

If your code works correctly now using the cString method (without causing exceptions), and doesn't involve interaction with user entered strings (or multibyte characters are not valid input anyway), you should be able to safely change your code to call the lossyCString method instead and everything should continue to work just fine (assuming that Apple doesn't decide to kill it in the future).

If you do need to use UTF8String, just remember that the length of the UTF8 string != [NSString length] as soon as the string contains anything beyond 7 bit ASCII, even if you are only using 8 bit characters, since every extended 8 bit character is encoded as 16 bits in UTF8.
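
As a rough illustration of how far the two lengths can drift apart, here is a sketch in plain C that counts the UTF8 bytes needed for an array of 16 bit units (which is what [NSString length] counts). The function name is made up, and treating an unpaired surrogate as 3 bytes is my assumption, not documented behavior. In real code, calling strlen() on the result of UTF8String gives you the byte count directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Returns the number of UTF8 bytes (not counting the terminating NUL)
   needed to encode n 16 bit units. */
size_t utf8_length_of_utf16(const uint16_t *units, size_t n)
{
    size_t bytes = 0;
    assert(units != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        uint16_t u = units[i];
        if (u < 0x80) {
            bytes += 1;                      /* 7 bit ASCII */
        } else if (u < 0x800) {
            bytes += 2;                      /* includes extended 8 bit chars */
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
                   units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            bytes += 4;                      /* surrogate pair: one character */
            i++;                             /* skip the low surrogate */
        } else {
            bytes += 3;                      /* rest of the 16 bit range;
                                                unpaired surrogates are counted
                                                as 3 bytes here (assumption) */
        }
    }
    return bytes;
}
```

So a four character string like "caf" followed by an e with acute accent has a length of 4 as far as NSString is concerned, but needs 5 bytes (plus the terminating NUL) as UTF8.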

My question:

Does anyone know what the official Apple policy is on the continued existence of lossyCString? Based on the documentation I've read, it is not being deprecated, and is the proper way to get a C style string out of an NSString. It would be nice to have an official answer, though, especially since there seems to be a great deal of confusion regarding which cString methods are on the chopping block. What is the official Apple policy for getting and using cStrings from NSStrings (and where is it documented, if it is)?

Sorry for the long post, but I think this is an important issue...

Louis

Follow-Ups:
- Re: status of cString (was const char* to char*)
  - From: Ali Ozer <email@hidden>

References:
	>const char* to char* (From: David Cairns <email@hidden>)
	>Re: const char* to char* (From: Prachi Gauriar <email@hidden>)
	>status of cString (was const char* to char*) (From: "Louis C. Sacha" <email@hidden>)
