BOM and UTF-8 (was Re: Question about line breaks and file types)
- Subject: BOM and UTF-8 (was Re: Question about line breaks and file types)
- From: Dustin Voss <email@hidden>
- Date: Mon, 4 Aug 2003 20:44:21 -0700
On Monday, August 4, 2003, at 11:07 AM, Chuck Soper wrote:
On the subject of Unicode text files, I have some questions about the
byte order mark (BOM). Unicode text files may or may not contain a
byte order mark at the beginning of the file. The following code
automatically recognizes the encoding as UTF-8 only if the file has a
byte order mark.
NSString * myFile = @"~/myUTF8File.txt";
myFile = [myFile stringByStandardizingPath];
NSString * source = [NSString stringWithContentsOfFile:myFile];
TextEdit does not write a BOM when saving a UTF-8 file so I use
BBEdit. The above code fails with TextEdit UTF-8 files. I assume that
I could probably add a line of code to change the encoding, but I
want the code to recognize the encoding.
Should my code be changed to better recognize encodings?
Should TextEdit write a byte order mark for Unicode files?
Chuck
NSString (and CFString, which behaves the same) includes a BOM in
"NSUnicodeStringEncoding"-encoded data, but does not include a BOM in
"NSUTF8StringEncoding"-encoded data. Apple hasn't provided any other
control over BOMs. They must have assumed that developers will be using
the encoding/decoding methods to read or write from text files in a
cross-platform manner. They included a BOM in UTF-16 because the BOM is
necessary there to determine byte order, but it isn't necessary in
UTF-8.
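To make the distinction concrete, here is a minimal sketch in plain C (callable from Objective-C) of how code could sniff the first bytes of a file for a BOM. The names BomKind and detect_bom are hypothetical and are not part of any Apple API. The sketch ignores UTF-32 and only reports what the leading bytes say. A file with no BOM still has to be identified some other way.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; not part of NSString or CFString. */
typedef enum {
    BOM_NONE,      /* no BOM found; the encoding must be decided another way */
    BOM_UTF8,      /* bytes EF BB BF */
    BOM_UTF16_BE,  /* bytes FE FF */
    BOM_UTF16_LE   /* bytes FF FE */
} BomKind;

/* Report which BOM, if any, the buffer starts with.
   *bom_len receives the number of bytes to skip before the text itself. */
static BomKind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16_BE;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16_LE;
    }
    return BOM_NONE;
}
```

A TextEdit-style UTF-8 file would come back as BOM_NONE here, which is exactly the case the original code cannot recognize.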
I don't think that Apple made the right decision there. If a UTF-16
string is part of a larger file format, the file format will probably
hard-code the byte order and not use a BOM. So, when decoding, the API
should allow you to specify the byte-order manually, and when encoding,
the API should leave the BOM off.
What Apple should do is add two more encodings:
"NSUnicodeLEStringEncoding" and "NSUnicodeBEStringEncoding". When you
use them to encode a string, the resulting data should not include the
BOM, and when decoding data, the system should assume the byte order is
as you specify. "NSUnicodeStringEncoding" should continue to work as it
does now: include the BOM when encoding and guess when decoding.
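As a rough illustration of what those two encodings would do, here is a sketch in plain C that writes and reads UTF-16 code units with a caller-specified byte order and no BOM. ByteOrder, utf16_encode and utf16_decode are hypothetical helpers made up for this example. They are not existing Cocoa or Core Foundation calls, and they do no validation of surrogate pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the proposed LE/BE encodings. */
typedef enum { ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN } ByteOrder;

/* Write `count` UTF-16 code units to `out` (which must hold 2 * count bytes)
   in the requested byte order, with no BOM. Returns the bytes written. */
static size_t utf16_encode(const uint16_t *units, size_t count,
                           ByteOrder order, unsigned char *out)
{
    for (size_t i = 0; i < count; i++) {
        unsigned char hi = (unsigned char)(units[i] >> 8);
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        out[2 * i]     = (order == ORDER_BIG_ENDIAN) ? hi : lo;
        out[2 * i + 1] = (order == ORDER_BIG_ENDIAN) ? lo : hi;
    }
    return 2 * count;
}

/* Read byte pairs from `in` as code units, trusting the byte order the
   caller specifies instead of looking for a BOM. A trailing odd byte is
   ignored. Returns the number of code units stored in `units`. */
static size_t utf16_decode(const unsigned char *in, size_t len,
                           ByteOrder order, uint16_t *units)
{
    size_t count = len / 2;
    for (size_t i = 0; i < count; i++) {
        unsigned char a = in[2 * i];
        unsigned char b = in[2 * i + 1];
        units[i] = (order == ORDER_BIG_ENDIAN)
                       ? (uint16_t)((a << 8) | b)
                       : (uint16_t)((b << 8) | a);
    }
    return count;
}
```

A file format that hard-codes its byte order could call these directly and never see a BOM in its string fields.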
They might also want to rename "NSUnicodeStringEncoding" to
"NSUnicodeBOMStringEncoding" and add an "NSUTF8BOMStringEncoding" as
well. The latter would be identical to "NSUTF8StringEncoding", but
would include the BOM when encoding.
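Until something like that exists, the effect is easy to approximate by hand. The following plain C sketch prepends the three-byte UTF-8 BOM to data that has already been encoded as UTF-8. The function name utf8_with_bom is hypothetical, and the caller is assumed to supply an output buffer of at least len + 3 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes of UTF-8 into `out`, preceded by
   the UTF-8 BOM (EF BB BF). `out` must have room for len + 3 bytes.
   Returns the total number of bytes written. */
static size_t utf8_with_bom(const unsigned char *utf8, size_t len,
                            unsigned char *out)
{
    static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
    memcpy(out, bom, sizeof bom);
    if (len > 0) {
        memcpy(out + sizeof bom, utf8, len);
    }
    return len + sizeof bom;
}
```

The input could be the bytes produced by encoding a string with "NSUTF8StringEncoding". The output is what a BOM-aware reader would then recognize as UTF-8.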
I think I will enter a request into RadarWeb.
This was all about NSString, of course, but you originally asked about
TextEdit. I agree that TextEdit should include the BOM in UTF-8. What's
the harm? If the file uses Unicode later, this will make it obvious up
front, and if the file does not use Unicode, it should be ASCII-encoded
anyway. I will enter a request for that as well.
_______________________________________________
cocoa-dev mailing list | email@hidden