BOM and UTF-8 (was Re: Question about line breaks and file types)

Subject: BOM and UTF-8 (was Re: Question about line breaks and file types)
From: Dustin Voss <email@hidden>
Date: Mon, 4 Aug 2003 20:44:21 -0700

On Monday, August 4, 2003, at 11:07 AM, Chuck Soper wrote:

On the subject of Unicode text files, I have some questions about the byte order mark (BOM). Unicode text files may or may not contain a byte order mark at the beginning of the file. The following code automatically recognizes the encoding as UTF-8 only if the file has a byte order mark.
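NSString * myFile = @"~/myUTF8File.txt";
// Expand the tilde and resolve the path before reading.
myFile = [myFile stringByStandardizingPath];
// The encoding is detected from the BOM, if the file has one.
NSString * source = [NSString stringWithContentsOfFile:myFile];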

TextEdit does not write a BOM when saving a UTF-8 file, so I use BBEdit. The above code fails with TextEdit UTF-8 files. I assume that I could probably add a line of code to change the encoding, but I want the code to recognize the encoding.

Should my code be changed to better recognize encodings?
Should TextEdit write a byte order mark for Unicode files?
Chuck

NSString (and CFString, which behaves the same way) includes a BOM in data encoded as "NSUnicodeStringEncoding", but does not include one in data encoded as "NSUTF8StringEncoding". Apple hasn't provided any other control over BOMs. They must have assumed that developers would use the encoding/decoding methods to read and write text files in a cross-platform manner. They included a BOM in UTF-16 because it is needed there to determine the byte order; UTF-8 is a sequence of single bytes, so it has no byte order to mark.
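
To make the byte-order point concrete, here is a minimal sketch in plain C (which also compiles as Objective-C) of checking the first bytes of a buffer for a BOM. The names bom_kind and detect_bom are made up for illustration; they are not part of Cocoa or Core Foundation, and I am not claiming this is how NSString does it. The sketch only covers UTF-8 and UTF-16 and ignores UTF-32.

```c
#include <stddef.h>

/* Hypothetical result type: which BOM, if any, the data starts with. */
typedef enum {
    BOM_NONE,      /* no BOM: the encoding must be known or guessed another way */
    BOM_UTF8,      /* bytes EF BB BF */
    BOM_UTF16_BE,  /* bytes FE FF */
    BOM_UTF16_LE   /* bytes FF FE */
} bom_kind;

/* Inspect the first bytes of a buffer and report the BOM it starts with.
   UTF-32 BOMs are deliberately not handled in this sketch. */
static bom_kind detect_bom(const unsigned char *bytes, size_t len)
{
    if (len >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

A UTF-8 file saved without a BOM comes back as BOM_NONE here, which is exactly the TextEdit case in the question: the bytes alone don't announce their encoding.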

I don't think that Apple made the right decision there. If a UTF-16 string is part of a larger file format, the file format will probably hard-code the byte order and not use a BOM. So, when decoding, the API should allow you to specify the byte order manually, and when encoding, the API should let you leave the BOM off.

What Apple should do is add two more encodings: "NSUnicodeLEStringEncoding" and "NSUnicodeBEStringEncoding". When you use them to encode a string, the resulting data should not include the BOM, and when decoding data, the system should assume the byte order is as you specify. "NSUnicodeStringEncoding" should continue to work as it does now: include the BOM when encoding and guess when decoding.
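
As a rough illustration of what such explicit byte-order encodings would do, here is a sketch in plain C. It is not an Apple API: byte_order, utf16_units_to_bytes, and utf16_bytes_to_units are hypothetical names. The caller picks the byte order, no BOM is written when encoding, and none is looked for when decoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector for the byte order the caller wants. */
typedef enum { ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN } byte_order;

/* Write UTF-16 code units as bytes in the requested order, with no BOM.
   out must have room for 2 * count bytes. Returns the bytes written. */
static size_t utf16_units_to_bytes(const uint16_t *units, size_t count,
                                   byte_order order, unsigned char *out)
{
    for (size_t i = 0; i < count; i++) {
        unsigned char hi = (unsigned char)(units[i] >> 8);
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        out[2 * i]     = (order == ORDER_BIG_ENDIAN) ? hi : lo;
        out[2 * i + 1] = (order == ORDER_BIG_ENDIAN) ? lo : hi;
    }
    return 2 * count;
}

/* Read bytes as UTF-16 code units, trusting the caller's byte order
   instead of looking for a BOM. A trailing odd byte is ignored.
   out must have room for len / 2 units. Returns the units read. */
static size_t utf16_bytes_to_units(const unsigned char *bytes, size_t len,
                                   byte_order order, uint16_t *out)
{
    size_t count = len / 2;
    for (size_t i = 0; i < count; i++) {
        unsigned char b0 = bytes[2 * i];
        unsigned char b1 = bytes[2 * i + 1];
        out[i] = (order == ORDER_BIG_ENDIAN)
                     ? (uint16_t)((b0 << 8) | b1)
                     : (uint16_t)((b1 << 8) | b0);
    }
    return count;
}
```

This is the behavior a file format with a hard-coded byte order needs: the same two code units come out as 00 41 20 AC in big-endian order and 41 00 AC 20 in little-endian order, with nothing extra in front.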

They might also want to rename "NSUnicodeStringEncoding" to "NSUnicodeBOMStringEncoding" and add an "NSUTF8BOMStringEncoding" as well. The latter would be identical to "NSUTF8StringEncoding", but would include the BOM when encoding.
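
For the UTF-8 case, the difference between the two encodings would be just three bytes. A minimal sketch in plain C, assuming the text is already valid UTF-8; utf8_add_bom and utf8_skip_bom are names I made up for this example, not existing Cocoa calls:

```c
#include <stddef.h>
#include <string.h>

/* The UTF-8 encoding of U+FEFF, the byte order mark. */
static const unsigned char UTF8_BOM[3] = {0xEF, 0xBB, 0xBF};

/* Copy UTF-8 bytes into out with a BOM in front.
   out must have room for len + 3 bytes. Returns the bytes written. */
static size_t utf8_add_bom(const unsigned char *utf8, size_t len,
                           unsigned char *out)
{
    memcpy(out, UTF8_BOM, 3);
    memcpy(out + 3, utf8, len);
    return len + 3;
}

/* If the buffer starts with a BOM, return a pointer just past it and
   shrink *len by 3; otherwise return the buffer unchanged. */
static const unsigned char *utf8_skip_bom(const unsigned char *bytes,
                                          size_t *len)
{
    if (*len >= 3 && memcmp(bytes, UTF8_BOM, 3) == 0) {
        *len -= 3;
        return bytes + 3;
    }
    return bytes;
}
```

A reader that calls something like utf8_skip_bom handles files from both kinds of editor, those that write the BOM and those that leave it off.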

I think I will enter a request into RadarWeb.

This was all about NSString, of course, but you originally asked about TextEdit. I agree that TextEdit should include the BOM in UTF-8. What's the harm? If the file uses Unicode later, this will make it obvious up front, and if the file does not use Unicode, it should be ASCII-encoded anyway. I will enter a request for that as well.
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.

Follow-Ups:
- Re: BOM and UTF-8 (was Re: Question about line breaks and file types)
  - From: Andreas Mayer <email@hidden>

References:
	>Re: Question about line breaks and file types (From: Chuck Soper <email@hidden>)
