Re: String Encoding Detection (Revisited)
- Subject: Re: String Encoding Detection (Revisited)
- From: David Remahl <email@hidden>
- Date: Fri, 8 Aug 2003 22:13:15 +0200
And we're back where we started...
You will have to resort to higher-level methods to find this out. If
it is a well defined file format, then you should probably know if it
is UTF-8 or not. If it is a text file, then you may have to guess based
on contents (auto-detection), or ask the user. TextEdit has a default
text encoding for reading and one for saving. Other methods would
include some OS functionality for storing text encoding data as
meta-data in the file system. But Apple doesn't provide anything like
that right now.
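To make the "guess based on contents" option concrete, here is a minimal sketch in Python (not Cocoa, and not how TextEdit does it). It relies on the fact that arbitrary 8-bit text containing non-ASCII bytes is rarely well-formed UTF-8, so a strict decode works as a reasonable test. The function name guess_text and the mac_roman fallback are assumptions made up for this illustration; a real application would take the fallback from a user preference.

```python
def guess_text(data, fallback="mac_roman"):
    """Return (text, encoding_name) using a simple content-based guess.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding raises on any byte sequence that is not
        # well-formed UTF-8, so success is a fairly strong hint.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8: fall back to a caller-chosen 8-bit encoding.
        # errors="replace" keeps the fallback from raising on odd bytes.
        return data.decode(fallback, errors="replace"), fallback
```

Pure ASCII data comes back labelled as UTF-8, which is harmless since ASCII is a subset of it. This is still only a guess, so asking the user remains the safe last resort.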
I did a search for "UTF-8 BOM considered harmful" on Google. You may
want to consider doing the same. There are convincing arguments why it
is a ReallyBadIdea.
/ Rgds, David
On Friday, 8 August 2003, at 21:26, Francisco Tolmasky wrote:
Then how do I determine if it's UTF-8?
On Friday, August 8, 2003, at 12:17  PM, David Elliott wrote:
On Friday, August 8, 2003, at 08:27 AM, Clark S. Cox III wrote:
On Thursday, August 07, 2003, at 23:39, Francisco Tolmasky wrote:
How do I determine if the data is big endian or little endian?
Or is just checking for both FEFF and FFFE enough?  Also, is there no
big/little-endian difference in UTF-8?  (Do I check for any of "EF BB
BF", or for all of those one after the other?)
Yes, checking the BOM (if it exists) will tell you the byte order of
the data; that's all you need. And yes, there is no endian difference
in UTF-8 (as it's an 8-bit encoding).
Furthermore, you should NOT assume that a file beginning with EF BB
BF is necessarily UTF-8.  That is a valid 3-character string in any
normal 8-bit encoding.  If you do determine that the file is UTF-8,
then you can go ahead and remove the BOM if you wish.  And because
certain badly behaved editors add a BOM to UTF-8, you must be prepared to strip it.  As
someone else mentioned, the Unicode people consider it a really bad
idea to use a BOM in UTF-8, for this and other reasons.
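For what it's worth, the BOM check discussed above can be sketched in a few lines of Python. The function name sniff_bom is made up for this illustration; the BOM constants come from the standard codecs module. The result is only a hint, for the reason given above: the same three bytes are perfectly valid text in an 8-bit encoding. UTF-32 is deliberately not handled here (its little-endian BOM also begins with FF FE).

```python
import codecs

def sniff_bom(data):
    """Return (encoding_hint, bom_length); the hint is None without a BOM.

    Hypothetical helper for illustration only. UTF-32 is not handled.
    """
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF, all three in order
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le", 2
    return None, 0
```

If the hint turns out to be right, the BOM can be dropped before decoding with something like data[n:].decode(hint), where hint and n are the two returned values. To see why it is only a hint: decoding the bytes EF BB BF as Latin-1 gives the three characters "ï»¿".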
-Dave