Re: String Encoding Detection (Revisited)

Subject: Re: String Encoding Detection (Revisited)
From: David Remahl <email@hidden>
Date: Fri, 8 Aug 2003 23:16:45 +0200

Here is one algorithm:

http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html

I haven't tried it, but I suggest you search some more on the net for various algorithms used to tell text encodings apart.

Just remember that all text files are just bytes. The same byte sequence may mean totally different things depending on the encoding, and both readings may be equally valid (as long as you don't know which encoding was actually used to encode the text file). The autodetection algorithms you find will invariably be based on heuristics. You should at least give the user the option of explicitly specifying an encoding.
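To make the point concrete, here is a minimal Python sketch (not part of the original thread) that decodes one two-byte sequence under three encodings. The bytes C3 A9 are chosen purely for illustration; every decode succeeds, and nothing in the bytes themselves says which result was intended.

```python
# The same two bytes, read under three different encodings.
raw = bytes([0xC3, 0xA9])

as_utf8 = raw.decode("utf-8")          # one character: U+00E9
as_latin1 = raw.decode("latin-1")      # two characters: U+00C3, U+00A9
as_macroman = raw.decode("mac_roman")  # two characters again, but different ones

for name, text in [("utf-8", as_utf8),
                   ("latin-1", as_latin1),
                   ("mac_roman", as_macroman)]:
    # Show the code points so the output does not depend on the terminal.
    print(name, [hex(ord(ch)) for ch in text])
```

All three decodes complete without error, which is why any detector can only guess.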

/ Rgds, David

On Friday, 8 August 2003, at 22:43, Francisco Tolmasky wrote:

I know we may end up in a circle, but how will I "know it is UTF-8", and what auto-detection should I use? I've tried text sniffers but they've been pretty unsuccessful, especially when dealing with Unicode. Maybe they worked better back in the OS 9 days, but they certainly don't seem to give good results now.

On Friday, August 8, 2003, at 01:13 PM, David Remahl wrote:

And we're back where we started...

You will have to resort to higher-level methods to find this out. If it is a well-defined file format, then the format itself should tell you whether it is UTF-8 or not. If it is a plain text file, then you may have to guess based on the contents (auto-detection), or ask the user. TextEdit has a default text encoding for reading and one for saving. Other methods would include some OS functionality for storing text encoding data as metadata in the file system, but Apple doesn't provide anything like that right now.
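One way to combine these options is sketched below in Python. It is only an illustration of the "guess, or ask the user" idea, not the algorithm from the linked post: the function name decode_text, its parameters, and the mac_roman fallback are all assumptions made for this example. The guess relies on the observation that non-ASCII text in a legacy 8-bit encoding is rarely also valid UTF-8.

```python
from typing import Optional, Tuple

def decode_text(data: bytes,
                user_encoding: Optional[str] = None,
                fallback: str = "mac_roman") -> Tuple[str, str]:
    """Return (text, encoding_used). Hypothetical helper for illustration."""
    # 1. An explicit choice by the user always wins.
    if user_encoding is not None:
        return data.decode(user_encoding), user_encoding
    # 2. Heuristic: accept UTF-8 only if the whole input decodes strictly.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Otherwise use the configured default, like an editor's
    #    "default encoding for reading" preference.
    return data.decode(fallback, errors="replace"), fallback

print(decode_text("na\u00efve".encode("utf-8"))[1])
print(decode_text(b"na\xefve", fallback="latin-1")[1])
```

Pure ASCII input passes the UTF-8 test as well, which is harmless because ASCII decodes identically in UTF-8 and in the common 8-bit encodings.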

I did a search for "UTF-8 BOM considered harmful" on Google. You may want to consider doing the same. There are convincing arguments why it is a ReallyBadIdea.

/ Rgds, David

On Friday, 8 August 2003, at 21:26, Francisco Tolmasky wrote:

Then how do I determine if it's UTF-8?

On Friday, August 8, 2003, at 12:17 PM, David Elliott wrote:

On Friday, August 8, 2003, at 08:27 AM, Clark S. Cox III wrote:

On Thursday, August 07, 2003, at 23:39, Francisco Tolmasky wrote:

How do I determine if the data is big endian or little endian? Or is just checking for both FEFF and FFFE enough? Also, is there no big/little endian difference in UTF-8 mode? (Do I check for any one of EF BB BF, or for all of those one after the other?)

Yes, checking the BOM (if it exists) will tell you the byte order of the data; that's all you need. And yes, there is no endian difference in UTF-8, because it is defined as a sequence of 8-bit units.
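The BOM check described here can be sketched in a few lines of Python. The helper name utf16_byte_order is made up for this example, and the sketch deliberately ignores UTF-32, whose little-endian BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 one.

```python
import codecs
from typing import Optional

def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'big', 'little', or None if no UTF-16 BOM is present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return None  # no BOM: the byte order must come from elsewhere

sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
order = utf16_byte_order(sample)
# Skip the two BOM bytes, then decode with the detected byte order.
text = sample[2:].decode("utf-16-be" if order == "big" else "utf-16-le")
print(order, text)
```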

Furthermore, you should NOT assume that a file beginning with EF BB BF is necessarily UTF-8. That is a valid three-character string in any normal 8-bit encoding. If you do determine that the file is UTF-8, then you can go ahead and remove the BOM if you wish. And because certain badly behaved editors add a BOM to UTF-8, you must be prepared to do so. As someone else mentioned, the Unicode people consider it a really bad idea to use a BOM in UTF-8, for this and other reasons.
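Both halves of this advice can be illustrated with a short Python sketch. The helper strip_utf8_bom is a hypothetical name used only here, and it assumes the caller has already decided, by some other means, that the data is UTF-8.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# In Latin-1 the same three bytes are ordinary printable characters,
# so their presence alone does not prove the file is UTF-8.
as_latin1 = BOM.decode("latin-1")
print([hex(ord(ch)) for ch in as_latin1])

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM; call only once the data is known to be UTF-8."""
    return data[len(BOM):] if data.startswith(BOM) else data

with_bom = BOM + "text".encode("utf-8")
print(strip_utf8_bom(with_bom).decode("utf-8"))
```

Python's built-in "utf-8-sig" codec performs the same stripping while decoding, so with_bom.decode("utf-8-sig") gives the same text.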

-Dave

Francisco Tolmasky
email@hidden
http://users.adelphia.net/~ftolmasky
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.