Re: converting text input in any encoding to unicode

Subject: Re: converting text input in any encoding to unicode
From: Andrew Thompson <email@hidden>
Date: Sun, 27 Apr 2003 12:20:07 -0400

On Sunday, Apr 27, 2003, at 10:17 America/New_York, Ben Dougall wrote:

On Sunday, April 27, 2003, at 01:57 pm, Clark Cox III wrote:

On Sunday, April 27, 2003, at 07:32AM, Ben Dougall <email@hidden> wrote:

what's the best / usual way from a cocoa app to read in text that's
potentially encoded with any encoding, in order to store it internally
in your app in decomposed unicode? i'd like to be able to deal with as
many encodings as possible - and convert them to the base decomposed
unicode format in order to compare different texts confidently.
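To make the "decomposed unicode" goal concrete, here is a minimal sketch in Python rather than Cocoa (an assumption made purely for illustration; the thread itself contains no code). It uses the standard library's `unicodedata.normalize` with form NFD, and the helper name `to_decomposed` is hypothetical.

```python
import unicodedata

def to_decomposed(text: str) -> str:
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)

composed = "caf\u00e9"      # 'e with acute' as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The two strings look identical on screen but differ code point by code point.
same_before = composed == decomposed

# After normalizing both to NFD they compare equal.
same_after = to_decomposed(composed) == to_decomposed(decomposed)
```

Normalization only helps once the bytes have been decoded correctly, which is the harder problem discussed in the rest of the thread.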

In order to do that, you'd need to have some idea of what encoding the text is in. You can try to discern some encodings, but others will be impossible to differentiate just from looking at the text itself.

surely most (all?) text files indicate not only which characters they contain but also which encoding they're in? i'd have thought that was a standard requirement for text?

You can usually identify Unicode text via the BOM, and you can be pretty sure that if the text does not contain any bytes greater than 127, then it can be interpreted as ASCII. Other than that, you'd need some other hint as to the text's encoding.
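The two checks described above (look for a BOM, otherwise see whether every byte is below 128) might be sketched like this in Python. The function name `sniff_encoding` and its return shape are hypothetical; the BOM constants come from the standard `codecs` module.

```python
import codecs

# Longer BOMs are tested first: the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_encoding(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when nothing can be inferred."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: if every byte is 7-bit, the data can be read as ASCII.
    if all(byte < 128 for byte in data):
        return "ascii", 0
    # Bytes above 127 and no BOM: some other hint is needed.
    return None, 0
```

A `(None, 0)` result is the common case for legacy 8-bit files, which is exactly where guessing or asking the user has to take over.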

unicode is one char encoding out of goodness knows how many. i guess different text systems have different methods for indicating which char encoding? html and xml indicate within the text itself which encoding it's in. i'd have thought all other text formats also indicate which encoding they're in, in one way or another - i guess 'in one way or another' is a stumbling block maybe. but there must be an already existing method to do that to a reasonable extent?

Well, you've kind of answered your own question. HTML and XML do indeed indicate what the encoding of the text is (assuming you can trust the document author not to copy and paste the wrong thing). The reason they indicate their encoding is that it is, in general, so hard to guess the correct one. To put it another way: yes, some file formats indicate their encoding, but these are generally the newer ones. There must be thousands of old file formats that give no indication whatsoever. If you have a fair idea which formats your program is likely to encounter, you can certainly try to read whatever encoding information they carry, but it's likely to be in a different place in every file format.
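As a rough illustration of "reading whatever encoding information they carry", the following Python sketch looks for an XML declaration or an HTML `<meta>` charset near the start of a file. The function name `declared_encoding`, the 1024-byte window, and the regular expressions are all assumptions made for this example; it only works when the declaration itself is written in ASCII-compatible bytes, and it is no substitute for a real HTML or XML parser.

```python
import re

# Matches: <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)
# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_HTML_META = re.compile(
    rb'<meta[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9._-]+)',
    re.IGNORECASE,
)

def declared_encoding(data: bytes):
    """Return the encoding name a document declares for itself, or None."""
    head = data[:1024]  # declarations are expected near the top
    for pattern in (_XML_DECL, _HTML_META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii")
    return None
```

The returned name is only the author's claim; as noted above, it can be wrong if the declaration was pasted in from somewhere else.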

A concrete example of this process might be Mozilla. If you look in the View->Character Coding->Auto Detect menu you'll see an option called "Universal", which means "use an algorithm that tries to guess the character coding for every kind of file, considering all supported encodings" (as opposed to, e.g., View->Character Coding->Auto Detect->Korean, which indicates "I usually browse Korean web sites, so most likely what you'll find will be in some Korean text encoding, so limit your guesses to those").

As I understand it, this was a very difficult algorithm to write/acquire. Also bear in mind that the reason it works at all is that a web browser often has more to go on than just the file itself. A properly configured web server should send an HTTP Content-Type header indicating the encoding of the text, and since web browsers mostly display HTML and XHTML there's a fair chance the document indicates its own encoding. However, the Universal algorithm often gets things wrong: documents and servers which indicate no encoding or, worse, the wrong encoding will quickly trip it up. That's why the View->Character Coding menu exists: to allow the user to override the program when it inevitably makes a mistake.
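For the server-supplied hint mentioned above, a small Python sketch can pull the `charset` parameter out of a Content-Type header value using the standard library's `email.message.Message`. The wrapper name `charset_from_content_type` is hypothetical, and this says nothing about how Mozilla itself does it.

```python
from email.message import Message

def charset_from_content_type(header_value: str):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()
```

A `None` result here corresponds to the "server indicates no encoding" case, where the browser has to fall back on the document or on guessing.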

If you're just opening files on disk you likely have even less to go on than your average web browser. So in the end, you can try to guess, but you must let the user override the guess if you don't want to drive them crazy. Often the most reliable indication is in the user's head. Open up Apple's TextEdit and do File->Open. See what they've done there at the bottom of the screen with the "Plain Text Encoding" box.
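The "guess, but let the user override" policy could look something like the following Python sketch. The function name `decode_with_override` and the candidate list are assumptions chosen for illustration, not a recommendation of which encodings to try; note that `latin-1` accepts every byte sequence, so with this list the guess never fails outright, it is merely sometimes wrong.

```python
def decode_with_override(data, user_encoding=None,
                         candidates=("utf-8", "cp1252", "latin-1")):
    """Decode bytes to text, preferring an explicit user choice over guessing.

    Returns (text, encoding_used).
    """
    if user_encoding is not None:
        # The user knows best: honor the choice without second-guessing it.
        return data.decode(user_encoding), user_encoding
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next candidate
    raise ValueError("no candidate encoding could decode the data")

raw = b"caf\xe9"
guessed_text, guessed_name = decode_with_override(raw)
forced_text, forced_name = decode_with_override(raw, user_encoding="mac_roman")
```

The same four bytes decode to different text depending on the encoding chosen, which is why the final say has to rest with the user.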

AndyT (lordpixel - the cat who walks through walls)
A little bigger on the inside

(see you later space cowboy ...)