Re: converting text input in any encoding to unicode
- Subject: Re: converting text input in any encoding to unicode
- From: Ben Dougall <email@hidden>
- Date: Sun, 27 Apr 2003 21:26:11 +0100
On Sunday, April 27, 2003, at 05:20 pm, Andrew Thompson wrote:
> Well, you've kind of answered your own question. HTML and XML do
> indeed indicate what the encoding of the text is (assuming you can
> trust the document author not to copy and paste the wrong thing).
yup, if the text is html or xml the information should be there - it's
just a case of extracting it somehow, and trusting the author is the
only reasonable option really. if an incorrect encoding is specified in
a document then you can only expect encoding problems from that
document.
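for reference, the bit to extract is the encoding attribute in the xml
declaration at the top of the file, or the charset parameter in html's
meta tag (iso-8859-1 here is just an example value):

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">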
but it's plain text that's troubling me. xml and html exist within
plain text, but they're further, more structured formats - they
shouldn't be too much of a problem.
> The reason they do indicate their encoding is that it is in general
> so hard to guess the correct encoding. To put it another way, yes,
> some file formats indicate their encoding, but these are generally
> the newer ones. There must be thousands of old file formats that
> give no indication whatsoever.
i'm quite surprised at that. i'd have thought indicating which encoding
was used would be a necessity for the file to be readable. i guess it's
a case of text files moving from system to system - before that was so
prevalent and expected, it was common and reasonable for a system to
assume the text would be in its own format, whatever that might have
been, and that kind of thing has hung around for a bit too long, maybe.
> If you have a fair idea which formats your program is likely to
> encounter,
no i don't - just plain text files - that's about as specific as i can
possibly be. any text files that you may happen to have on your drive,
or get from the net. mainly files intended for humans though -
containing natural language.
> you can certainly try to read whatever encoding information they may
> have from them, but it's likely to be in a different place in every
> file format.
even for plain text? i'm sure different formats like pdf and rtf etc
all vary wildly in their encoding schemes - that's obvious. but are you
saying that the encoding info location can vary even for plain text
files? in fact, do plain text files have encoding info as standard, or
is no encoding indication the standard? i suppose there could be
variations between different platforms' plain text files maybe? please
tell me that's not the case. i've got a horrible feeling that it is.
there's no such thing as a single standard plain text format, is there?
i think what i need to find out about is the plain text format itself.
and there's probably not one single standard.
this is the question i probably should have asked in the first place:
do plain text files always / sometimes / never have encoding embedded?
in a header maybe? rather than the way html or xml contains the
encoding within the text itself.
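(the one partial exception i know of: unicode plain text sometimes
starts with a byte-order mark, a few bytes at the very front that
double as an encoding hint -

    EF BB BF    utf-8
    FE FF       utf-16, big-endian
    FF FE       utf-16, little-endian

- but legacy 8-bit text files carry nothing at all.)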
how can i look at the raw contents of text files? when you open a text
file in say bbedit, you just get the text - is there a unix command
line tool that enables you to see the raw full contents of a file?
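hexdump and od look like candidates, assuming the standard bsd tools
that ship with os x (somefile.txt here is just a stand-in name):

    hexdump -C somefile.txt     # offset, hex bytes, and an ascii column
    od -c somefile.txt          # the same bytes as escaped characters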
> A concrete example of this process might be Mozilla. If you look in
> the View->Character Coding->Auto Detect menu you'll see an option
> called "Universal", which means "use an algorithm that tries to guess
> the character coding for every kind of file, considering all supported
> encodings" (as opposed to View->Character Coding->Auto Detect->Korean,
> e.g., which indicates "I usually browse Korean web sites, so most
> likely what you'll find will be in some Korean text encoding, so limit
> your guesses to those").
> As I understand it this was a very difficult algorithm to
> write/acquire. Also bear in mind the reason it works at all is that
> a web browser often has more to go on than just the file itself. A
> properly configured web server should send an HTTP Content-Type
> header indicating the encoding of the text, and since web browsers
> mostly display HTML and XHTML there's a fair chance the document
> indicates its own encoding. However the Universal algorithm often
> gets things wrong: documents and servers which indicate no encoding,
> or worse, the wrong encoding, will quickly trip it up. That's why the
> View->Character Coding menu exists: to allow the user to override the
> program when it inevitably makes a mistake.
hmm, this is a bugger. you're obviously going to need to know the
correct character encoding before converting into unicode - if you get
the wrong encoding at that point it's not going to be good at all. i do
need to convert to unicode though, or attempt to in any case.
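for what it's worth, a crude first stab in cocoa - just a sketch, and
the encoding list here is my own guess, not gospel - would be to try a
strict utf-8 decode first (initWithData:encoding: returns nil when the
bytes aren't valid for the encoding) and then fall back through likely
legacy encodings:

    #import <Foundation/Foundation.h>

    // sketch: try a list of encodings in order. utf-8 goes first because
    // a strict utf-8 decode fails on most non-utf-8 data; latin-1 and
    // mac roman accept any byte sequence, so they're really last resorts.
    NSString *guessStringFromData(NSData *data)
    {
        NSStringEncoding tries[] = { NSUTF8StringEncoding,
                                     NSISOLatin1StringEncoding,
                                     NSMacOSRomanStringEncoding };
        unsigned i;
        for (i = 0; i < sizeof(tries) / sizeof(tries[0]); i++) {
            NSString *s = [[[NSString alloc] initWithData:data
                                                 encoding:tries[i]] autorelease];
            if (s != nil)
                return s;   // decoded cleanly - no guarantee it's right
        }
        return nil;         // nothing fit; time to ask the user
    }

since latin-1 never rejects any byte sequence, the first fallback in
the list effectively wins for anything that isn't valid utf-8 - which
is exactly why letting the user override matters.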
> If you're just opening files on disk you likely have even less to go
> on than your average web browser. So in the end, you can try to
> guess, but you must let the user override the guess if you don't want
> to drive them crazy. Often the most reliable indication is in the
> user's head. Open up Apple's TextEdit and do File->Open. See what
> they've done there at the bottom of the screen with the "Plain Text
> Encoding" box.
yeah, leave it up to the user. automated would be so much better
though. it's not something that you really want to be bothered with.
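if it does come down to asking the user, foundation at least makes the
list easy to build - a sketch (untested) of roughly what that popup
must be doing:

    #import <Foundation/Foundation.h>

    int main(void)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

        // walk foundation's zero-terminated list of supported encodings
        // and print a human-readable name for each - the raw material
        // for a popup like textedit's.
        const NSStringEncoding *enc = [NSString availableStringEncodings];
        while (*enc != 0) {
            NSLog(@"%@", [NSString localizedNameOfStringEncoding:*enc]);
            enc++;
        }

        [pool release];
        return 0;
    }

from there it's just a popup and a re-decode with whatever the user
picks.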
thanks very much for your reply. it does highlight even more what a
minefield it is though :)
thanks, ben.
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.