Re: converting text input in any encoding to unicode
- Subject: Re: converting text input in any encoding to unicode
- From: Ben Dougall <email@hidden>
- Date: Sun, 27 Apr 2003 21:26:11 +0100
On Sunday, April 27, 2003, at 05:20 pm, Andrew Thompson wrote:
> Well, you've kind of answered your own question. HTML and XML do
> indeed indicate what the encoding of the text is (assuming you can
> trust the document author not to copy and paste the wrong thing).
yup, if the text is html or xml the information should be there - it's
just a case of extracting it somehow, and trusting the author is the
only reasonable option really. if an incorrect encoding is specified in
a document then you can only expect encoding problems from that
document.
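for reference, the bit to extract is the encoding attribute in the xml
declaration at the top of the file, or the charset parameter in html's
meta tag (iso-8859-1 here is just an example value):

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">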
but it's plain text that's troubling me. xml and html exist within
plain text, but they're further, more structured formats - they
shouldn't be too much of a problem.
> The reason they do indicate their encoding is that it is in general
> so hard to guess the correct encoding. To put it another way, yes,
> some file formats indicate their encoding, but these are generally
> the newer ones. There must be thousands of old file formats that
> give no indication whatsoever.
i'm quite surprised at that. i'd have thought indicating which encoding
was used would be a necessity for the file to be readable. i guess it's
a case of text files moving from system to system - before that was so
prevalent and expected, it was common and reasonable for a system to
assume the text would be in its own format, whatever that might have
been, and that kind of thing has hung around for a bit too long, maybe.
> If you have a fair idea which formats your program is likely to
> encounter,
no i don't - just plain text files - that's about as specific as i can
possibly be. any text files that you may happen to have on your drive,
or get from the net. mainly files intended for humans though -
containing natural language.
> you can certainly try to read whatever encoding information they may
> have from them, but it's likely to be in a different place in every
> file format.
even for plain text? i'm sure different formats like pdf and rtf etc
all vary wildly in their encoding schemes - that's obvious. but are you
saying that the encoding info location can vary even for plain text
files? in fact, do plain text files have encoding info as standard, or
is no encoding indication the standard? i suppose there could be
variations between different platforms' plain text files maybe? please
tell me that's not the case. i've got a horrible feeling that it is.
there's no such thing as a single standard plain text format, is there?
i think what i need to find out about is the plain text format itself.
and there's probably not one single standard.
this is the question i probably should have asked in the first place:
do plain text files always / sometimes / never have encoding embedded?
in a header maybe? rather than the way html or xml contains the
encoding within the text itself.
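(the one partial exception i know of: unicode plain text sometimes
starts with a byte-order mark, a few bytes at the very front that
double as an encoding hint -

    EF BB BF    utf-8
    FE FF       utf-16, big-endian
    FF FE       utf-16, little-endian

- but legacy 8-bit text files carry nothing at all.)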
how can i look at the raw contents of text files? when you open a text
file in say bbedit, you just get the text - is there a unix command
line tool that enables you to see the raw full contents of a file?
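hexdump and od look like candidates, assuming the standard bsd tools
that ship with os x (somefile.txt here is just a stand-in name):

    hexdump -C somefile.txt     # offset, hex bytes, and an ascii column
    od -c somefile.txt          # the same bytes as escaped characters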
> A concrete example of this process might be Mozilla. If you look in
> the View->Character Coding->Auto Detect menu you'll see an option
> called "Universal", which means "use an algorithm that tries to guess
> the character coding for every kind of file, considering all supported
> encodings" (as opposed to View->Character Coding->Auto Detect->Korean,
> e.g., which indicates "I usually browse Korean web sites, so most
> likely what you'll find will be in some Korean text encoding, so limit
> your guesses to those").
> As I understand it this was a very difficult algorithm to
> write/acquire. Also bear in mind the reason it works at all is that
> a web browser often has more to go on than just the file itself. A
> properly configured web server should send an HTTP Content-Type
> header indicating the encoding of the text, and since web browsers
> mostly display HTML and XHTML there's a fair chance the document
> indicates its own encoding. However the Universal algorithm often
> gets things wrong: documents and servers which indicate no encoding,
> or worse, the wrong encoding, will quickly trip it up. That's why the
> View->Character Coding menu exists: to allow the user to override the
> program when it inevitably makes a mistake.
hmm, this is a bugger. you're obviously going to need to know the
correct character encoding before converting into unicode - if you get
the wrong encoding at that point it's not going to be good at all. i do
need to convert to unicode though, or attempt to in any case.
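for what it's worth, a crude first stab in cocoa - just a sketch, and
the encoding list here is my own guess, not gospel - would be to try a
strict utf-8 decode first (initWithData:encoding: returns nil when the
bytes aren't valid for the encoding) and then fall back through likely
legacy encodings:

    #import <Foundation/Foundation.h>

    // sketch: try a list of encodings in order. utf-8 goes first because
    // a strict utf-8 decode fails on most non-utf-8 data; latin-1 and
    // mac roman accept any byte sequence, so they're really last resorts.
    NSString *guessStringFromData(NSData *data)
    {
        NSStringEncoding tries[] = { NSUTF8StringEncoding,
                                     NSISOLatin1StringEncoding,
                                     NSMacOSRomanStringEncoding };
        unsigned i;
        for (i = 0; i < sizeof(tries) / sizeof(tries[0]); i++) {
            NSString *s = [[[NSString alloc] initWithData:data
                                                 encoding:tries[i]] autorelease];
            if (s != nil)
                return s;   // decoded cleanly - no guarantee it's right
        }
        return nil;         // nothing fit; time to ask the user
    }

since latin-1 never rejects any byte sequence, the first fallback in
the list effectively wins for anything that isn't valid utf-8 - which
is exactly why letting the user override matters.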
> If you're just opening files on disk you likely have even less to go
> on than your average web browser. So in the end, you can try to
> guess, but you must let the user override the guess if you don't want
> to drive them crazy. Often the most reliable indication is in the
> user's head. Open up Apple's TextEdit and do File->Open. See what
> they've done there at the bottom of the screen with the "Plain Text
> Encoding" box.
yeah, leave it up to the user. automated would be so much better
though. it's not something that you really want to be bothered with.
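if it does come down to asking the user, foundation at least makes the
list easy to build - a sketch (untested) of roughly what that popup
must be doing:

    #import <Foundation/Foundation.h>

    int main(void)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

        // walk foundation's zero-terminated list of supported encodings
        // and print a human-readable name for each - the raw material
        // for a popup like textedit's.
        const NSStringEncoding *enc = [NSString availableStringEncodings];
        while (*enc != 0) {
            NSLog(@"%@", [NSString localizedNameOfStringEncoding:*enc]);
            enc++;
        }

        [pool release];
        return 0;
    }

from there it's just a popup and a re-decode with whatever the user
picks.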
thanks very much for your reply. it does highlight even more what a
minefield it is though :)
thanks, ben.
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.