Re: converting text input in any encoding to unicode
- Subject: Re: converting text input in any encoding to unicode
- From: arekkusu <email@hidden>
- Date: Sun, 27 Apr 2003 14:29:08 -0700
On Sunday, April 27, 2003, at 01:31 PM,
email@hidden wrote:
i'm quite surprised at that. i'd have thought indicating which encoding
was used would be a necessity for it to be readable. i guess it's a case
of text files moving from system to system - before that was so
prevalent and expected, it was common and reasonable for a system to
assume the text would be in its own format, whatever that might have
been, and that kind of thing has hung around for a bit too long, maybe.
A very long time. All languages are more or less arbitrary.
Will caveman X's wall drawings make sense to caveman Y? How about
English Morse code vs Japanese Morse code?
People generally just invent whatever they need at the time, often
without regard to future consequences. Now that technology has made
global communication easy, cross-language / cross-cultural /
cross-platform issues are more important. So we have groups like
unicode.org to guide development of text encodings in a future-friendly
manner.
saying that the encoding info location can vary even for plain text
files? in fact, do plain text files have encoding info as standard, or
is no encoding indication the standard? i suppose there could be
variations between different platforms' plain text files, maybe? please
tell me that's not the case. i've got a horrible feeling that it is the
case. there's no such thing as a (or the) standard plain text format, is
there?
Define "standard" first. It's not easy because it's subjective.
The baseline "standard" text encoding you are likely to encounter is 7
bit ASCII, which defines just enough characters to display plain
English text like I'm typing into this mail. No typographic niceties
like curly quotes or em-dashes, no accented characters for European
languages, and forget about more complex writing systems like Chinese
or Arabic.
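As a quick sketch of that limit (in Python, not anything from the thread itself): 7-bit ASCII covers only code points 0 through 127, so even a single accented character refuses to encode.

```python
# Minimal sketch of the 7-bit ASCII limit (hypothetical example).
plain = "plain English text"
assert plain.encode("ascii") == b"plain English text"  # everything fits in 7 bits

try:
    "café".encode("ascii")  # the accented character is outside 0-127
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
assert ascii_ok is False  # ASCII simply has no slot for it
```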
To create "files intended for humans... containing natural language", 7
bit ASCII isn't enough, so over the last few decades, everybody
developed their own 8 and 16 bit extensions to ASCII independently.
Look at the Foundation documentation for NSString "Constants", and then
google search each type of encoding to see what it defines. Everything
past character 127 is defined arbitrarily, according to whatever needs
the developers had at the time.
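One way to see that arbitrariness (a Python sketch, using two encodings that also appear among the NSString constants): the very same byte value past 127 names a different character depending on which legacy encoding you decode it with.

```python
# The same byte means different things in different legacy encodings.
raw = bytes([0xA4])  # one byte past the 7-bit ASCII range
assert raw.decode("latin-1") == "\u00a4"    # CURRENCY SIGN in ISO 8859-1
assert raw.decode("mac_roman") == "\u00a7"  # SECTION SIGN in MacRoman
```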
Even something as simple as the linefeeds after each paragraph have
three common "standards": Mac (CR), Win (CRLF), and Unix (LF).
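The three conventions can be shown in a couple of lines (a Python sketch; `str.splitlines()` happens to understand all of them, which is exactly the kind of interpretation a text editor does for you):

```python
# The three common line-ending conventions.
mac, win, unix = "\r", "\r\n", "\n"
text = "line one" + win + "line two"

# splitlines() recognizes all three, so the result is the same
# whichever convention the file used.
assert text.splitlines() == ["line one", "line two"]
assert ("a" + mac + "b").splitlines() == ("a" + unix + "b").splitlines()
```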
In short, before Unicode, there was no standard, and if you are opening
a "plain text" file that does not have a Unicode BOM at the start,
there is no way to tell programmatically (for sure) what the encoding
is. You need a human to look at it and see if it's corrupt.
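The BOM check is about the only reliable programmatic clue, and it's simple to do. A minimal sketch in Python (the function name is mine, not anything from Foundation):

```python
import codecs

def sniff_bom(data):
    """Return a best-guess encoding if a Unicode BOM starts the data, else None."""
    # Test UTF-32 before UTF-16: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None  # no BOM: a human (or a heuristic) has to guess

assert sniff_bom(codecs.BOM_UTF8 + b"hello") == "utf-8"
assert sniff_bom(b"hello") is None
```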
Perhaps, in time, all text encoding will be Unicode. And perhaps we
will replace gasoline with solar power eventually, too. In the
meantime, you have to deal with encoding conversions.
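In code, a conversion is just decode-with-a-guess, then re-encode (a Python sketch; the thread is really about Cocoa, where NSString's encoding constants express the same idea):

```python
# Hypothetical round trip: legacy MacRoman bytes -> Unicode -> UTF-8.
mac_bytes = "café".encode("mac_roman")   # how a classic Mac app might have saved it
text = mac_bytes.decode("mac_roman")     # only correct if the encoding guess is correct
utf8_bytes = text.encode("utf-8")        # re-encode in a modern encoding
assert utf8_bytes.decode("utf-8") == "café"
assert mac_bytes != utf8_bytes           # same text, different byte sequences
```

Decoding those same bytes with the wrong guess (say, Latin-1) would silently produce the wrong characters rather than an error, which is why a human check is still needed.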
how can i look at the raw contents of text files? when you open a text
file in say bbedit, you just get the text - is there a unix command
line tool that enables you to see the raw full contents of a file?
To see the true raw content, read the man page for hexdump. Any decent
text editor like BBEdit is doing at least some minimal interpretation
of linefeeds, and likely sniffing the encoding too.
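hexdump itself is the right tool for this; purely for illustration, the same inspection can be sketched in Python to see what an editor hides:

```python
import binascii

# The raw bytes behind "café" plus a linefeed, saved as MacRoman.
raw = "café\n".encode("mac_roman")
assert binascii.hexlify(raw).decode() == "6361668e0a"  # 'c' 'a' 'f' 0x8E LF
```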
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.