Re: Data to String: what encoding?
- Subject: Re: Data to String: what encoding?
- From: Todd Blanchard <email@hidden>
- Date: Mon, 16 Sep 2002 15:29:12 +0200
WAKE UP - TIME TO DIE!
OK, not exactly, but I'm guessing you're an American. It's OK, so am I.
So there's hope for you yet.
Character Encoding 101 starts NOW.
0) A character is an abstract idea. It has nothing to do with what you
see on the screen. The letter D is the letter D no matter whether it's
block text or drawn the way the Detroit Tigers do on their ball caps.
1) A character SET is a set of characters (duh). The set defines the
entire alphabet, has nothing to do with computers, and is a completely
abstract entity defined as a set of abstract entities called
characters. { A B C Z & ^ < } is a set of characters. The English
alphabet is a character set. The Hebrew alphabet is a character set.
The ,.?;:!- characters make up the set we commonly refer to as
punctuation.
2) A character encoding is a description of a representation of a
character set in a computer. If I say A=4, B=8, C=3.14, Z=-29, &=44,
^=13, <=-29000, that's a sort of encoding. It's illogical and
inconvenient, but it might work for some application. The term ASCII,
which you so blithely toss out, is both a character set (the letters
a-z and A-Z, the numbers 0-9, and the punctuation found on your
keyboard, plus the null and other control characters) and an encoding
(a=97, for instance - see it all at http://www.asciitable.com).
3) One additional detail is how a character's number will be
represented on the machine. The convention for ASCII is to use one
byte per character. But you can only represent 256 distinct values in
a byte, and Chinese alone has many thousands of characters. So maybe
each character takes up 2 bytes.
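If you want to see the numbers for yourself, a couple of lines will do
it. This is just an illustration (nothing Cocoa-specific going on): a
char literal in C already *is* its ASCII number.

#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    /* In the ASCII encoding each character is a small number that
       fits in one byte; the character literal is that number. */
    NSLog(@"'a' is %d, 'A' is %d, '0' is %d", 'a', 'A', '0');
    /* prints: 'a' is 97, 'A' is 65, '0' is 48 */

    [pool release];
    return 0;
}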
4) As the US computer companies slowly corrupted the rest of the world
with the Babel of computers, they shipped machines with vendor-defined
encodings that typically had ASCII as a base and then used numbers
greater than 127 (ASCII is representable in 7 bits) for the other
"strange characters". This worked until they discovered Asia.
5) Networks arrived and people began trying to exchange data - but the
number 214 might mean one character to a Polish guy and something
completely different to a Brazilian. Characters are drawn using fonts
- shape tables - pick the right shape table and 97 gets drawn as an
'a'. Pick dingbats and it's some weird bugsplat. Basically, the
Brazilian guy's text looks like dingbats to the Polish guy because
their shape tables don't agree. Translation gets crazy. Notice that
NSString supports something like 20 different encodings.
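Don't take my word for the "something like 20" figure - you can ask
NSString yourself. A quick sketch (the exact list depends on your
system):

#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    /* availableStringEncodings returns a zero-terminated list of
       every encoding NSString can convert to and from. */
    const NSStringEncoding *encodings = [NSString availableStringEncodings];
    unsigned count = 0;
    while (*encodings != 0) {
        NSLog(@"%@", [NSString localizedNameOfStringEncoding:*encodings]);
        encodings++;
        count++;
    }
    NSLog(@"%u encodings supported", count);

    [pool release];
    return 0;
}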
6) Enter Unicode - the Borg of character encodings. It absorbs all the
existing character encodings into a huge number space - over 65
thousand characters at last count. For extra credit, search the
Unicode standard for the second representation of ASCII - it's in
there twice. Note that this is just an encoding; there are several
different machine representations for "Unicode". The smallest
fixed-size encoding that covers all the characters in Unicode is UCS-4
(4 bytes per character). Holy cow! If you're an American trying to
save some space, you might be tempted to say BLOAT and to hell with
Unicode. Hold on, though. Unless you're doing something with ancient
Sumerian or some other exotic character set, you can do pretty much
all of the world's living languages in UCS-2 (2 bytes per character).
Those are the fixed-length encodings. There is also a variable-length
encoding called UTF-8. UTF-8 uses 1 byte per character for ASCII, 2
bytes per character for most European languages, and 3 bytes per
character for the Asian characters (these are approximations). It's
size-efficient, but the drawback is that you can only tell the index
of the current character by reading from the beginning and counting to
your current location. A fixed-width encoding lets you calculate a
character's position in constant time but takes more space (sometimes).
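You can watch the size trade-off from Cocoa, too. A sketch (the string
is just an example; note that the "Unicode" data also carries a 2-byte
byte-order mark at the front):

#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSString *s = @"hello";   /* plain ASCII text */

    /* UTF-8 needs one byte per character for ASCII; the 16-bit
       Unicode encoding needs two bytes per character plus a
       byte-order mark. */
    NSLog(@"UTF-8:   %lu bytes",
          (unsigned long)[[s dataUsingEncoding:NSUTF8StringEncoding] length]);
    NSLog(@"Unicode: %lu bytes",
          (unsigned long)[[s dataUsingEncoding:NSUnicodeStringEncoding] length]);

    [pool release];
    return 0;
}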
NSString is *conceptually* a Unicode string. Internally it does
everything in Unicode. It only gets messy when you read and write the
string to some device (like a file or network). Suddenly, exactly what
format the bytes are in matters. That's why you have to specify an
encoding when you go to NSData. Most of these encodings are now
considered legacy. The world is going to Unicode (never mind that
Unicode sucks for Asian languages - it's the best thing going for
now). The preferred persistent format is UTF-8.
So what encoding should you pick? If you're reading an older file from
outside America, you need to find out what encoding it is. If you know
the language, you only have a couple of guesses to try - either it's
one of the legacy encodings or it's Unicode. A couple of heuristics
will let you guess.
If you are picking the encoding, use UTF-8, always - even if you know
in your heart it's ASCII, use UTF-8. Why? ASCII is a strict subset of
UTF-8 (except for Java's modified-UTF-8 stupidity, but that's another
story) and everything will just work.
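In practice, writing and then reading a text file looks something like
this (again just a sketch - the path is made up and error handling is
skipped):

#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSString *path = @"/tmp/notes.txt";   /* made-up path */

    /* Writing: always push UTF-8 bytes out to the file. */
    NSString *text = @"Plain old ASCII is valid UTF-8 too.";
    [[text dataUsingEncoding:NSUTF8StringEncoding] writeToFile:path
                                                    atomically:YES];

    /* Reading: since we picked the encoding, there is nothing to guess. */
    NSData *raw = [NSData dataWithContentsOfFile:path];
    NSString *back = [[[NSString alloc] initWithData:raw
                                            encoding:NSUTF8StringEncoding]
                         autorelease];
    NSLog(@"%@", back);

    [pool release];
    return 0;
}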
This stuff should be taught in schools.
On Monday, September 16, 2002, at 02:13 PM, Randall Crenshaw wrote:
--- Ondra Cada <email@hidden> wrote:
On Sunday, September 15, 2002, at 09:38, Douglas Davidson wrote:

It is impossible in principle to determine the encoding used for an
arbitrary file. If the file contains sufficient amounts of
natural-language text, then a human reader can usually determine the
intended encoding, but it is easy to produce files for which many
different encodings might have been used. Any method of the sort you
propose would be no more than a reasonable guess.
Well, if the text is not entirely trivial and if there is a
spellchecker, you can guess with a pretty low probability of a miss.
That is, of course, not a contradiction of what you have written --
for special cases, there should *always* be a way for the user to
force the encoding manually, in case the heuristics guessed wrong.
Um, ok - now I'm really confused. Just what is an
'encoding' anyway? I have been assuming that an encoding
is something like ASCII where ('A' == 0x0101) except that
in some other encoding, ('A' == 0x01000101 ) or something
like that. (Byte values not intended to be accurate.) So,
as a pure bytestream, there would be no internal clues, but
if you say "this is text" there should be some inherent
characteristics of the bytestream.
For example, if I read the file from disk into an NSString,
I can then convert to NSData using -fastestEncoding. This
would appear to solve the problem, except there is no easy
reverse conversion. If I read an NSData first, I am stuck
hitting the disk again to get it into a string. So how
does NSString pick an encoding? Why can it read from disk
but not from NSData?
Sorry - I'm sure it's apparent I'm at the edge of my
empirical understanding of things. Any books that cover
this stuff?
Thanks,
Randall
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.