Re: Data to String: what encoding?
- Subject: Re: Data to String: what encoding?
- From: Todd Blanchard <email@hidden>
- Date: Mon, 16 Sep 2002 15:29:12 +0200
WAKE UP - TIME TO DIE!
OK, not exactly, but I'm guessing you're an American. It's OK, so am I.
So there's hope for you yet.
Character Encoding 101 starts NOW.
0) A character is an abstract idea. It has nothing to do with what you
see on the screen. The letter D is the letter D no matter whether it's
block text or drawn the way the Detroit Tigers do on their ball caps.
1) A character SET is a set of characters (duh). The set defines the
entire alphabet, has nothing to do with computers, and is a completely
abstract entity defined as a set of abstract entities called
characters. { A B C Z & ^ < } is a set of characters. The English
alphabet is a character set. The Hebrew alphabet is a character set.
The ,.?;:!- characters make up the set we commonly refer to as
punctuation.
2) A character encoding is a description of a representation of a
character set in a computer. If I say A=4, B=8, C=3.14, Z=-29, &=44,
^=13, <=-29000, that's a sort of encoding. It's illogical and
inconvenient, but it might work for some application. The term ASCII,
which you so blithely toss out, is both a character set (the letters
a-z and A-Z, the numbers 0-9, and the punctuation found on your
keyboard, plus the null and other control characters) and an encoding
(a=97, for instance - see it all at http://www.asciitable.com).
3) One additional detail is how a character's number will be
represented on the machine. The convention for ASCII is to use one
byte per character. But you can only represent 256 distinct values in
a byte, and Chinese alone has many thousands of characters. So maybe
each character takes up 2 bytes.
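If you want to see the numbers for yourself, a couple of lines will do
it. This is just an illustration (nothing Cocoa-specific going on): a
char literal in C already *is* its ASCII number.

#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    /* In the ASCII encoding each character is a small number that
       fits in one byte; the character literal is that number. */
    NSLog(@"'a' is %d, 'A' is %d, '0' is %d", 'a', 'A', '0');
    /* prints: 'a' is 97, 'A' is 65, '0' is 48 */

    [pool release];
    return 0;
}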
4) As the US computer companies slowly corrupted the rest of the world
with the Babel of computers, they shipped machines with vendor-defined
encodings that typically had ASCII as a base and then used numbers
greater than 127 (ASCII is representable in 7 bits) for the other
"strange characters". This worked until they discovered Asia.
5) Networks arrived and people began trying to exchange data - but the
number 214 might mean one character to a Polish guy and something
completely different to a Brazilian. Characters are drawn using fonts
- shape tables - pick the right shape table and 97 gets drawn as an
'a'. Pick dingbats and it's some weird bugsplat. Basically, the
Brazilian guy's text looks like dingbats to the Polish guy because
their shape tables don't agree. Translation gets crazy. Notice that
NSString supports something like 20 different encodings.
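Don't take my word for the "something like 20" figure - you can ask
NSString yourself. A quick sketch (the exact list depends on your
system):

#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    /* availableStringEncodings returns a zero-terminated list of
       every encoding NSString can convert to and from. */
    const NSStringEncoding *encodings = [NSString availableStringEncodings];
    unsigned count = 0;
    while (*encodings != 0) {
        NSLog(@"%@", [NSString localizedNameOfStringEncoding:*encodings]);
        encodings++;
        count++;
    }
    NSLog(@"%u encodings supported", count);

    [pool release];
    return 0;
}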
6) Enter Unicode - the Borg of character encodings. It absorbs all the
existing character encodings into a huge number space - over 65
thousand characters at last count. For extra credit, search the
Unicode standard for the second representation of ASCII - it's in
there twice. Note that this is just an encoding; there are several
different machine representations for "Unicode". The smallest
fixed-size encoding that covers all the characters in Unicode is UCS-4
(4 bytes per character). Holy cow! If you're an American trying to
save some space, you might be tempted to say BLOAT and to hell with
Unicode. Hold on, though. Unless you're doing something with ancient
Sumerian or some other exotic character set, you can do pretty much
all of the world's living languages in UCS-2 (2 bytes per character).
Those are the fixed-length encodings. There is also a variable-length
encoding called UTF-8. UTF-8 uses 1 byte per character for ASCII, 2
bytes per character for most European languages, and 3 bytes per
character for the Asian characters (these are approximations). It's
size-efficient, but the drawback is that you can only tell the index
of the current character by reading from the beginning and counting to
your current location. A fixed-width encoding lets you calculate a
character's position in constant time but takes more space (sometimes).
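You can watch the size trade-off from Cocoa, too. A sketch (the string
is just an example; note that the "Unicode" data also carries a 2-byte
byte-order mark at the front):

#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSString *s = @"hello";   /* plain ASCII text */

    /* UTF-8 needs one byte per character for ASCII; the 16-bit
       Unicode encoding needs two bytes per character plus a
       byte-order mark. */
    NSLog(@"UTF-8:   %lu bytes",
          (unsigned long)[[s dataUsingEncoding:NSUTF8StringEncoding] length]);
    NSLog(@"Unicode: %lu bytes",
          (unsigned long)[[s dataUsingEncoding:NSUnicodeStringEncoding] length]);

    [pool release];
    return 0;
}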
NSString is *conceptually* a Unicode string. Internally it does
everything in Unicode. It only gets messy when you read and write the
string to some device (like a file or network). Suddenly, exactly what
format the bytes are in matters. That's why you have to specify an
encoding when you go to NSData. Most of these encodings are now
considered legacy. The world is going to Unicode (never mind that
Unicode sucks for Asian languages - it's the best thing going for
now). The preferred persistent format is UTF-8.
So what encoding should you pick? If you're reading an older file from
outside America, you need to find out what encoding it is. If you know
the language, you only have a couple of guesses to try - either it's
one of the legacy encodings or it's Unicode. A couple of heuristics
will let you guess.
If you are picking the encoding, use UTF-8, always - even if you know
in your heart it's ASCII, use UTF-8. Why? ASCII is a strict subset of
UTF-8 (except for Java's modified-UTF-8 stupidity, but that's another
story) and everything will just work.
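In practice, writing and then reading a text file looks something like
this (again just a sketch - the path is made up and error handling is
skipped):

#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSString *path = @"/tmp/notes.txt";   /* made-up path */

    /* Writing: always push UTF-8 bytes out to the file. */
    NSString *text = @"Plain old ASCII is valid UTF-8 too.";
    [[text dataUsingEncoding:NSUTF8StringEncoding] writeToFile:path
                                                    atomically:YES];

    /* Reading: since we picked the encoding, there is nothing to guess. */
    NSData *raw = [NSData dataWithContentsOfFile:path];
    NSString *back = [[[NSString alloc] initWithData:raw
                                            encoding:NSUTF8StringEncoding]
                         autorelease];
    NSLog(@"%@", back);

    [pool release];
    return 0;
}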
This stuff should be taught in schools.
On Monday, September 16, 2002, at 02:13 PM, Randall Crenshaw wrote:
--- Ondra Cada <email@hidden> wrote:
On Sunday, September 15, 2002, at 09:38, Douglas Davidson wrote:

It is impossible in principle to determine the encoding used for an
arbitrary file. If the file contains sufficient amounts of
natural-language text, then a human reader can usually determine the
intended encoding, but it is easy to produce files for which many
different encodings might have been used. Any method of the sort you
propose would be no more than a reasonable guess.
Well, if the text is not entirely trivial and if there is a
spellchecker, you can guess with a pretty low probability of a miss.
That is, of course, not a contradiction of what you have written --
for special cases, there should *always* be a way for the user to
force the encoding manually, in case the heuristics guessed wrong.
Um, ok - now I'm really confused. Just what is an
'encoding' anyway? I have been assuming that an encoding
is something like ASCII where ('A' == 0x0101) except that
in some other encoding, ('A' == 0x01000101 ) or something
like that. (Byte values not intended to be accurate.) So,
as a pure bytestream, there would be no internal clues, but
if you say "this is text" there should be some inherent
characteristics of the bytestream.
For example, if I read the file from disk into an NSString,
I can then convert to NSData using -fastestEncoding. This
would appear to solve the problem, except there is no easy
reverse conversion. If I read an NSData first, I am stuck
hitting the disk again to get it into a string. So how
does NSString pick an encoding? Why can it read from disk
but not from NSData?
Sorry - I'm sure it's apparent I'm at the edge of my
empirical understanding of things. Any books that cover
this stuff?
Thanks,
Randall
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.