Re: Is there any support in Cocoa for stupidly encoded UTF-8 string?
- Subject: Re: Is there any support in Cocoa for stupidly encoded UTF-8 string?
- From: email@hidden
- Date: Thu, 20 Jan 2005 13:23:49 -0800
On Jan 20, 2005, at 11:42 AM, Andrew Farmer wrote:
> What you're looking at is ISO8859-1 encoded text. Decode it as such
> and you'll be fine.
> I'm pretty sure that there *should* be some easy way to detect whether
> text in the subject is encoded with ISO8859-1 or UTF-8. Look up the
> standards (if they exist).
You would think so, but there isn't. UTF-8 uses multi-byte sequences
to encode characters beyond ASCII. It signifies that a character is
multi-byte by setting the high bit of every byte in the sequence: the
first byte starts with the bits 11 and each following byte with 10.
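To make the byte layout concrete, here is a small Python 3 sketch (not part of the original script) that prints the bits of é in each encoding. The helper name show_bytes is made up for illustration.

```python
def show_bytes(text, encoding):
    """Return the encoded bytes of text as a list of 8-bit binary strings."""
    return [format(b, "08b") for b in text.encode(encoding)]

# U+00E9 is é, written as an escape so the source file encoding does not matter.
print(show_bytes("\u00e9", "iso-8859-1"))  # a single byte with the high bit set
print(show_bytes("\u00e9", "utf-8"))       # a 110xxxxx lead byte, then a 10xxxxxx byte
```

The same character is one byte in ISO8859-1 and two bytes in UTF-8, which is why dropping the former into a stream declared as the latter confuses decoders.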
So, when you are looking at an ISO8859-1 string and run into é, it is a
single byte [0xe9] that happens to have the high bit set. Now, you
could look at the following bytes and decide that 0xe9 0xa9 (or whatever)
could not possibly be correct because you just dropped some random
Asian/Arabic character into the middle of a block of text that is
otherwise composed of characters from the 'western' alphabet....
But it'll be a guess at best.
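One common way to make that guess, sketched here in Python 3 under the assumption that the input is either all UTF-8 or all ISO8859-1, is to try a strict UTF-8 decode first and fall back only if it fails. The function name guess_decode is hypothetical, and this is still a heuristic: ISO8859-1 text that happens to form valid UTF-8 sequences will be misread.

```python
def guess_decode(raw):
    """Decode raw bytes, preferring strict UTF-8 and falling back to ISO8859-1.

    Returns a (text, encoding_name) pair. This is a guess, not a detection.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte value is a valid ISO8859-1 character, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"

print(guess_decode(b"caf\xe9"))      # lone 0xe9 is not valid UTF-8
print(guess_decode(b"caf\xc3\xa9"))  # a well-formed two-byte sequence
```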
It seems that this exact problem comes up about once every six months
for me. So far, I have had to deal with it in Python, in Java, in
Objective-C, and when dealing with XML (in a couple of different languages).
Most commonly, I have had to deal with XML documents that claim to be
UTF-8, but have ISO8859-1 encoded accents throughout the PCDATA blocks
(causing many parsers to barf).
Fortunately, once you determine that you are looking at a UTF-8 stream
where someone screwed up and shoved ISO8859-1 characters into it, the
conversion is easy assuming you can limit the scope of conversion to
characters that are actually hosed.
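Limiting the scope could look something like the following Python 3 sketch, which keeps well-formed UTF-8 sequences untouched and reinterprets only the bytes that fail to decode as ISO8859-1. The handler name "latin1-fallback" and the function repair_mixed are made up for this example; codecs.register_error is the standard library hook used to plug in the fallback.

```python
import codecs

def _latin1_fallback(err):
    """Error handler: treat the undecodable bytes as ISO8859-1 characters."""
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad.decode("iso-8859-1"), err.end

codecs.register_error("latin1-fallback", _latin1_fallback)

def repair_mixed(raw):
    """Return clean UTF-8 bytes, converting only the bytes that were hosed."""
    return raw.decode("utf-8", errors="latin1-fallback").encode("utf-8")

# The first é is a stray ISO8859-1 byte, the second is already proper UTF-8.
print(repair_mixed(b"caf\xe9 and caf\xc3\xa9"))
```

This is still subject to the guessing problem: a run of ISO8859-1 bytes that happens to look like valid UTF-8 passes through unconverted.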
To convert é from ISO8859-1 to UTF-8, you would use an algorithm like
the following. Note that this is a totally braindead Python script
that I one-off'd to solve a problem at hand. It reads and writes
characters one-by-one. About as inefficient as can be imagined, but
very straightforward and it did the job.
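import sys

# Python 3 port: use the binary streams so no text decoding interferes.
stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
while True:
    x = stdin.read(1)
    if not x:
        break
    b = x[0]
    if b < 0x80:
        # Plain ASCII passes through unchanged.
        stdout.write(x)
    else:
        # ISO8859-1 byte becomes a two-byte UTF-8 sequence.
        stdout.write(bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)]))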
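As a sanity check on the bit arithmetic, here is a short Python 3 sketch that applies the same two expressions to a single byte value and compares the result with Python's built-in codecs for every possible byte. The helper name latin1_byte_to_utf8 is hypothetical.

```python
def latin1_byte_to_utf8(b):
    """Map one ISO8859-1 byte value (0-255) to its UTF-8 byte sequence."""
    if b < 0x80:
        return bytes([b])
    # Top two bits go into the lead byte, low six bits into the continuation byte.
    return bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)])

# Compare against the built-in codecs for all 256 byte values.
for b in range(256):
    expected = bytes([b]).decode("iso-8859-1").encode("utf-8")
    assert latin1_byte_to_utf8(b) == expected

print(latin1_byte_to_utf8(0xE9).hex())  # c3a9
```

For é, 0xe9 >> 6 is 3, giving the lead byte 0xc3, and 0xe9 & 0x3f is 0x29, giving the continuation byte 0xa9. When the whole input is known to be ISO8859-1, decoding it as "iso-8859-1" and re-encoding as "utf-8" does the same job in one step.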
b.bum
Cocoa-dev mailing list (email@hidden)