Re: Is there any support in Cocoa for stupidly encoded UTF-8 string?

Subject: Re: Is there any support in Cocoa for stupidly encoded UTF-8 string?
From: email@hidden
Date: Thu, 20 Jan 2005 13:23:49 -0800

On Jan 20, 2005, at 11:42 AM, Andrew Farmer wrote:

> What you're looking at is ISO8859-1 encoded text. Decode it as such and you'll be fine.
>
> I'm pretty sure that there *should* be some easy way to detect whether text in the subject is encoded with ISO8859-1 or UTF-8. Look up the standards (if they exist).

You would think so, but there isn't. UTF-8 uses multi-byte sequences to encode characters beyond ASCII. It signifies that a character is multi-byte by setting the high bit of the first byte; every following byte in the sequence has its high bit set as well.
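To make the bit patterns concrete, here is a small Python 3 sketch that classifies a byte by the role its high bits imply in UTF-8. The helper name `describe_byte` is made up for illustration, and it looks at bit patterns only: a few values it labels as lead bytes (0xC0, 0xC1, 0xF5-0xF7) never occur in valid UTF-8.

```python
def describe_byte(b: int) -> str:
    """Classify one byte value by the UTF-8 role its high bits imply."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx: a complete character
    if b < 0xC0:
        return "continuation"   # 10xxxxxx: second or later byte of a sequence
    if b < 0xE0:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if b < 0xF0:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    if b < 0xF8:
        return "lead-4"         # 11110xxx: starts a four-byte sequence
    return "invalid"            # 11111xxx: not used by UTF-8

# 'é' in UTF-8 is the two bytes 0xC3 0xA9.
encoded = "é".encode("utf-8")
roles = [describe_byte(b) for b in encoded]
print(encoded.hex(), roles)
```

Note that the ISO8859-1 byte for é, 0xE9, has the bit pattern of a three-byte lead, which is why it confuses a UTF-8 decoder.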

So, when you are looking at an ISO8859-1 string and run into é, it is a single byte [0xe9] that happens to have the high bit set. Now, you could look at the next byte and decide that 0xe9 0xa9 (or whatever) could not possibly be correct because you just dropped some random Asian/Arabic character into the middle of a block of text that is otherwise composed of characters from the 'western' alphabet....

But it'll be a guess at best.
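One common form of that guess, sketched in Python 3: try a strict UTF-8 decode first and fall back to ISO8859-1 if it fails. The function name `guess_decode` is hypothetical, and this is a heuristic, not a detector. Any byte string decodes as ISO8859-1, and some ISO8859-1 text also happens to be valid UTF-8.

```python
from typing import Tuple

def guess_decode(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_guess) for raw bytes of unknown encoding."""
    try:
        # Strict decoding rejects malformed sequences such as a lone 0xE9.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO8859-1 maps every byte to a character, so this never fails.
        return raw.decode("iso-8859-1"), "iso-8859-1"

print(guess_decode(b"caf\xe9 au lait"))      # lone 0xE9: not valid UTF-8
print(guess_decode(b"caf\xc3\xa9 au lait"))  # well-formed UTF-8
```

The ambiguity shows up with input like `b"\xc3\xa9"`: it is valid UTF-8 for "é" and also valid ISO8859-1 for "Ã©", and the bytes alone cannot say which was meant.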

It seems that this exact problem comes up about once every six months for me. So far, I have had to deal with it in Python, Java, and Objective-C, and when dealing with XML (in a couple of different languages). Most commonly, I have had to deal with XML documents that claim to be UTF-8, but have ISO8859-1 encoded accents throughout the PCDATA blocks (causing many parsers to barf).

Fortunately, once you determine that you are looking at a UTF-8 stream where someone screwed up and shoved ISO8859-1 characters into it, the conversion is easy, assuming you can limit the scope of conversion to the bytes that are actually hosed.
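One way to limit the scope, sketched in Python 3 under the assumption that any high-bit byte which does not start a well-formed UTF-8 sequence is a stray ISO8859-1 character. The name `repair_mixed` is hypothetical. Well-formed sequences are copied through untouched; everything else is re-encoded byte by byte.

```python
def repair_mixed(raw: bytes) -> bytes:
    """Re-encode stray ISO8859-1 bytes inside an otherwise UTF-8 stream."""
    out = bytearray()
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            out.append(b)            # ASCII is identical in both encodings
            i += 1
            continue
        # Sequence length implied by the lead byte (0 = cannot be a lead).
        if 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0
        chunk = raw[i:i + n]
        if n:
            try:
                chunk.decode("utf-8")  # strict check of this one sequence
                out += chunk           # already good UTF-8: keep as-is
                i += n
                continue
            except UnicodeDecodeError:
                pass                   # fall through and treat as ISO8859-1
        # ISO8859-1 byte 0x80-0xFF becomes a two-byte UTF-8 sequence.
        out += bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)])
        i += 1
    return bytes(out)

print(repair_mixed(b"caf\xe9 and caf\xc3\xa9").decode("utf-8"))
```

This is still a guess: an ISO8859-1 run such as "Ã©" is byte-for-byte valid UTF-8 and will be left alone.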

To convert é from ISO8859-1 to UTF-8, you would use an algorithm like the following. Note that this is a totally braindead Python script that I one-off'd to solve a problem at hand. It reads and writes characters one-by-one. About as inefficient as can be imagined, but very straightforward and it did the job.

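import sys

# Python 3: work on raw bytes so nothing is decoded behind our back.
stdin, stdout = sys.stdin.buffer, sys.stdout.buffer

while True:
    x = stdin.read(1)
    if x == b'':
        break
    c = x[0]
    if c < 0x80:
        # Plain ASCII passes through unchanged.
        stdout.write(x)
    else:
        # ISO8859-1 0x80-0xFF is U+0080-U+00FF: two bytes in UTF-8.
        stdout.write(bytes([0xC0 | (c >> 6)]))
        stdout.write(bytes([0x80 | (c & 0x3F)]))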
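As a sanity check on the bit arithmetic, here is the same transformation in Python 3 as a function on a single byte value, worked through for é and compared against Python's own codecs for all 256 values. The name `latin1_byte_to_utf8` is made up for this sketch.

```python
def latin1_byte_to_utf8(c: int) -> bytes:
    """Encode one ISO8859-1 byte value (0-255) as UTF-8."""
    if c < 0x80:
        return bytes([c])
    # Top two bits go into the lead byte, low six into the continuation byte.
    return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])

# Worked example for é (0xE9 = 1110 1001):
lead = 0xC0 | (0xE9 >> 6)     # 0xE9 >> 6   = 0x03, so lead  = 0xC3
trail = 0x80 | (0xE9 & 0x3F)  # 0xE9 & 0x3F = 0x29, so trail = 0xA9

# Cross-check every byte value against the built-in codecs.
matches = all(
    latin1_byte_to_utf8(c) == bytes([c]).decode("iso-8859-1").encode("utf-8")
    for c in range(256)
)
print(hex(lead), hex(trail), matches)
```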

b.bum
