Re: Is there any support in Cocoa for stupidly encoded UTF-8 string?

Subject: Re: Is there any support in Cocoa for stupidly encoded UTF-8 string?
From: Andrew Farmer <email@hidden>
Date: Thu, 20 Jan 2005 13:37:36 -0800

On 20 Jan 2005, at 11:41, John Stiles wrote:

> On Jan 20, 2005, at 11:42 AM, Andrew Farmer wrote:
>> On 20 Jan 2005, at 09:26, Stephane Sudre wrote:
>>> In some e-mail subjects, people are using what is supposed to be UTF-8 but is actually badly encoded Unicode.
>>>
>>> For instance, instead of 0xC3A9 for eacute, you end up with 0xE9 (where it should be 0x00E9).
>>>
>>> When you use NSString initWithBytes:length:encoding with the UTF-8 encoding as the parameter, you obtain nil. I understand this.
>>>
>>> Now, the question is: is there a method in Cocoa to deal with stupidly encoded UTF-8 string?
>> What you're looking at is ISO8859-1 encoded text. Decode it as such and you'll be fine.
>>
>> I'm pretty sure that there *should* be some easy way to detect whether text in the subject is encoded with ISO8859-1 or UTF-8. Look up the standards (if they exist).
> The easiest detection method would be that NSString initWithBytes:length:encoding returned nil :) :) Seriously, that's a pretty good clue that the text wasn't valid UTF-8. At that point you get to guess its format, and Windows Latin-1 is as good a guess as any.
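The "try UTF-8, then guess Latin-1" approach described above can be sketched in a few lines. This is a language-neutral illustration in Python rather than Cocoa code; the function name `decode_with_fallback` is hypothetical, and the choice of ISO8859-1 (rather than Windows-1252) as the fallback is an assumption made because it maps every byte value and so can never fail.

```python
def decode_with_fallback(raw):
    """Return (text, encoding_used) for raw header bytes.

    Hypothetical helper: strict UTF-8 first, ISO8859-1 as the guess
    when the bytes are not valid UTF-8.
    """
    try:
        # Strict decoding raises on any malformed sequence, which plays
        # the same role as NSString returning nil.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte value, so this
        # branch always succeeds (whether the guess is right or not).
        return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"caf\xe9")  # lone 0xE9 is not valid UTF-8
```

Note that the fallback only says the bytes were not UTF-8; it does not prove they were Latin-1.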

And on 20 Jan 2005, at 13:23, email@hidden wrote:

> You would think so, but there isn't. UTF-8 uses multi-byte sequences to encode characters beyond ASCII. It signifies that a character is multi-byte by setting the high bit of the first character...
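To make the high-bit rule concrete, here is a rough Python sketch (not Cocoa code) that classifies a byte by the sequence length it announces. The function name `utf8_sequence_length` is hypothetical; the byte ranges are the standard UTF-8 lead-byte ranges.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence starting with first_byte
    should occupy, or 0 if the byte cannot start a sequence."""
    if first_byte < 0x80:
        return 1  # 0xxxxxxx: plain ASCII, high bit clear
    if 0xC2 <= first_byte <= 0xDF:
        return 2  # 110xxxxx: lead byte of a 2-byte sequence
    if 0xE0 <= first_byte <= 0xEF:
        return 3  # 1110xxxx: lead byte of a 3-byte sequence
    if 0xF0 <= first_byte <= 0xF4:
        return 4  # 11110xxx: lead byte of a 4-byte sequence
    return 0      # continuation byte (0x80-0xBF) or invalid lead byte


# 0xE9 announces a 3-byte sequence, which is why a lone Latin-1 eacute
# followed by ASCII text fails to decode as UTF-8.
length_for_e9 = utf8_sequence_length(0xE9)
```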

And, as if that weren't enough, on 20 Jan 2005, at 12:15, Clark Cox wrote:

> While you can make some educated guesses, there is no foolproof way to
> conclusively determine if text is UTF-8 vs. ISO-8859-1. The best guess
> that you can make is already made by NSString for you: It couldn't
> convert the text and returned nil.

Clarification: some easy way BASED ON MESSAGE HEADERS. It's certainly true that a failed conversion from UTF-8 indicates that the text isn't UTF-8, but a *successful* conversion from UTF-8 doesn't mean that the text was actually UTF-8.

For example, the character 'ö' (Latin small letter o with diaeresis) is encoded in ISO8859-1 as (0xF6), and in UTF-8 as (0xC3, 0xB6). However, the sequence (0xC3, 0xB6) is also valid ISO8859-1: it represents the string 'Ã¶' (Latin capital letter A with tilde, pilcrow sign). Granted, this is an unlikely sequence, but it COULD conceivably appear in text. Like this.
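The ambiguity is easy to reproduce. The following is a small Python illustration (not Cocoa code) of the same two bytes decoding cleanly under both encodings; the helper `is_valid_utf8` is a hypothetical name introduced only for this sketch.

```python
def is_valid_utf8(raw):
    """Hypothetical helper: True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = bytes([0xC3, 0xB6])

as_utf8 = raw.decode("utf-8")      # one character: o with diaeresis
as_latin1 = raw.decode("latin-1")  # two characters: A with tilde, pilcrow

# The ISO8859-1 encoding of the same letter is the single byte 0xF6,
# which is not valid UTF-8 on its own.
lone_f6_ok = is_valid_utf8(bytes([0xF6]))
```

Both decodes succeed, so a successful UTF-8 conversion by itself cannot tell the two readings apart.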

I'm sure there's some way to detect the encoding of a subject line based on its format without "heuristic" methods like these. The internal mail message format uses low-bucky characters exclusively (encoding special characters with =XX sequences, if I remember correctly), so there's probably a special syntax for UTF-8...
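For what it's worth, the header-level syntax being guessed at here does exist: RFC 2047 "encoded-words" of the form `=?charset?Q?...?=` (or `?B?` for base64) name the charset explicitly. Below is a minimal sketch using Python's standard `email.header.decode_header`, not a Cocoa API; the wrapper name `decode_subject` is hypothetical, and it assumes the sender actually used encoded-words and named a charset that Python recognises. Raw 8-bit subjects like the one in the original question carry no such label and still need guessing.

```python
from email.header import decode_header


def decode_subject(header_value):
    """Hypothetical wrapper: decode RFC 2047 encoded-words in a header
    value into a single string."""
    parts = []
    for chunk, charset in decode_header(header_value):
        if isinstance(chunk, bytes):
            # Encoded chunks arrive as bytes together with the charset
            # named in the header; unlabeled bytes are treated as ASCII.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # A header with no encoded-words comes back as plain str.
            parts.append(chunk)
    return "".join(parts)


utf8_subject = decode_subject("=?UTF-8?Q?caf=C3=A9?=")
latin1_subject = decode_subject("=?ISO-8859-1?Q?caf=E9?=")
```

Here the `=XX` escapes are the same in spirit as the ones mentioned above, and the charset label removes the need for heuristics.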
