Re: iso-8859-1 over UTF8 (was: Re: cString deprecated!)
- Subject: Re: iso-8859-1 over UTF8 (was: Re: cString deprecated!)
- From: Chris Ridd <email@hidden>
- Date: Thu, 05 Sep 2002 11:23:19 +0100
On 5/9/02 8:59 am, Malte Tancred <email@hidden> wrote:
> On wednesday, sep 4, 2002, at 15:09 Europe/Stockholm, Clark S. Cox III
> wrote:
>> No you wouldn't. There is no way that any byte in a multi-byte UTF-8
>> character could be confused for an ASCII character, because they
>> always have the high bit set. For instance, there is no way that you
>> can make a multi-byte UTF-8 character that looks like "%d".
>
> I believe there is something called "overlong representation". For
> example, a slash (/) can be represented by for example a 3 byte UTF-8
> sequence. The encoding/algorithm per se allows this.
Sort of correct - you aren't permitted to encode this, but the original
spec was slightly vague about whether a decoder had to reject it.
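
To make the overlong slash concrete, here is a rough Python sketch. The helper names (encode_with_length, is_rejected) are made up for illustration and are not part of any library. It packs "/" into 2 and 3 bytes without enforcing shortest form, then checks how Python's built-in strict UTF-8 decoder treats the result.

```python
# Hypothetical helper: pack a code point into a fixed number of
# UTF-8-style bytes WITHOUT enforcing the shortest-form rule.
def encode_with_length(code_point, length):
    if length == 1:
        return bytes([code_point])
    lead_prefix = {2: 0xC0, 3: 0xE0, 4: 0xF0}[length]
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))  # continuation byte 10xxxxxx
        code_point >>= 6
    out.append(lead_prefix | code_point)        # lead byte carries the rest
    return bytes(reversed(out))


# Hypothetical helper: does Python's strict UTF-8 decoder refuse the input?
def is_rejected(data):
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


slash = ord("/")                            # 0x2F
overlong_2 = encode_with_length(slash, 2)   # bytes C0 AF
overlong_3 = encode_with_length(slash, 3)   # bytes E0 80 AF

# Every byte of the overlong forms has the high bit set, so a byte-level
# scan for 0x2F never sees a slash; the risk only appears after a lenient
# decoder turns the sequence back into U+002F.
high_bits_set = all(b & 0x80 for b in overlong_2 + overlong_3)
```

So both points in the thread hold at once: no byte in the overlong sequence looks like ASCII, yet a decoder that accepts it will still hand back a "/" that an earlier byte-level check never saw.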
> This behavior is forbidden though, published in an extension to the
> original spec I think.
Correct! The Unicode consortium's technical report 27 includes a "UTF-8
Corrigendum" that prohibits the interpretation of non-shortest forms of BMP
characters.
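
As an illustration of what rejecting non-shortest forms can look like in a decoder, here is a minimal Python sketch. It is my own simplification, not text from the report: the names MIN_FOR_LENGTH, sequence_length and decode_one are hypothetical, and it leaves out other checks a real decoder needs (surrogates, code points above U+10FFFF).

```python
# Smallest code point that genuinely needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def sequence_length(lead):
    """Return how many bytes the lead byte announces."""
    if lead < 0x80:
        return 1
    if lead & 0xE0 == 0xC0:
        return 2
    if lead & 0xF0 == 0xE0:
        return 3
    if lead & 0xF8 == 0xF0:
        return 4
    raise ValueError("not a valid lead byte")


def decode_one(data):
    """Decode the first sequence in data and return (code_point, length).

    Raises ValueError for malformed input, including non-shortest forms.
    """
    n = sequence_length(data[0])
    if len(data) < n:
        raise ValueError("truncated sequence")
    if n == 1:
        return data[0], 1
    code_point = data[0] & (0xFF >> (n + 1))  # payload bits of the lead byte
    for b in data[1:n]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        code_point = (code_point << 6) | (b & 0x3F)
    # The shortest-form check: the value must need this many bytes.
    if code_point < MIN_FOR_LENGTH[n]:
        raise ValueError("non-shortest form")
    return code_point, n
```

With this check in place, decode_one(b"\xc0\xaf") raises ValueError instead of returning a slash, while the ordinary one-byte b"/" still decodes to 0x2F.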
If you're desperately interested in this stuff and/or have copious spare
time ;-) the report is at:
<http://www.unicode.org/unicode/reports/tr27/>
I recall Microsoft falling foul of this problem in IIS.
Cheers,
Chris