Re: wchar_t and printf not working
- Subject: Re: wchar_t and printf not working
- From: Clark Cox <email@hidden>
- Date: Tue, 29 Mar 2005 01:00:53 -0500
On Mon, 28 Mar 2005 23:13:39 -0500 (EST), Michael B Allen
<email@hidden> wrote:
> Clark Cox said:
> > On Mon, 28 Mar 2005 14:09:11 -0500, Michael B Allen <email@hidden>
> > wrote:
> >> On Mon, 28 Mar 2005 08:39:25 -0500
> >> Clark Cox <email@hidden> wrote:
> >>
> >> > Actually, you will *never* see UTF-8 with more than 4 octets per
> >> > codepoint. Period. That is the way that UTF-8 is defined. If you see a
> >> > 5 or 6 octet character, then you are not reading UTF-8 data.
> >>
> >> This is incorrect. Please read the first sentence in section 2 of
> >> RFC 2279
> >
> > RFC 2279 has been obsoleted by RFC 3629 for years now:
> > http://www.ietf.org/rfc/rfc3629.txt
>
> First, this is dated 16 months ago. Pardon me for googling "utf-8 rfc" but
> don't be a twit - it's not "years".
Is there really any need for the name-calling?
> > From said RFC:
> > [snip]
> > 3. UTF-8 definition
> >
> > UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and
> > formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]
> >
> > In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
> > accessible range) are encoded using sequences of 1 to 4 octets.
>
> This just says that range takes 4 octets. Well yeah. Duh. As I stated
> previously, that is the only practical range of values one would ever see,
> but technically UTF-8 can encode sequences of up to 6 bytes.
Not as defined by the Unicode Standard.
> RFC 3629 mentions this in the Security Considerations section:
>
> Another security issue occurs when encoding to UTF-8: the ISO/IEC
> 10646 description of UTF-8 allows encoding character numbers up to
> U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
> a risk of buffer overflow if the range of character numbers is not
> explicitly limited to U+10FFFF or if buffer sizing doesn't take into
> account the possibility of 5- and 6-byte sequences.
>
> Man, it really doesn't pay to be pedantic these days.
You missed the quoted text "UTF-8 is defined by the Unicode Standard
[UNICODE]." You also missed the part where it stated "The
authoritative definition of UTF-8 is in [UNICODE]. This grammar is
believed to describe the same thing Unicode describes, but does not
claim to be authoritative." So, even the RFC defers to the Unicode
standard.
In the Unicode Standard, section 3.9, table 3-6 lists the ranges of
acceptable values for *ALL* valid UTF-8 octets. There are no valid
values for a 5th or 6th byte. Period. The same section states "the
following byte values are disallowed in UTF-8: C0–C1, F5–FF." For a
so-called 5- or 6-octet UTF-8 sequence to exist, its first byte would
have to take a value in the range F8-FD, which falls inside the F5-FF
range that the quoted text explicitly disallows. Nowhere in the
Unicode Standard is there any allowance for a 5- or 6-octet UTF-8
encoding of a code point.
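
To make the byte ranges concrete, here is a rough sketch in C of how a
decoder and an encoder might enforce them. The function names
utf8_seq_len and utf8_encode are made up for this message and are not
taken from any library; the sketch only classifies the lead byte and
does not check the tighter second-byte ranges that apply after E0, ED,
F0 and F4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expected length of a UTF-8 sequence given its
 * first byte, or 0 if the byte can never start a well-formed sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead <= 0x7F)                 return 1;  /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    /* 80-BF are continuation bytes; C0-C1 and F5-FF are disallowed,
     * so the old 5- and 6-byte lead bytes F8-FD end up here. */
    return 0;
}

/* Hypothetical helper: encode cp into out, which must hold 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * lies above U+10FFFF. Because those values are rejected, 4 bytes is
 * always enough. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                            /* surrogate */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                    /* above U+10FFFF */
}
```

With this, utf8_seq_len(0xF8) and utf8_seq_len(0xFC) both return 0, and
utf8_encode refuses anything above U+10FFFF, which also addresses the
buffer-sizing concern from the RFC's Security Considerations.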
--
Clark S. Cox III
email@hidden
http://www.livejournal.com/users/clarkcox3/
http://homepage.mac.com/clarkcox3/