Re: wchar_t and printf not working

28 Mar 2005


On Mon, 28 Mar 2005 23:13:39 -0500 (EST), Michael B Allen
<mba2000@ioplex.com> wrote:
...
Clark Cox said:
...
On Mon, 28 Mar 2005 14:09:11 -0500, Michael B Allen <mba2000@ioplex.com>
wrote:
...
On Mon, 28 Mar 2005 08:39:25 -0500
Clark Cox <clarkcox3@gmail.com> wrote:
...
Actually, you will *never* see UTF-8 with more than 4 octets per
codepoint. Period. That is the way that UTF-8 is defined. If you see a
5 or 6 octet character, then you are not reading UTF-8 data.
This is incorrect. Please read the first sentence in section 2 of
RFC 2279
RFC 2279 has been obsoleted by RFC 3629 for years now:
http://www.ietf.org/rfc/rfc3629.txt
First, this is dated 16 months ago. Pardon me for googling "utf-8 rfc" but
don't be a twit - it's not "years".
Is there really any need for the name-calling?
...
...
From said RFC:
[snip]
3.  UTF-8 definition
UTF-8 is defined by the Unicode Standard [UNICODE].  Descriptions and
   formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets.
This just says that range takes 4 octets. Well yeah. Duh. As I stated
previously, that is the only practical range of values one would ever
see, but technically UTF-8 can encode sequences of up to 6 bytes.
Not as defined by the Unicode Standard.
...
RFC 3629 mentions this in the Security Considerations section:
Another security issue occurs when encoding to UTF-8: the ISO/IEC
   10646 description of UTF-8 allows encoding character numbers up to
   U+7FFFFFFF, yielding sequences of up to 6 bytes.  There is therefore
   a risk of buffer overflow if the range of character numbers is not
   explicitly limited to U+10FFFF or if buffer sizing doesn't take into
   account the possibility of 5- and 6-byte sequences.
Man, it really doesn't pay to be pedantic these days.
You missed the quoted text "UTF-8 is defined by the Unicode Standard
[UNICODE]." You also missed the part where it stated "The
authoritative definition of UTF-8 is in [UNICODE].  This grammar is
believed to describe the same thing Unicode describes, but does not
claim to be authoritative." So, even the RFC defers to the Unicode
standard.
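
To make the buffer-sizing point from the quoted Security Considerations
text concrete, here is a minimal C sketch of an encoder that limits
itself to the U+0000..U+10FFFF range, so a 4-octet buffer is always
enough. The function name utf8_encode and its return convention are
made up for illustration; they are not from any RFC or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (name and signature are assumptions).
 * Encodes one code point into out[] and returns the number of octets
 * written (1 to 4). Returns 0 for values that are not encodable:
 * surrogates (U+D800..U+DFFF) and anything above U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                       /* above U+10FFFF: refuse */
}
```

Because the range is checked explicitly, the caller never has to size
a buffer for 5- or 6-octet sequences.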

In the Unicode Standard, Section 3.9, Table 3-6, there is a listing
of the ranges of acceptable values for *ALL* valid UTF-8 octets. There
are no valid values for a 5th or 6th byte. Period. In the same
section, it states "the following byte values are disallowed in UTF-8:
C0–C1, F5–FF." For a so-called 5- or 6-octet UTF-8 encoded code point
to exist, its first byte would have to take on a value in the range
F5-FF (F8-FD, specifically), which the quoted text explicitly
disallows. Nowhere in the Unicode Standard is there any allowance for
a 5- or 6-octet UTF-8 encoding of a code point.
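
The lead-byte restriction can be sketched in a few lines of C. This
is only an illustration under stated assumptions: utf8_lead_len is a
hypothetical name, and it looks at the first byte only (a full
validator would also have to check the continuation bytes and their
restricted ranges after E0, ED, F0 and F4).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (name is an assumption, not a library call).
 * Returns the sequence length implied by the first byte at s,
 * or 0 if that byte can never start a well-formed UTF-8 sequence.
 */
int utf8_lead_len(const unsigned char *s)
{
    assert(s != NULL);
    unsigned char lead = *s;

    if (lead <= 0x7F)
        return 1;                   /* 00..7F: single octet */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;                   /* C2..DF: two octets */
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;                   /* E0..EF: three octets */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;                   /* F0..F4: four octets */

    /* 80..BF are continuation bytes; C0..C1 and F5..FF are
     * disallowed outright, which includes the F8..FD lead bytes
     * of the old 5- and 6-octet forms. */
    return 0;
}
```

There is no branch that returns 5 or 6, because every byte that
would start such a sequence falls in the disallowed F5-FF range.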

--
Clark S. Cox III
clarkcox3@gmail.com
http://www.livejournal.com/users/clarkcox3/
http://homepage.mac.com/clarkcox3/