On Mon, 28 Mar 2005 23:13:39 -0500 (EST), Michael B Allen <mba2000@ioplex.com> wrote:
Clark Cox said:
On Mon, 28 Mar 2005 14:09:11 -0500, Michael B Allen <mba2000@ioplex.com> wrote:
On Mon, 28 Mar 2005 08:39:25 -0500 Clark Cox <clarkcox3@gmail.com> wrote:
Actually, you will *never* see UTF-8 with more than 4 octets per codepoint. Period. That is the way that UTF-8 is defined. If you see a 5 or 6 octet character, then you are not reading UTF-8 data.
This is incorrect. Please read the first sentence in section 2 of RFC 2279
RFC 2279 has been obsoleted by RFC 3629 for years now: http://www.ietf.org/rfc/rfc3629.txt
First, this is dated 16 months ago. Pardon me for googling "utf-8 rfc" but don't be a twit - it's not "years".
Is there really any need for the namecalling?
From said RFC: [snip] 3. UTF-8 definition
UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.
This just says that range takes 4 octets. Well yeah. Duh. As I stated previously, that is the only practical range of values one would ever see, but technically UTF-8 can encode sequences of up to 6 bytes.
Not as defined by the Unicode Standard.
RFC 3629 mentions this in the Security Considerations section:
Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.
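To make the quoted paragraph concrete, here is a minimal Python sketch of an encoder that explicitly limits the range to U+10FFFF, so its output can never exceed 4 bytes. The names `encode_utf8` and `MAX_CODE_POINT` are illustrative and do not come from the RFC; the surrogate check reflects RFC 3629's prohibition on encoding U+D800..U+DFFF.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit stated in RFC 3629 (illustrative constant name)


def encode_utf8(cp: int) -> bytes:
    """Encode one code point as UTF-8, refusing anything outside U+0000..U+10FFFF."""
    if cp < 0 or cp > MAX_CODE_POINT:
        # Without this check, larger values would need 5- or 6-byte sequences
        # and could overrun a buffer sized for 4 bytes per code point.
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point not encodable: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

With the range check in place, a caller can safely size an output buffer at 4 bytes per code point.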
Man, it really doesn't pay to be pedantic these days.
You missed the quoted text "UTF-8 is defined by the Unicode Standard [UNICODE]." You also missed the part where it stated "The authoritative definition of UTF-8 is in [UNICODE]. This grammar is believed to describe the same thing Unicode describes, but does not claim to be authoritative."
So, even the RFC defers to the Unicode standard. In the Unicode standard, chapter 3, section 9, figure 3-6, there is a listing of the ranges of acceptable values for *ALL* valid UTF-8 octets. There are no valid values for a 5th or 6th byte. Period.
In the same section, it states "the following byte values are disallowed in UTF-8: C0–C1, F5–FF." For a so-called 5- or 6-octet UTF-8 encoded code point to exist, the first byte would have to take on a value in the range F5-FF, which the quoted text explicitly disallows. Nowhere in the Unicode standard is there any allowance for a 5- or 6-octet UTF-8 encoding of a code point.
-- 
Clark S. Cox III
clarkcox3@gmail.com
http://www.livejournal.com/users/clarkcox3/
http://homepage.mac.com/clarkcox3/
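The disallowed byte values quoted above can be checked mechanically. The following is a rough Python sketch, not a full validator: `sequence_length` (a hypothetical helper name) only classifies the first byte of a sequence and ignores the additional constraints on the bytes that follow, while `decodes_as_utf8` simply asks Python's built-in strict decoder for its verdict.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0-0xC1 and 0xF5-0xFF are disallowed.
    return 0


def decodes_as_utf8(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the given bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Under the older 5- and 6-octet scheme, the lead bytes would be F8..FB and FC..FD; `sequence_length` returns 0 for all of them, and `decodes_as_utf8` rejects byte strings such as `b"\xf8\x88\x80\x80\x80"` that begin with one.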