On Mon, 28 Mar 2005 03:51:50 -0500, Michael B Allen <mba2000@ioplex.com> wrote:
On Mon, 28 Mar 2005 11:36:10 +0400 Alexey Proskuryakov <ap-carbon@rambler.ru> wrote:
Each character may occupy between 1 and 6 bytes [1].
More precisely, between 1 and 4: <http://www.unicode.org/faq/utf_bom.html#30>.
At the risk of being pedantic, this is just talking about how to convert a UTF-16 character into a UTF-8 one. Because UTF-16 with a surrogate pair can only represent 21 bits of the Unicode code space, only 4 bytes are necessary to encode any character in UTF-8.
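[Illustration, not part of the original message: a rough C sketch of the surrogate arithmetic. Decoding a high/low surrogate pair can never yield a code point above U+10FFFF, i.e. 21 bits, which is why four UTF-8 bytes always suffice.]

#include <stdint.h>
#include <stdio.h>

/* Combine a UTF-16 surrogate pair into a single code point.
 * high must be in 0xD800..0xDBFF and low in 0xDC00..0xDFFF
 * (caller is assumed to have checked this already). */
static uint32_t decode_surrogate_pair(uint16_t high, uint16_t low)
{
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10) | (uint32_t)(low - 0xDC00));
}

int main(void)
{
    /* U+1D11E MUSICAL SYMBOL G CLEF is D834 DD1E in UTF-16. */
    uint32_t cp = decode_surrogate_pair(0xD834, 0xDD1E);
    printf("U+%04X\n", (unsigned)cp);  /* prints U+1D11E, well under 0x110000 */
    return 0;
}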
Unicode *only has* 21 bits of code space; even UTF-32 only uses 21 bits.
But UTF-8 can encode the full 31-bit code space, which needs at most 6 bytes. But unless you're doing Klingon, you'll never actually see more than 4.
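[Illustration, not from the thread: a rough C sketch of UTF-8 sequence length by code point range. The 5- and 6-byte rows exist only in the original 31-bit scheme (RFC 2279); everything Unicode actually assigns fits in the first four.]

#include <stdint.h>

/* Number of octets needed to encode cp in UTF-8, per the original
 * 31-bit table.  Code points up to U+10FFFF never need more than 4. */
static int utf8_sequence_length(uint32_t cp)
{
    if (cp < 0x80)       return 1;  /*  7 bits */
    if (cp < 0x800)      return 2;  /* 11 bits */
    if (cp < 0x10000)    return 3;  /* 16 bits */
    if (cp < 0x200000)   return 4;  /* 21 bits -- covers all of Unicode */
    if (cp < 0x4000000)  return 5;  /* 26 bits, legacy forms only */
    return 6;                       /* 31 bits, legacy forms only */
}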
Actually, you will *never* see UTF-8 with more than 4 octets per codepoint. Period. That is the way that UTF-8 is defined. If you see a 5 or 6 octet character, then you are not reading UTF-8 data.

--
Clark S. Cox III
clarkcox3@gmail.com
http://www.livejournal.com/users/clarkcox3/
http://homepage.mac.com/clarkcox3/
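[Illustration, not Clark's code: a minimal C check in the spirit of the statement above. In UTF-8 as currently defined (RFC 3629), no sequence is longer than four octets, so any byte in the range 0xF8..0xFF means the data is not UTF-8.]

#include <stddef.h>

/* Return 0 if buf contains a byte that could only start a 5- or
 * 6-octet sequence (or is 0xFE/0xFF); such bytes never occur in
 * well-formed UTF-8. */
static int utf8_lead_bytes_ok(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0xF8)  /* 111110xx, 1111110x, 0xFE, 0xFF */
            return 0;        /* would need 5+ octets: not UTF-8 */
    }
    return 1;
}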