On Jul 9, 2013, at 11:45 PM, Kaydell Leavitt <email@hidden> wrote:
Does this make sense? The id of the character is only 233 but the percent-encoding makes it look like the accented é takes two bytes to encode.
That's correct for UTF-8. UTF-8 uses one byte to encode codepoints 0 to 127 (ASCII), two bytes to encode codepoints 128 to 2047, three bytes to encode codepoints 2048 to 65535, and four bytes for codepoints 65536 and up.
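Just to make the size rule concrete, here's a quick C sketch (the utf8_len name is mine, purely illustrative) that maps a codepoint to the number of bytes it needs:

#include <stdio.h>

/* How many UTF-8 bytes a given codepoint needs (sketch only;
   ignores the surrogate range and the U+10FFFF ceiling). */
static int utf8_len(unsigned cp) {
    if (cp < 0x80)    return 1;   /* 0..127: ASCII */
    if (cp < 0x800)   return 2;   /* 128..2047: all of Latin-1 fits here */
    if (cp < 0x10000) return 3;   /* 2048..65535 */
    return 4;                     /* 65536 and up */
}

int main(void) {
    printf("%d\n", utf8_len(233));   /* 2 -- hence the two percent-escapes */
    return 0;
}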
The range of characters that can be encoded in two bytes includes Latin-1 (which is where you'll find é). Even though Latin-1 codepoints are small enough to fit in a single byte, UTF-8 can't encode them that way: the single-byte form is reserved for 0 to 127, because the high bit is what tells a decoder whether a byte stands alone or is part of a multi-byte sequence.
If the binary representation of a codepoint is xxx...xxx, UTF-8 uses the shortest of the following sequences that has enough xs to contain the value:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
233 in binary is 11101001, which UTF-8 breaks into 00011_101001 and encodes as 110_00011 10_101001, i.e. the bytes 0xC3 0xA9.
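Spelled out in code, the two-byte case looks like this (again, just an illustrative sketch):

#include <stdio.h>

int main(void) {
    unsigned cp = 233;                          /* U+00E9, é */
    unsigned char lead  = 0xC0 | (cp >> 6);     /* 110xxxxx: top 5 bits  */
    unsigned char trail = 0x80 | (cp & 0x3F);   /* 10xxxxxx: low 6 bits  */
    printf("%%%02X%%%02X\n", lead, trail);      /* prints %C3%A9 */
    return 0;
}

That's why the percent-encoded form of é shows two escapes, %C3%A9, even though the codepoint itself is only 233.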
This isn't a consequence of using combining marks. If you decomposed é into LATIN SMALL LETTER E (U+0065, decimal 101) followed by COMBINING ACUTE ACCENT (U+0301, decimal 769), UTF-8 would need three bytes: one for the e (which wouldn't need to be URL-encoded) plus two more for the accent.
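Working the accent through the same table: 769 is 1100000001 in binary, which splits into 01100_000001 and encodes as 110_01100 10_000001, i.e. 0xCC 0x81. A quick (purely illustrative) check of the byte counts:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *composed   = "\xC3\xA9";    /* é as the single codepoint U+00E9 */
    const char *decomposed = "e\xCC\x81";   /* e followed by U+0301             */
    printf("%zu %zu\n", strlen(composed), strlen(decomposed));   /* prints 2 3 */
    return 0;
}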
-Ron Hunsinger