Re: Unicode search
- Subject: Re: Unicode search
- From: Helmut Fuchs <email@hidden>
- Date: Fri, 21 Mar 2003 14:15:40 +0100
At 12:16 +0000 on 21.03.2003, John Delacour wrote:
> By definition UTF-16 is two bytes. 256 * 256 = 65536, so that's the
> limit. In practice there are fewer code points assigned than that.
John, you really disappoint me. Before pointing a "by definition" at
someone, you should read the definition, I guess.
The number in a UTF name only tells you how big the units are in which
the Unicode data is stored. UTF-8 uses 8-bit units, UTF-16 uses 16-bit
units and UTF-32 uses 32-bit units.
In RFC 2279 you can find that a UTF-8 character can be made up of up
to 6 units of 8 bits. And this link says that the Unicode standard
allows for 21 bits to encode characters:
<http://www.unicode.org/faq/utf_bom.html#9>. 21 bits clearly don't
fit into two bytes.
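To make the arithmetic concrete, here is a minimal Python sketch (Python is only my choice for illustration, it is not from the thread). It counts how many code units one character needs in each encoding form. The helper name code_units and the three sample characters are assumptions picked for the example.

```python
def code_units(ch):
    """Count the code units one code point needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units
        "utf-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    }

# Sample characters (assumed): LATIN CAPITAL A, precomposed e-acute,
# and MUSICAL SYMBOL G CLEF, which lies above U+FFFF.
for ch in ("A", "\u00e9", "\U0001D11E"):
    bits = ord(ch).bit_length()
    print(f"U+{ord(ch):04X} needs {bits} bits:", code_units(ch))
```

The last sample needs more than 16 bits, so it cannot be a single UTF-16 unit; the other two fit in one unit each.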
>> Of course, the most common characters are coded into 2 bytes under UTF-16.
> All of them. Give me an example of a character in UTF-16 that is
> not two bytes.
Please read more about Unicode before making such claims. For example,
an accented character in decomposed form takes up two UTF-16 units,
but AFAIK it should be treated as a _single_ character. And as I said
before, the current Unicode standard allows for 21 bits of character
encoding. To make this possible, UTF-16 implements a mechanism called
"surrogate pairs":
<http://www.unicode.org/faq/utf_bom.html#6>
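Both cases can be checked with a short Python sketch. This is only an illustration under my own assumptions: the helper names to_surrogates and utf16_units are made up for the example, and U+1D11E and e-acute are arbitrary sample characters. The surrogate arithmetic follows the usual description: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits.

```python
import unicodedata

def to_surrogates(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    v = cp - 0x10000             # 20 bits remain
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return high, low

def utf16_units(s):
    """Return the 16-bit code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# One character above U+FFFF becomes two 16-bit units.
clef = "\U0001D11E"
print([hex(u) for u in to_surrogates(ord(clef))])
print([hex(u) for u in utf16_units(clef)])

# One accented character in decomposed form (NFD) is also two units:
# the base letter followed by a combining accent.
decomposed = unicodedata.normalize("NFD", "\u00e9")
print([hex(u) for u in utf16_units(decomposed)])
```

In both cases a reader sees one character, but a search routine that works unit by unit sees two.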
And this says something about ignoring surrogates altogether (bad
idea, but I know you were thinking of it already ;-):
<http://www.unicode.org/faq/utf_bom.html#17>
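As a rough illustration of why ignoring surrogates is a bad idea, the following Python sketch walks a list of 16-bit units and joins surrogate pairs back into code points. The function name code_points and the sample data are assumptions for the example; it is a sketch, not a complete decoder, and an unpaired surrogate is simply passed through unchanged.

```python
def code_points(units):
    """Yield code points from a sequence of 16-bit UTF-16 code units."""
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Recombine the pair into one code point above U+FFFF.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # Ordinary unit (or an unpaired surrogate, passed through as-is).
            yield u
            i += 1

# Assumed sample data: "A", U+1D11E as a surrogate pair, "B".
units = [0x0041, 0xD834, 0xDD1E, 0x0042]
print(len(units))                             # number of 16-bit units
print([hex(cp) for cp in code_points(units)]) # the characters they encode
```

Code that counts or matches by unit reports four items here, while the text holds three characters, so offsets and match lengths come out wrong as soon as a surrogate pair is involved.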
Best regards,
Helmut
_______________________________________________
applescript-users mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/applescript-users
Do not post admin requests to the list. They will be ignored.