Re: Unicode search
- Subject: Re: Unicode search
- From: John Delacour <email@hidden>
- Date: Fri, 21 Mar 2003 14:09:46 +0000
- Mac-eudora-version: 6.0a11
At 2:15 pm +0100 21/3/03, Helmut Fuchs wrote:
John, you really disappoint me. Before pointing a "by definition" at
someone, you should read the definition, I guess.
The numbers in the UTF names only tell you how big the chunks are in
which the Unicode data is stored. UTF-8 uses 8-bit units, UTF-16 uses
16-bit units and UTF-32 uses 32-bit units.
Of course. That's what I said.
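
To make the point about unit sizes concrete, here is a small Python sketch that counts how many code units one code point occupies in each encoding form. The helper name code_units and the three sample characters are my own choices for illustration, not anything from the thread.

```python
def code_units(text: str) -> dict:
    """Return the number of code units needed for text in each encoding form."""
    return {
        # UTF-8 code units are single bytes.
        "utf-8": len(text.encode("utf-8")),
        # UTF-16 code units are 2 bytes; the -be codec writes no byte order mark.
        "utf-16": len(text.encode("utf-16-be")) // 2,
        # UTF-32 code units are 4 bytes.
        "utf-32": len(text.encode("utf-32-be")) // 4,
    }

counts_u = code_units("\u0075")         # LATIN SMALL LETTER U
counts_uuml = code_units("\u00fc")      # LATIN SMALL LETTER U WITH DIAERESIS
counts_clef = code_units("\U0001d11e")  # MUSICAL SYMBOL G CLEF, outside the BMP

for label, counts in (("U+0075", counts_u),
                      ("U+00FC", counts_uuml),
                      ("U+1D11E", counts_clef)):
    print(label, counts)
```

The number in the name fixes the size of one unit, not the number of units a given code point needs.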
RFC 2279 says that a UTF-8 character can be made up of up to 6 units
of 8 bits.
I know that. We are not talking about UTF-8.
And this link says that the Unicode standard allows for 21 bits to
encode characters: <http://www.unicode.org/faq/utf_bom.html#9>. 21
bits clearly don't fit into two bytes.
That link says nothing of the kind. It says:
"both Unicode and ISO 10646 have policies in place that formally
limit even the UTF-32 encoding form to the integer range that can be
expressed with UTF-16 (or 21 significant bits)."
Please read more about Unicode before making such claims.
I seem to understand it better than you...
For example an accented character in decomposed form takes up two
UTF-16 units, but AFAIK it should be treated as a _single_ character.
The glyph for u with umlaut can be represented with TWO characters,
LATIN SMALL LETTER U (U+0075) followed by COMBINING DIAERESIS (U+0308)
OR
with a SINGLE character named LATIN SMALL LETTER U WITH DIAERESIS (U+00FC)
In the first case the fact that the combining diaeresis character
does not move the carriage does not mean that it is not a separate
character.
A single GLYPH may be composed of more than one UTF-16 character.
You are confusing "glyph" with "character". They are two different
animals, as you will learn if you read the specification.
And as said before: the current Unicode standard allows for 21 bits
of character encoding - to allow this, UTF-16 implements a mechanism
called "surrogate pairs":
And a pair of UTF-16 characters is two characters.
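
For reference, the surrogate mechanism is plain arithmetic. The Python sketch below splits a code point above U+FFFF into its two 16-bit units; the function name to_surrogates and the sample code point U+1D11E are my own choices for illustration.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)       # MUSICAL SYMBOL G CLEF

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001d11e".encode("utf-16-be")

# The largest code point reachable this way needs 21 significant bits.
max_bits = (0x10FFFF).bit_length()

print(hex(high), hex(low), encoded.hex(), max_bits)
```

The result is two 16-bit units in storage that together stand for one code point.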
JD