Re: Unicode search

  • Subject: Re: Unicode search
  • From: John Delacour <email@hidden>
  • Date: Fri, 21 Mar 2003 14:09:46 +0000
  • Mac-eudora-version: 6.0a11

At 2:15 pm +0100 21/3/03, Helmut Fuchs wrote:

> John, you really disappoint me. Before pointing a "by definition" at someone, you should read the definition, I guess.

> The numbers in the UTF names only tell how big the chunks are in which the Unicode data is stored. UTF-8 uses 8-bit units, UTF-16 uses 16-bit units and UTF-32 uses 32-bit units.

Of course. That's what I said.
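
To make the point about unit sizes concrete, here is a minimal Python sketch (not part of the original exchange) that counts how many code units a string occupies in each encoding form. The helper name `code_unit_counts` is hypothetical; only the standard `str.encode` method is used.

```python
def code_unit_counts(text: str) -> dict:
    """Count the code units needed for `text` in each encoding form.

    The big-endian variants are used so that no byte order mark is
    prepended and the byte length divides cleanly by the unit size.
    """
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit units
        "utf-16": len(text.encode("utf-16-be")) // 2,  # 16-bit units
        "utf-32": len(text.encode("utf-32-be")) // 4,  # 32-bit units
    }


print(code_unit_counts("u"))           # plain ASCII letter
print(code_unit_counts("\u00fc"))      # LATIN SMALL LETTER U WITH DIAERESIS
print(code_unit_counts("\U0001D11E"))  # a code point above U+FFFF
```

The number in the name fixes the width of one unit, not the number of units a given code point needs.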

> In RFC 2279 you can find that a UTF-8 character can be made up of up to 6 units of 8 bits.

I know that. We are not talking about UTF-8.

> And this link says that the Unicode standard allows for 21 bits to encode characters: <http://www.unicode.org/faq/utf_bom.html#9>. 21 bits clearly don't fit into two bytes.

That link says nothing of the kind. It says:

"both Unicode and ISO 10646 have policies in place that formally limit even the UTF-32 encoding form to the integer range that can be expressed with UTF-16 (or 21 significant bits)."
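
The "21 significant bits" figure in that quotation can be checked with a little arithmetic. The sketch below is only an illustration; the constant names are made up for this example, and the surrogate ranges (U+D800 to U+DBFF and U+DC00 to U+DFFF) are taken from the Unicode standard rather than from this thread.

```python
# Highest code point in the Unicode code space.
MAX_CODE_POINT = 0x10FFFF

# Highest value reachable with a UTF-16 surrogate pair: the last high
# surrogate (0xDBFF) combined with the last low surrogate (0xDFFF).
HIGHEST_VIA_SURROGATES = (
    0x10000 + ((0xDBFF - 0xD800) << 10) + (0xDFFF - 0xDC00)
)

print(hex(HIGHEST_VIA_SURROGATES))  # 0x10ffff
print(MAX_CODE_POINT.bit_length())  # 21

# Python enforces the same limit: chr() rejects anything above it.
try:
    chr(MAX_CODE_POINT + 1)
except ValueError as exc:
    print("rejected:", exc)
```

So the range UTF-16 can express and the 21-bit limit are the same boundary stated two ways.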

> Please read more about Unicode before making such claims.

I seem to understand it better than you...

> For example, an accented character in decomposed form takes up two UTF-16 units, but AFAIK it should be treated as a _single_ character.

The glyph for u with umlaut can be represented with TWO characters named LATIN SMALL LETTER U (U+0075) + COMBINING DIAERESIS (U+0308)

OR

with a SINGLE character named LATIN SMALL LETTER U WITH DIAERESIS (U+00FC)

In the first case the fact that the combining diaeresis character does not move the carriage does not mean that it is not a separate character.

A single GLYPH may be composed of more than one UTF-16 character. You are confusing "glyph" with "character". They are two different animals, as you will learn if you read the specification.
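
The two representations of u with diaeresis can be compared directly. This is a small sketch using Python's standard `unicodedata` module, assuming the decomposed form is the base letter U+0075 followed by COMBINING DIAERESIS (U+0308); the variable names are illustrative only.

```python
import unicodedata

decomposed = "u\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS
composed = "\u00fc"     # LATIN SMALL LETTER U WITH DIAERESIS

# Both strings render as the same glyph but hold a different number
# of code points.
print(len(decomposed), [unicodedata.name(c) for c in decomposed])
print(len(composed), [unicodedata.name(c) for c in composed])

# Normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# U+0303 is a different combining mark altogether.
print(unicodedata.name("\u0303"))  # COMBINING TILDE
```

A search routine that compares raw code units will treat the two forms as different strings unless both sides are normalized first.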

> And as said before: the current Unicode standard allows for 21 bits of character encoding - to allow this, UTF-16 implements a mechanism called "surrogate pairs":

And a pair of UTF-16 characters is two characters.
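
For reference, the surrogate mechanism itself is simple arithmetic. The following sketch shows how one code point above U+FFFF maps to two 16-bit units; the function name `to_surrogates` is hypothetical, and the example takes no side on whether the resulting pair should be called one character or two.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000         # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))  # 0xd834 0xdd1e

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001D11E".encode("utf-16-be")
print(encoded == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF]))
```

The pair occupies two 16-bit units in storage while identifying a single code point, which is the distinction the disagreement above turns on.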

JD