Re: AS and Unicode characters

  • Subject: Re: AS and Unicode characters
  • From: "Mark J. Reed" <email@hidden>
  • Date: Fri, 5 Jan 2007 09:27:37 -0500

On 1/5/07, KOENIG Yvan <email@hidden> wrote:
> Alas, when pasting in TextEdit I discovered that the code only
> grabbed the first two bytes, 01D1, giving me an infamous
> LATIN CAPITAL LETTER O WITH CARON (whose code is 01D1)
> when I wanted a
> MUSICAL SYMBOL DOUBLE FLAT (01D12B).

I'm pretty sure the native encoding used by OS X is UTF-16, which means that in order to deal with code points above U+FFFF, you have to use surrogate pairs. Basically, you generate two separate 16-bit code units that together represent a single scalar value; each of them is essentially a single "digit" in base 1024, meaning you can represent 1024 x 1024 = 1,048,576 characters that way. Add the 65,536 code points in the Basic Multilingual Plane and you get the full Unicode code space of 1,114,112 code points.
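The arithmetic behind those totals can be checked with a few lines of Python. This is only a sketch for illustration; the variable names are my own and not part of any API.

```python
# Each surrogate carries 10 bits, i.e. one "digit" in base 1024.
SURROGATE_DIGITS = 1024
BMP_SIZE = 0x10000  # 65,536 code points, U+0000 through U+FFFF

# Two base-1024 digits cover everything above the BMP.
supplementary = SURROGATE_DIGITS * SURROGATE_DIGITS
total = BMP_SIZE + supplementary

print(f"{supplementary:,}")  # 1,048,576
print(f"{total:,}")          # 1,114,112
print(hex(total - 1))        # 0x10ffff, the highest Unicode code point
```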

The UTF-16 representation of U+1D12B consists of U+D834 followed by
U+DD2B.  Here's how you get that from the scalar value (1D12B hex =
119,083 decimal).

1. Subtract 10000 hex = 65,536 decimal; this is the scalar value
represented by a surrogate pair number of zero.  This is easy in the
hexadecimal version - just drop the leading 1.  The result is D12B.

2. To convert to "base 1024", just divide and keep the quotient and
remainder.  D12B hex = 53,547 decimal.  Divide by 1024 and you get 52
with a remainder of 299.  So the two "base 1024 digits" we want to
output are 52 and 299 decimal; hex will be easier to work with, so
that's 34 and 12B.

3. The first digit comes from the high surrogates area, which starts
at D800.  Just add: D800 + 34 = D834.  That's the first "character"
output.

4. The second digit comes from the low surrogates area at DC00.  DC00
+ 12B = DD2B.
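
The four steps above can be sketched in Python. This is just an illustration of the arithmetic, not AppleScript and not anything OS X provides; the function name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(scalar):
    """Split a scalar value above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value must be in the range 10000..10FFFF hex")
    # Step 1: subtract 10000 hex.
    offset = scalar - 0x10000
    # Step 2: convert to "base 1024" (quotient and remainder).
    high_digit, low_digit = divmod(offset, 1024)
    # Steps 3 and 4: add the digits to the start of each surrogate area.
    return 0xD800 + high_digit, 0xDC00 + low_digit


high, low = to_surrogate_pair(0x1D12B)
print(f"{high:04X} {low:04X}")  # D834 DD2B

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print("\U0001D12B".encode("utf-16-be").hex().upper())  # D834DD2B
```

Using divmod keeps the quotient and remainder from step 2 in one call; shifting right by 10 bits and masking with 3FF hex would give the same two digits.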

--
Mark J. Reed <email@hidden>