Re: Producing Unicode-only characters
- Subject: Re: Producing Unicode-only characters
- From: "Mark J. Reed" <email@hidden>
- Date: Wed, 26 Oct 2005 14:27:07 -0400
On 10/26/05, bill <email@hidden> wrote:
I'm surprised to see the glyph of U+28CCA, the meaning of this
character is not, ... well, suitable for public discussion.
Well, now you've gone and piqued our curiosity. :) What does it mean?
BTW, «data utxt00028CCA» does not produce code point U+28CCA, you may
compare this one:
Right. That's just two characters, U+0002 and U+8CCA. As I
said in my earlier message, "unicode text" is stored using
UTF-16, which means you have to use surrogates to get
characters above U+FFFF.
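A quick way to see this outside AppleScript is to decode the same four bytes as UTF-16. This is only a sketch in Python, and it assumes the hex digits in the data literal are read as big-endian 16-bit units; the variable names are made up for illustration.

```python
# The eight hex digits from "data utxt00028CCA" are four bytes.
# Assumption: they are interpreted as big-endian UTF-16 code units.
raw = bytes.fromhex("00028CCA")
text = raw.decode("utf-16-be")

# Each 16-bit unit decodes to its own character; nothing here forms
# a single code point above U+FFFF.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(len(text), code_points)
```

The result is two separate characters, not one supplementary character.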
Anyone know the mechanism why & now code point beyond U+FFFF is
composed by hex values?
Yes: surrogate pairs. The code points in the range U+D800 through
U+DFFF are reserved for this purpose. Essentially, characters
above U+FFFF are represented as a two-digit base-1024 number whose
value is the difference between the desired character and the first
code point that doesn't fit in 16 bits. In other words,
U+10000 = 65536 decimal is stored as 0, U+10001 is stored as 1,
etc. The highest representable value is therefore the sum of hex
10000 + FFFFF = U+10FFFF, which in decimal is the nicely palindromic
number 1114111.
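The offset arithmetic can be checked with a few lines of Python. This is a sketch only; the constant and function names are invented for the example.

```python
FIRST_SUPPLEMENTARY = 0x10000      # first code point that doesn't fit in 16 bits
MAX_OFFSET = 1024 * 1024 - 1       # largest two-digit base-1024 number (hex FFFFF)

def offset_of(code_point):
    """Value actually stored in the surrogate pair for a code point."""
    return code_point - FIRST_SUPPLEMENTARY

highest = FIRST_SUPPLEMENTARY + MAX_OFFSET
is_palindrome = str(highest) == str(highest)[::-1]

print(offset_of(0x10000), offset_of(0x10001))
print(f"{MAX_OFFSET:X}", f"U+{highest:X}", highest, is_palindrome)
```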
The first (high) digit is chosen from U+D800 through U+DBFF (D800 = 0,
D801 = 1, ..., DBFE = 1022, DBFF = 1023) and the second (low) digit is
chosen from U+DC00 through U+DFFF the same way.
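Read in the other direction, the two "digits" recombine into a code point. The following Python sketch assumes the ranges described above; from_surrogates is a hypothetical helper name, not part of any library.

```python
HIGH_BASE = 0xD800   # high digit 0
LOW_BASE = 0xDC00    # low digit 0

def from_surrogates(high, low):
    """Combine a high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate out of range")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate out of range")
    # Two-digit base-1024 number, offset by the first code point
    # that doesn't fit in 16 bits.
    return 0x10000 + (high - HIGH_BASE) * 1024 + (low - LOW_BASE)

lowest = from_surrogates(0xD800, 0xDC00)
highest = from_surrogates(0xDBFF, 0xDFFF)
print(f"U+{lowest:X}", f"U+{highest:X}")
```

The smallest pair gives U+10000 and the largest gives U+10FFFF, matching the limits worked out earlier.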
So, for our unspeakable character U+28CCA:
1. Get the Unicode scalar value, which you do by just converting from
hexadecimal into a number we can do math on. "28CCA" is
hexadecimal for 167114.
2. Subtract 65536: 167114 - 65536 = 101578
3. Divide by 1024, yielding an integer quotient and a remainder: 101578 / 1024 = 99 with remainder 202
4. The quotient is the high surrogate value. Add it to U+D800 and
store that character. 99 decimal = 63 hex, so the first character
is U+D863.
5. The remainder is the low surrogate value. Add it to U+DC00 and
store that character. 202 decimal = CA hex, so the second
character is U+DCCA.
So («data utxtD863DCCA» as Unicode text) should yield U+28CCA.
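The five steps can also be run mechanically. Here is a minimal Python sketch; to_surrogates is a hypothetical helper written for this message, and the last lines cross-check its answer against Python's built-in UTF-16 encoder.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    # Step 2: subtract 65536. Step 3: divide by 1024.
    quotient, remainder = divmod(code_point - 0x10000, 1024)
    # Steps 4 and 5: add the two digits to the surrogate bases.
    return 0xD800 + quotient, 0xDC00 + remainder

high, low = to_surrogates(0x28CCA)
hex_digits = f"{high:04X}{low:04X}"   # digits that follow "utxt" in the data literal

# Independent check: encode the same character with the standard codec.
builtin = chr(0x28CCA).encode("utf-16-be").hex().upper()
print(hex_digits, builtin, hex_digits == builtin)
```

If the two strings agree, the hand arithmetic and the standard encoder produce the same pair.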
--
Mark J. Reed <email@hidden>
_______________________________________________
Applescript-users mailing list (email@hidden)