Re: Information regarding UTF-8 code
- Subject: Re: Information regarding UTF-8 code
- From: Andrei Tchijov <email@hidden>
- Date: Mon, 29 Aug 2005 06:53:20 -0400
This blurb is from http://www1.tip.nl/~t876506/utf8tbl.html
...
UTF-8 encoding
The proper way to convert between UCS-4 and UTF-8 is to use bitmask
(and, or) and bitshift operations. But if you would like to convert
only a couple of characters by hand or if your program development
environment (scripting language) does not support bit operations,
then integer division and multiplication can be used as follows.
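As a rough illustration of why the two approaches are interchangeable: dividing by 64 is the same as shifting right by 6 bits, and taking the remainder modulo 64 is the same as masking with 3F hex. The sketch below compares both forms for code points that need two bytes. The function names are made up for this example, and Python is used only because it has both kinds of operators.

```python
def split_arithmetic(ud):
    """Split a 2-byte-range code point using only division and remainder."""
    return 192 + ud // 64, 128 + ud % 64


def split_bitwise(ud):
    """Same split using shift, mask and OR."""
    return 0xC0 | (ud >> 6), 0x80 | (ud & 0x3F)


# Both forms agree over the whole 2-byte range (128..2047).
assert all(split_arithmetic(ud) == split_bitwise(ud) for ud in range(128, 2048))

print(split_bitwise(0xE9))  # (195, 169), i.e. C3 A9 hex
```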
From Unicode UCS-4 to UTF-8:
Start with the Unicode number expressed as a decimal number and call
this ud.
If ud <=127 (7F hex) then UTF-8 is 1 byte long, the value of ud.
If ud >=128 and <=2047 (7FF hex) then UTF-8 is 2 bytes long.
byte 1 = 192 + (ud div 64)
byte 2 = 128 + (ud mod 64)
If ud >=2048 and <=65535 (FFFF hex) then UTF-8 is 3 bytes long.
byte 1 = 224 + (ud div 4096)
byte 2 = 128 + ((ud div 64) mod 64)
byte 3 = 128 + (ud mod 64)
If ud >=65536 and <=2097151 (1FFFFF hex) then UTF-8 is 4 bytes long.
byte 1 = 240 + (ud div 262144)
byte 2 = 128 + ((ud div 4096) mod 64)
byte 3 = 128 + ((ud div 64) mod 64)
byte 4 = 128 + (ud mod 64)
If ud >=2097152 and <=67108863 (3FFFFFF hex) then UTF-8 is 5 bytes long.
byte 1 = 248 + (ud div 16777216)
byte 2 = 128 + ((ud div 262144) mod 64)
byte 3 = 128 + ((ud div 4096) mod 64)
byte 4 = 128 + ((ud div 64) mod 64)
byte 5 = 128 + (ud mod 64)
If ud >=67108864 and <=2147483647 (7FFFFFFF hex) then UTF-8 is 6
bytes long.
byte 1 = 252 + (ud div 1073741824)
byte 2 = 128 + ((ud div 16777216) mod 64)
byte 3 = 128 + ((ud div 262144) mod 64)
byte 4 = 128 + ((ud div 4096) mod 64)
byte 5 = 128 + ((ud div 64) mod 64)
byte 6 = 128 + (ud mod 64)
The operation div means integer division and mod means the remainder
after integer division.
For positive numbers a div b = int(a/b) and a mod b = (a/b-int(a/b))*b.
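The rules above can be sketched in a few lines of Python, where `//` and `%` play the role of div and mod for non-negative numbers. The function name `ucs4_to_utf8` is invented for this example. The sketch stops at 4 bytes on the assumption that only code points up to 10FFFF hex matter (RFC 3629 restricts UTF-8 to that range); the 5- and 6-byte rules would follow the same pattern. It does not reject surrogate values (D800 to DFFF hex).

```python
def ucs4_to_utf8(ud):
    """Return the UTF-8 bytes for code point ud as a list of decimal values."""
    if ud < 0 or ud > 0x10FFFF:
        raise ValueError("code point out of range for this sketch")
    if ud < 128:
        return [ud]
    if ud <= 2047:
        return [192 + ud // 64, 128 + ud % 64]
    if ud <= 65535:
        return [224 + ud // 4096,
                128 + (ud // 64) % 64,
                128 + ud % 64]
    return [240 + ud // 262144,
            128 + (ud // 4096) % 64,
            128 + (ud // 64) % 64,
            128 + ud % 64]


print(ucs4_to_utf8(8482))  # [226, 132, 162]
```

One way to check such a function is to compare it with the language's own encoder, for example `bytes(ucs4_to_utf8(n)) == chr(n).encode("utf-8")`.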
UTF-8 sequences of 4 bytes and longer are at the moment not supported
by the regular browsers.
The highest character position currently (Unicode 3.2) defined is
number 10FFFF hex (1114111 dec) in a 'private use' area. The highest
character with an actual glyph is number E007F hex (917631 dec), the
CANCEL TAG character.
From UTF-8 to Unicode UCS-4:
Let's take a UTF-8 byte sequence. The first byte in a new sequence
will tell us how long the sequence is. Let's call the bytes of the
sequence, as decimal values and in order, z y x w v u.
If z is between and including 0 - 127, then there is 1 byte z. The
decimal Unicode value ud = the value of z.
If z is between and including 192 - 223, then there are 2 bytes z y;
ud = (z-192)*64 + (y-128)
If z is between and including 224 - 239, then there are 3 bytes z y x;
ud = (z-224)*4096 + (y-128)*64 + (x-128)
If z is between and including 240 - 247, then there are 4 bytes z y x w;
ud = (z-240)*262144 + (y-128)*4096 + (x-128)*64 + (w-128)
If z is between and including 248 - 251, then there are 5 bytes z y x w v;
ud = (z-248)*16777216 + (y-128)*262144 + (x-128)*4096 + (w-128)*64 + (v-128)
If z is 252 or 253, then there are 6 bytes z y x w v u;
ud = (z-252)*1073741824 + (y-128)*16777216 + (x-128)*262144 + (w-128)*4096 + (v-128)*64 + (u-128)
If z is between and including 128 - 191, it is a continuation byte
and cannot start a sequence.
If z = 254 or 255 then there is something wrong!
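The decoding rules can be sketched the same way. The function name `utf8_to_ucs4` is invented for this example, and the sketch is limited to sequences of up to 4 bytes on the assumption that longer forms are not needed; any other lead byte is rejected. Instead of one long formula per case, it multiplies the running value by 64 for each following byte, which gives the same result as the formulas above. It does not check for overlong encodings or surrogates.

```python
def utf8_to_ucs4(seq):
    """Decode one UTF-8 sequence (list of decimal byte values) to a code point."""
    z = seq[0]
    if 0 <= z <= 127:
        length, ud = 1, z
    elif 192 <= z <= 223:
        length, ud = 2, z - 192
    elif 224 <= z <= 239:
        length, ud = 3, z - 224
    elif 240 <= z <= 247:
        length, ud = 4, z - 240
    else:
        raise ValueError("invalid or unsupported lead byte: %d" % z)
    if len(seq) != length:
        raise ValueError("expected %d bytes, got %d" % (length, len(seq)))
    for b in seq[1:]:
        if not 128 <= b <= 191:
            raise ValueError("invalid continuation byte: %d" % b)
        # Same as adding (b-128) times the matching power of 64.
        ud = ud * 64 + (b - 128)
    return ud


print(utf8_to_ucs4([226, 132, 162]))  # 8482
```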
Example: take the Unicode value 8482 (decimal), which is the
trademark sign. This number lies between 2048 and 65535, so we get
three bytes.
The first number is 224 + (8482 div 4096) = 224 + 2 = 226.
The second number is 128 + ((8482 div 64) mod 64) = 128 + (132 mod 64)
= 128 + 4 = 132.
The third number is 128 + (8482 mod 64) = 128 + 34 = 162.
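The hand calculation can be cross-checked against a built-in encoder. The snippet below is only a sanity check, written in Python with variable names chosen for this example.

```python
ud = 8482  # trademark sign, U+2122

first = 224 + ud // 4096
second = 128 + (ud // 64) % 64
third = 128 + ud % 64

# Python's own UTF-8 encoder should produce the same three bytes.
encoded = list(chr(ud).encode("utf-8"))
assert [first, second, third] == encoded

print(first, second, third)  # 226 132 162
```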
Now the other way round. We see the numbers 226, 132 and 162. What is
the decimal Unicode value?
In this case: (226-224)*4096+(132-128)*64+(162-128) = 8482.
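The reverse calculation can be checked the same way. This is a small Python sketch with made-up variable names, comparing the formula with the built-in decoder.

```python
data = bytes([226, 132, 162])

# Three-byte rule: (z-224)*4096 + (y-128)*64 + (x-128)
ud = (data[0] - 224) * 4096 + (data[1] - 128) * 64 + (data[2] - 128)

# Python's own UTF-8 decoder should yield a single character with that value.
text = data.decode("utf-8")
assert len(text) == 1 and ord(text) == ud

print(ud)  # 8482
```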
And the conversion between hexadecimal and decimal? Come on, this is
not a math tutorial! In case you don't know, use a calculator.
...
Also
http://en.wikipedia.org/wiki/UTF-8
On Aug 29, 2005, at 06:20, Ratan Bhangale wrote:
Dear All,
We are building an application on Mac OS using the Cocoa framework
and Objective-C. All messages that are displayed dynamically to the
user should be encoded in UTF-8. We would like some sample code
showing how to convert plain text or Unicode to UTF-8.
Thanks and Regards
Ratan Bhangale
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Cocoa-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden