Re: Information regarding UTF-8 code


  • Subject: Re: Information regarding UTF-8 code
  • From: Andrei Tchijov <email@hidden>
  • Date: Mon, 29 Aug 2005 06:53:20 -0400

This blurb is from http://www1.tip.nl/~t876506/utf8tbl.html
...
UTF-8 encoding
The proper way to convert between UCS-4 and UTF-8 is to use bitmask (and, or) and bitshift operations. But if you would like to convert only a couple of characters by hand or if your program development environment (scripting language) does not support bit operations, then integer division and multiplication can be used as follows.


From Unicode UCS-4 to UTF-8:
Start with the Unicode number expressed as a decimal number and call this ud.


If ud <=127 (7F hex) then UTF-8 is 1 byte long: the value of ud itself.

If ud >=128 and <=2047 (7FF hex) then UTF-8 is 2 bytes long.
   byte 1 = 192 + (ud div 64)
   byte 2 = 128 + (ud mod 64)

If ud >=2048 and <=65535 (FFFF hex) then UTF-8 is 3 bytes long.
   byte 1 = 224 + (ud div 4096)
   byte 2 = 128 + ((ud div 64) mod 64)
   byte 3 = 128 + (ud mod 64)

If ud >=65536 and <=2097151 (1FFFFF hex) then UTF-8 is 4 bytes long.
   byte 1 = 240 + (ud div 262144)
   byte 2 = 128 + ((ud div 4096) mod 64)
   byte 3 = 128 + ((ud div 64) mod 64)
   byte 4 = 128 + (ud mod 64)

If ud >=2097152 and <=67108863 (3FFFFFF hex) then UTF-8 is 5 bytes long.
   byte 1 = 248 + (ud div 16777216)
   byte 2 = 128 + ((ud div 262144) mod 64)
   byte 3 = 128 + ((ud div 4096) mod 64)
   byte 4 = 128 + ((ud div 64) mod 64)
   byte 5 = 128 + (ud mod 64)

If ud >=67108864 and <=2147483647 (7FFFFFFF hex) then UTF-8 is 6 bytes long.
   byte 1 = 252 + (ud div 1073741824)
   byte 2 = 128 + ((ud div 16777216) mod 64)
   byte 3 = 128 + ((ud div 262144) mod 64)
   byte 4 = 128 + ((ud div 4096) mod 64)
   byte 5 = 128 + ((ud div 64) mod 64)
   byte 6 = 128 + (ud mod 64)
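The ranges and formulas above can be sketched in Python, where `//` plays the role of div and `%` the role of mod. This is only an illustration under stated assumptions: the function name `utf8_encode_arith` is made up for this example, and it follows the blurb's table as written, including the 5- and 6-byte forms, which current UTF-8 (RFC 3629) no longer allows.

```python
def utf8_encode_arith(ud):
    """Encode one code point (hypothetical helper) using only div and mod."""
    if ud < 0 or ud > 2147483647:
        raise ValueError("value out of range for this scheme")
    if ud < 128:
        return [ud]
    # Pick the sequence length and first-byte offset from the ranges above.
    if ud <= 2047:
        length, lead = 2, 192
    elif ud <= 65535:
        length, lead = 3, 224
    elif ud <= 2097151:
        length, lead = 4, 240
    elif ud <= 67108863:
        length, lead = 5, 248
    else:
        length, lead = 6, 252
    out = []
    # Peel off the trailing bytes from the low end: 128 + (ud mod 64).
    for _ in range(length - 1):
        out.append(128 + ud % 64)
        ud //= 64
    # What is left equals ud div 64^(length-1); add the first-byte offset.
    out.append(lead + ud)
    out.reverse()
    return out


print(utf8_encode_arith(8482))  # [226, 132, 162]
```

The loop is just the table read from the last byte to the first, so each row of the table does not need to be written out separately.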


The operation div means integer division and mod means the remainder after integer division.
For positive numbers a div b = int(a/b) and a mod b = (a/b-int(a/b))*b.
UTF-8 sequences of 4 bytes and longer are at the moment not supported by regular browsers.
The highest character position currently (Unicode 3.2) defined is number 10FFFF hex (1114111 dec) in a 'private use' area. The highest character with an actual glyph is number E007F hex (917631 dec), the CANCEL TAG character.
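For comparison with the div and mod formulas, here is a rough sketch of the bitmask and bitshift form the blurb calls the proper way, for the 3-byte case only. Dividing by 64 is the same as shifting right by 6 bits, and mod 64 is the same as masking with 0x3F. The function name `encode_3byte_bits` is an assumption made for this example.

```python
def encode_3byte_bits(ud):
    """3-byte case (hypothetical helper) with shifts and masks."""
    if not 2048 <= ud <= 65535:
        raise ValueError("not in the 3-byte range")
    return [
        0xE0 | (ud >> 12),          # same as 224 + (ud div 4096)
        0x80 | ((ud >> 6) & 0x3F),  # same as 128 + ((ud div 64) mod 64)
        0x80 | (ud & 0x3F),         # same as 128 + (ud mod 64)
    ]


print(encode_3byte_bits(8482))  # [226, 132, 162]
```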


From UTF-8 to Unicode UCS-4:
Let's take a UTF-8 byte sequence. The first byte in a new sequence tells us how long the sequence is. Let's call the successive bytes, as decimal values, z y x w v u.


If z is between and including 0 - 127, then there is 1 byte z. The decimal Unicode value ud = the value of z.

If z is between and including 192 - 223, then there are 2 bytes z y; ud = (z-192)*64 + (y-128)

If z is between and including 224 - 239, then there are 3 bytes z y x; ud = (z-224)*4096 + (y-128)*64 + (x-128)

If z is between and including 240 - 247, then there are 4 bytes z y x w; ud = (z-240)*262144 + (y-128)*4096 + (x-128)*64 + (w-128)

If z is between and including 248 - 251, then there are 5 bytes z y x w v; ud = (z-248)*16777216 + (y-128)*262144 + (x-128)*4096 + (w-128)*64 + (v-128)

If z is 252 or 253, then there are 6 bytes z y x w v u; ud = (z-252)*1073741824 + (y-128)*16777216 + (x-128)*262144 + (w-128)*4096 + (v-128)*64 + (u-128)

If z is between and including 128 - 191 (these values only occur as second and later bytes), or z = 254 or 255, then there is something wrong!
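The decoding rules above can be sketched the same way. This is a minimal illustration, not a full validator: the name `utf8_decode_arith` is made up for this example, the input is assumed to be exactly one sequence given as a list of decimal byte values, and overlong forms are not rejected.

```python
def utf8_decode_arith(seq):
    """Decode one UTF-8 sequence (list of byte values) to its decimal value."""
    z = seq[0]
    # The first byte fixes the length and the offset to subtract from it.
    if 0 <= z <= 127:
        length, lead = 1, 0
    elif 192 <= z <= 223:
        length, lead = 2, 192
    elif 224 <= z <= 239:
        length, lead = 3, 224
    elif 240 <= z <= 247:
        length, lead = 4, 240
    elif 248 <= z <= 251:
        length, lead = 5, 248
    elif 252 <= z <= 253:
        length, lead = 6, 252
    else:
        raise ValueError("not a valid first byte: %d" % z)
    if len(seq) != length:
        raise ValueError("expected %d bytes, got %d" % (length, len(seq)))
    ud = z - lead
    for b in seq[1:]:
        if not 128 <= b <= 191:
            raise ValueError("not a valid trailing byte: %d" % b)
        # Multiplying the running total by 64 at each step gives the same
        # result as the written-out sums with 64, 4096, 262144, ...
        ud = ud * 64 + (b - 128)
    return ud


print(utf8_decode_arith([226, 132, 162]))  # 8482
```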

Example: take the decimal Unicode designation 8482, which is for the trademark sign. This number lies between 2048 and 65535, so we get three numbers.
The first number is 224 + (8482 div 4096) = 224 + 2 = 226.
The second number is 128 + ((8482 div 64) mod 64) = 128 + (132 mod 64) = 128 + 4 = 132.
The third number is 128 + (8482 mod 64) = 128 + 34 = 162.
Now the other way round. We see the numbers 226, 132 and 162. What is the decimal Unicode value?
In this case: (226-224)*4096+(132-128)*64+(162-128) = 8482.
And the conversion between hexadecimal and decimal? Come on, this is not a math tutorial! In case you don't know, use a calculator.
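As a quick cross-check of the worked example, Python's built-in UTF-8 codec gives the same three numbers for the trademark sign and recovers 8482 from them. The variable names here are chosen only for illustration.

```python
tm = chr(8482)                       # the trademark sign
encoded = list(tm.encode("utf-8"))   # bytes as decimal values
print(encoded)                       # [226, 132, 162]

decoded = ord(bytes([226, 132, 162]).decode("utf-8"))
print(decoded)                       # 8482
```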


...

Also

http://en.wikipedia.org/wiki/UTF-8

On Aug 29, 2005, at 06:20, Ratan Bhangale wrote:

Dear All,



We are building an application on Mac OS using the Cocoa framework and Objective-C.
All messages that are displayed dynamically to the user should be
encoded using UTF-8. We would like some sample code showing how
to convert plain text or Unicode to UTF-8.




Thanks and Regards

Ratan Bhangale

 _______________________________________________
Do not post admin requests to the list. They will be ignored.
Cocoa-dev mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden

