Re: Writing to file as UTF8 with BOM ?
- Subject: Re: Writing to file as UTF8 with BOM ?
- From: "Mark J. Reed" <email@hidden>
- Date: Mon, 30 Oct 2006 10:31:30 -0500
On 10/30/06, Yvon Thoraval <email@hidden> wrote:
> yes i agree with that except their are repertoire wider than other isn't it ?
> i thought utf-16, using more bytes than utf-8 in your analogy, is a wider
> repertoire than utf-8 ?
NO.
All UTFs are able to encode *EXACTLY* the same set of characters.
That's the whole point of Unicode. The UTFs specify the nit-picky
details about how to actually represent Unicode text in bytes, but the
Unicode text itself is a stream of characters, and it is not affected
by the encoding method used.
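To make that concrete, here is a small Python sketch (not something from the original thread; the variable names are just for illustration). It encodes one string with several UTFs and checks that decoding always gives back the same text; only the byte counts differ.

```python
# Sample text: one character from each size class discussed in this message.
text = "A\u00E7\u0905\U00010000"

codecs = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

# Encode with each UTF and record how many bytes each one needs.
sizes = {name: len(text.encode(name)) for name in codecs}

# Every UTF round-trips to exactly the same string.
round_trips = all(text.encode(name).decode(name) == text for name in codecs)

print(sizes)
print("all round-trip:", round_trips)
```

For this particular string UTF-8 and UTF-16 happen to need the same number of bytes; that is a coincidence of the sample, not a general rule.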
UTF-8 is an 8-bit encoding only in the sense that it's defined in
terms of bytes. That doesn't mean that each character takes up only
one byte. Depending on the character, it may take up one, two, three,
or four bytes.
Example UTF-8 representations:
U+0041 LATIN CAPITAL LETTER A ('A'): one byte, value 0x41
U+00E7 LATIN SMALL LETTER C WITH CEDILLA ('ç'): two bytes, values 0xC3 0xA7
U+0905 DEVANAGARI LETTER A ('अ'): three bytes, values 0xE0 0xA4 0x85
U+10000 LINEAR B SYLLABLE A ('𐀀'): four bytes, values 0xF0 0x90 0x80 0x80
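If you want to check those byte sequences yourself, Python's built-in UTF-8 codec will do it. This is only a sketch; the helper name utf8_hex is made up for the example.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated hex."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))

# The four example characters from the list above.
for ch in ("\u0041", "\u00E7", "\u0905", "\U00010000"):
    print(f"U+{ord(ch):04X}: {utf8_hex(ch)}")
```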
UTF-16 is defined in terms of 16-bit 'short words' (two bytes). That
doesn't mean that each character takes up only one word, however.
Depending on the character it may take up one or two words:
U+0041 LATIN CAPITAL LETTER A ('A'): one word, value 0x0041
U+00E7 LATIN SMALL LETTER C WITH CEDILLA ('ç'): one word, value 0x00E7
U+0905 DEVANAGARI LETTER A ('अ'): one word, value 0x0905
U+10000 LINEAR B SYLLABLE A ('𐀀'): two words, values 0xD800 0xDC00
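The two-word case follows a fixed rule: subtract 0x10000 from the code point, then split the remaining 20 bits between a high word starting at 0xD800 and a low word starting at 0xDC00. A rough Python sketch of that arithmetic (utf16_units is a hypothetical helper, and it does not reject the surrogate range 0xD800-0xDFFF, which a real encoder must):

```python
def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit words UTF-16 uses for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        # Characters below U+10000 fit in a single word.
        return [cp]
    offset = cp - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return [high, low]

for ch in ("\u0041", "\u00E7", "\u0905", "\U00010000"):
    words = ", ".join(f"0x{w:04X}" for w in utf16_units(ch))
    print(f"U+{ord(ch):04X}: {words}")
```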
UTF-16 by itself just defines that sequence of 16-bit values; it has
nothing to say about how they're physically stored as bytes. For
example, the word 0x0041 may become (0x00, 0x41) - which we call
"big-endian" - or (0x41, 0x00) - which we call "little-endian". The
designations "UTF-16BE" and "UTF-16LE" are used to refer to sequences
of bytes that are meant to be interpreted as UTF-16 words according to
the indicated ordering.
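In Python the two byte orders are available as the "utf-16-be" and "utf-16-le" codecs. The sketch below shows the same word 0x0041 stored both ways, and what happens if the bytes are read back with the wrong ordering (the variable names are only illustrative):

```python
text = "A"  # U+0041, a single UTF-16 word 0x0041

big = text.encode("utf-16-be")      # high byte first
little = text.encode("utf-16-le")   # low byte first

print(big.hex(" "))     # 00 41
print(little.hex(" "))  # 41 00

# Reading big-endian bytes as little-endian yields a different character,
# because the word is now interpreted as 0x4100 instead of 0x0041.
misread = big.decode("utf-16-le")
print(f"U+{ord(misread):04X}")
```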
UTF-32 is defined in terms of 32-bit 'long words' (four bytes). In
this case, every character consists of exactly four bytes, and the
high byte is always zero. So it's not very space-efficient, but it is
computationally more efficient than a hypothetical UTF-24 because most
computers are designed to treat 4-byte quantities as a unit; on such
computers, dealing with three-byte words three bytes apart requires a
lot more work on the part of the CPU.
Examples:
U+0041 LATIN CAPITAL LETTER A ('A'): one word, value 0x00000041
U+00E7 LATIN SMALL LETTER C WITH CEDILLA ('ç'): one word, value 0x000000E7
U+0905 DEVANAGARI LETTER A ('अ'): one word, value 0x00000905
U+10000 LINEAR B SYLLABLE A ('𐀀'): one word, value 0x00010000
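A short Python sketch to confirm that UTF-32 is simply the code point itself, one 32-bit word per character. The helper name utf32_units is an assumption for the example; it encodes as big-endian and unpacks the result into integers.

```python
import struct

def utf32_units(text: str) -> list[int]:
    """Return the 32-bit words UTF-32 uses for a string, one per character."""
    data = text.encode("utf-32-be")
    count = len(data) // 4
    # ">" = big-endian, "I" = unsigned 32-bit integer
    return list(struct.unpack(f">{count}I", data))

sample = "A\u00E7\u0905\U00010000"
for ch, word in zip(sample, utf32_units(sample)):
    print(f"U+{ord(ch):04X}: 0x{word:08X}")
```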
Like UTF-16, UTF-32 by itself just defines that sequence of 32-bit
values and has nothing to say about how they're physically stored as
bytes. For example, the word 0x00000041 may become (0x00, 0x00, 0x00,
0x41) - which we call "big-endian" - or (0x41, 0x00, 0x00, 0x00) -
which we call "little-endian". The designations "UTF-32BE" and
"UTF-32LE" are used to refer to sequences of bytes that are meant to
be interpreted as UTF-32 words according to the indicated ordering.
In the case of UTF-32 there are other possibilities, wherein the bytes
within a short word are arranged in big-endian order while the short
words within a long word are arranged in little-endian order, or
vice-versa; these don't have short names and are of mostly
historical/theoretical interest these days.
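For the curious, the four layouts can be sketched in Python with the struct module. The function name and the dictionary labels are invented for this example; only the first two layouts correspond to named encodings (UTF-32BE and UTF-32LE).

```python
import struct

def utf32_layouts(code_point: int) -> dict[str, bytes]:
    """Show one 32-bit word in the four byte arrangements described above."""
    high = code_point >> 16      # upper short word
    low = code_point & 0xFFFF    # lower short word
    return {
        "big-endian": struct.pack(">I", code_point),
        "little-endian": struct.pack("<I", code_point),
        # bytes big-endian within each short word, low short word first
        "BE bytes, LE shorts": struct.pack(">HH", low, high),
        # bytes little-endian within each short word, high short word first
        "LE bytes, BE shorts": struct.pack("<HH", high, low),
    }

for name, data in utf32_layouts(0x00000041).items():
    print(f"{name:20} {data.hex(' ')}")
```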
--
Mark J. Reed <email@hidden>
AppleScript-Users mailing list (email@hidden)