Re: Writing to file as UTF8 with BOM ?
- Subject: Re: Writing to file as UTF8 with BOM ?
- From: "Mark J. Reed" <email@hidden>
- Date: Mon, 30 Oct 2006 12:37:24 -0500
On 10/30/06, Yvon Thoraval <email@hidden> wrote:
OK, then I don't need to use UTF-16 for ancient Chinese.
Right. Any UTF will work for any characters, including the ones in
the supplementary planes, such as some uncommon/obsolete Han
characters, Linear B, musical notes...
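To make that concrete, here is a small Python sketch (Python is my choice for illustration, not something from the thread) that encodes one supplementary-plane character, the musical G clef at U+1D11E, in all three UTFs and decodes it back. The variable names are made up for the example.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"

# Encode the same code point in each UTF (big-endian, no BOM, for the
# 16- and 32-bit forms so the byte values are easy to read).
utf8 = clef.encode("utf-8")       # four bytes: F0 9D 84 9E
utf16 = clef.encode("utf-16-be")  # surrogate pair: D834 DD1E
utf32 = clef.encode("utf-32-be")  # one 32-bit unit: 0001D11E

# Every form decodes back to the same single character.
decoded = [
    utf8.decode("utf-8"),
    utf16.decode("utf-16-be"),
    utf32.decode("utf-32-be"),
]

print(utf8.hex(), utf16.hex(), utf32.hex())
print(all(d == clef for d in decoded))
```

For this particular character all three encodings happen to use four bytes; the difference is only in how those bytes are arranged.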
maybe the question of the word length is a question of speed with a given
computer,
Yes.
I thought modern computers are able to address at the byte level
even if the word in RAM is a multibyte one?
It's true that modern computers can typically read a single byte from
(or write a single byte to) any address. But they can also read or
write two bytes at once at any even address, and four bytes at once at
any address that's a multiple of four. (64-bit CPUs can also read or
write eight bytes at once at any address that's a multiple of eight.)
So in UTF-32, a character may typically be read or written in a
single memory transaction. If, instead, there were only three bytes
per character, it would take two memory transactions: a two-byte
followed by a single-byte if the address is even, or a single-byte
followed by a two-byte if the address is odd. Alternatively, the code
may abandon that small efficiency gain to avoid a special case on the
parity of the pointer, and just always read or write a single byte at
a time, in which case it becomes three transactions per character.
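The counting above can be sketched as a toy model in Python. This is only an illustration of the arithmetic, assuming a machine that can do 1-, 2- and 4-byte accesses and only at addresses that are multiples of the access size; the function name `transactions` and the `max_width` parameter are invented for the example, and real hardware is more complicated.

```python
def transactions(address, nbytes, max_width=4):
    """Count aligned accesses needed to cover nbytes starting at address.

    Toy model: an access of size s (1, 2, 4, ...) is only allowed at an
    address that is a multiple of s. At each step, take the widest
    access that is aligned here and does not overshoot.
    """
    count = 0
    while nbytes > 0:
        size = max_width
        while size > 1 and (address % size != 0 or size > nbytes):
            size //= 2
        address += size
        nbytes -= size
        count += 1
    return count


print(transactions(8, 4))               # aligned four-byte character
print(transactions(8, 3))               # three bytes at an even address
print(transactions(9, 3))               # three bytes at an odd address
print(transactions(8, 3, max_width=1))  # always one byte at a time
```

Under this model a four-byte character at a multiple of four needs one access, a three-byte character needs two whichever way the address falls, and byte-at-a-time code needs three.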
UTF-8 is only useful because of indianness then ...
1. It's "endianness", with an E, because it concerns which "end" of the
word is at the lower address (the little end, or least-significant
bits, or the big end, with the most significant bits). Indians,
whether of the Asian or American variety, have nothing to do with it,
although the word "Indian" no doubt influenced Mr. Swift when he
coined "endian" while writing Gulliver's Travels.
2. UTF-8 is useful for several reasons, not least of which is that
it's backwards compatible with ASCII: a 7-bit ASCII text file is,
without any modifications whatsoever, a perfectly legal UTF-8 text
file.
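Both points are easy to check. The following Python sketch (again my own illustration, with made-up variable names) shows the two byte orders of UTF-16 for a single character, confirms that plain ASCII bytes decode identically as UTF-8, and, since the thread subject asks about it, shows the three-byte signature that Python's "utf-8-sig" codec prepends.

```python
import codecs

text = "A"  # U+0041

# UTF-16 stores the 16-bit unit 0x0041 in one of two byte orders.
little = text.encode("utf-16-le")  # low-order byte first
big = text.encode("utf-16-be")     # high-order byte first

# UTF-8 is defined as a byte sequence, so it has no byte-order variants,
# and 7-bit ASCII bytes are already valid UTF-8 with the same meaning.
ascii_bytes = b"plain 7-bit ASCII text"
as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")

# The "utf-8-sig" codec writes the signature EF BB BF before the data.
with_bom = text.encode("utf-8-sig")

print(little, big)
print(as_ascii == as_utf8)
print(with_bom.startswith(codecs.BOM_UTF8))
```

Note that the UTF-8 signature does not indicate byte order; it only marks the data as UTF-8, and a file that starts with it is no longer pure ASCII.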
UTF-8 is also reasonably compact for Latin-based scripts. It starts
losing the size battle to UTF-16 for the scripts of the Subcontinent
and the Far East, which is why there are things like SCSU
(http://www.unicode.org/reports/tr6/) that let you shift the position
of the subset of characters that's representable with single-byte
values.
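A rough size comparison in Python, using three short sample strings I picked for illustration (they are not from the thread): an English greeting, the Hindi word "namaste" in Devanagari, and a four-character Chinese phrase. The helper name `sizes` is hypothetical.

```python
def sizes(s):
    """Return (UTF-8 byte count, UTF-16 byte count) for s, without a BOM."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


samples = {
    "English": "Hello, world",
    # Devanagari "namaste": six code points in the U+0900 block.
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
    # Four CJK ideographs from the Basic Multilingual Plane.
    "Chinese": "\u4f60\u597d\u4e16\u754c",
}

for name, s in samples.items():
    u8, u16 = sizes(s)
    print(f"{name}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

ASCII letters take one byte each in UTF-8 against two in UTF-16, while the Devanagari and Han characters here take three bytes each in UTF-8 against two in UTF-16.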
--
Mark J. Reed <email@hidden>