Re: changing international text to unicode text
- Subject: Re: changing international text to unicode text
- From: Christopher Nebel <email@hidden>
- Date: Sun, 19 Dec 2004 21:42:05 -0800
On Dec 19, 2004, at 12:29 PM, Joseph Weaks wrote:
On Dec 19, 2004, at 9:33 AM, Emmanuel wrote:
...we've tried to summarize the main issues we are aware of at:
<http://www.satimage-software.com/en/unicode_and_applescript.com>
or even better:
http://www.satimage.fr/software/en/unicode_and_applescript.html
It's also slightly wrong in places. Specifically:
The string class basically stores one byte ([0..255]) per character.
The 128 first values are rendered according to the ASCII standard, for
instance ASCII character of 37 is the percent sign %. The 128 larger
values are rendered using the macintosh encoding, for instance ASCII
character of 150 is ñ. We refer to this encoding as the Mac-encoding.
Actually, it uses the "primary" Script Manager encoding, that is, the
one that goes with the first language listed in your International
preference pane. For most US and Western European users, this will be
MacRoman, in which case the rest is correct. However, other locales
use other encodings: Japanese, for example, would use MacJapanese (a
slightly enhanced Shift-JIS), which uses a mix of one- and two-byte
characters and is not isomorphic to either MacRoman or ASCII. (0x5C in
MacJapanese is a yen sign, not a backslash.)
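To make the point concrete, here is a minimal sketch in Python (not AppleScript), assuming Python's built-in "mac_roman" and "mac_cyrillic" codecs are reasonable stand-ins for two possible primary Script Manager encodings. The variable names are illustrative only. It shows that the same byte 150 yields a different character depending on which encoding is in effect, while the ASCII-range byte 37 does not change between these two.

```python
# Illustration only: Python codecs stand in for Script Manager encodings.
raw = bytes([37, 150])  # one byte per character, as in the string class

# Interpret the same two bytes under two different Mac encodings.
as_mac_roman = raw.decode("mac_roman")
as_mac_cyrillic = raw.decode("mac_cyrillic")

print(repr(as_mac_roman))     # '%ñ'
print(repr(as_mac_cyrillic))  # '%' followed by a Cyrillic letter, not 'ñ'
```

The bytes never change; only the encoding used to read them does, which is why "ASCII character of 150 is ñ" holds only when the primary encoding is MacRoman.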
There are cases where a string may store a more complex entity, we do
not address them here.
I assume you're talking about styled text here, which in fact you do
talk about later...
The Unicode text class stores two bytes or more per character, using
the UTF-16 encoding.
This happens to be true, but isn't really relevant -- it's an
implementation detail. All you really know is that a single
"character" of a Unicode text object is one Unicode code point. (Which
might be more than one UTF-16 word.) It's true that for Apple Event
Manager purposes (and therefore "write"), "Unicode text" does imply
UTF-16.
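The distinction between code points and UTF-16 words can be sketched in Python (again, not AppleScript). The helper name utf16_word_count is hypothetical, and U+1D11E is just a convenient example of a character outside the Basic Multilingual Plane.

```python
def utf16_word_count(text: str) -> int:
    """Number of 16-bit words needed to store text as UTF-16 (no BOM)."""
    return len(text.encode("utf-16-be")) // 2

clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, needs a surrogate pair
enye = "\u00F1"       # ñ, fits in a single 16-bit word

print(len(clef), utf16_word_count(clef))  # 1 2
print(len(enye), utf16_word_count(enye))  # 1 1
```

One character in the "Unicode text" sense corresponds to len() here; the word count only matters once the text is serialized as UTF-16.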
Be aware that the [Unicode text] file has to begin with ASCII 254,
ASCII 255.
This is not strictly true, but it will help downstream consumers.
Without the leading byte-order mark (BOM, hex FE FF), they won't be
able to automatically tell that the file is UTF-16; you'll have to tell
them manually. (If the consumer relies on the BOM, then it's
effectively required.)
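As a rough illustration in Python (not AppleScript), assuming big-endian UTF-16 as implied by the bytes 254, 255: the helper name encode_with_bom is hypothetical. With the BOM present, a generic "utf-16" decoder can work out the byte order by itself; without it, the reader has to be told.

```python
import codecs

def encode_with_bom(text: str) -> bytes:
    """Big-endian UTF-16 preceded by the FE FF byte-order mark."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

with_bom = encode_with_bom("ñ")
without_bom = "ñ".encode("utf-16-be")

print(list(with_bom[:2]))             # [254, 255]

# With a BOM, the generic decoder detects the byte order and strips the mark.
print(with_bom.decode("utf-16"))      # ñ

# Without a BOM, the byte order must be stated explicitly.
print(without_bom.decode("utf-16-be"))  # ñ
```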
Since there is no tag which would specify whether a given file is
ASCII or UTF-8 ...
Actually there is -- the UTF-8 byte-order mark (hex EF BB BF), which
you mention above -- but most providers don't use it.
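A consumer that does want to honor these tags might sniff the first bytes along these lines. This is a simplified Python sketch, not anything AppleScript does itself: the function name sniff_bom is hypothetical, and it ignores UTF-32, whose little-endian mark also begins with FF FE.

```python
import codecs
from typing import Optional

def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding name if data starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    return None  # untagged: could be ASCII, UTF-8, MacRoman, ...

tagged = codecs.BOM_UTF8 + "ñ".encode("utf-8")
untagged = "ñ".encode("utf-8")

print(sniff_bom(tagged))           # utf-8
print(sniff_bom(untagged))         # None
print(tagged.decode("utf-8-sig"))  # ñ
```

The None case is the common one in practice, since most UTF-8 files are written without the tag.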
[I]n some circumstances where an application is really expecting a
regular string, you may get an AppleScript error if you pass such a
quantity.
Not a correction, but just so you know, any application that
specifically requires a descriptor of typeText has a bug.
--Chris Nebel
AppleScript Engineering