Re: "read from" and non-lo-ascii characters
- Subject: Re: "read from" and non-lo-ascii characters
- From: Chris Page <email@hidden>
- Date: Wed, 22 Jun 2005 01:32:33 -0700
On Jun 21, 2005, at 16:48, Christopher Nebel wrote:
On Jun 21, 2005, at 3:07 PM, Matt Neuburg wrote:
P.S.: Obligatory pedantry: there is no such thing as "high" or "low"
ASCII. ASCII defines interpretations for bytes in the range
0...0x7F. If it's not in that range, it's not ASCII.
PPS: Obligatory pedantry from a professional linguist: Language is
usage. The terms lo-ascii and hi-ascii were used consistently and
meaningfully throughout the 80s and 90s. Furthermore, you knew
*exactly* what my terminology meant, thus bearing witness against
yourself. The defense rests.
PPPS: I may be a descriptive grammarian at heart, but most words do
not have their meanings formally specified by a national standards
body. ASCII does. I may have known what you meant, but it's still
technically wrong. =) "Non-ASCII", however, would be correct.
Since this is one of my hot-button items, I'll chime in, too: "hi" or
"low" ASCII are misleading terms for "not ASCII". When someone says
"high ASCII" what they most often meant historically was specifically
some character set, such as "the Commodore 64 character set", "the
MS-DOS character set", "the Atari character set", or "the MacRoman
character set". That is, it is an ambiguous term that can lead to
misunderstanding.
Worse, in the case of Mac programming, the speaker often really meant
"I'm assuming it's MacRoman" when that assumption was in fact wrong.
Furthermore, most of the interesting modern character sets / encodings
do not store all characters in a single byte, yet the term "high ASCII"
at least historically almost always implied "the values 128-255, and
stored in a single 8-bit byte, which, conveniently, could also store
all of the ASCII values".
In this day and age we have some really useful and commonly available
character sets and encodings with specific names, and it's more useful
to know exactly which one is under discussion and to remove room for
erroneous assumptions about the encoding or the storage size. Or, at
the very least, it's important to say "non-ASCII" when you mean
"anything other than ASCII".
This precision is important when discussing AppleScript and Mac OS
programming in particular, where text was often -- but not always --
stored as MacRoman or Shift-JIS, but the encoding information was
rarely explicit and so you had to keep track of (or assume) a
particular encoding if you cared to interpret the characters correctly.
The modern Mac OS, including AppleScript, supports Unicode, which fixes
some of the ills of the past, and now it's even more important to be
clear about which kind of text and characters you're talking about.
Just to paint a clear picture of how commonly used encodings differ and
why it's important to be specific, the most common character sets /
encodings in Mac programming are:
- ASCII: 7 bits per character, 128 characters
- MacRoman: 8 bits per character, the first 128 values are the same as
ASCII; even when restricting the discussion to Mac OS you must remain
aware that MacRoman is indistinguishable from other possible 8-bit
character sets long supported on the Mac, where even the values 0-127
aren't always the same as ASCII, so assuming MacRoman even when you are
correct that you are dealing with 8-bit characters isn't safe
- Shift-JIS: one or more 8-bit bytes per character, the first 128
values are the same as ASCII, values 128-255 are different from
MacRoman, additional bytes of a multi-byte character may contain values
0-255 yet do not represent any ASCII or MacRoman characters
- ISO-Latin1: 8 bits per character, the first 128 values are the same
as ASCII, 128-255 are not the same as in MacRoman or Shift-JIS
- Unicode: 21 bits per code point, characters consist of one or more
code points, code points can be stored as one or more one-byte,
two-byte, or four-byte "code units" (UTF-8/-16/-32), the first 256
values are the same as ISO-Latin1, additional bytes of a multi-byte
sequence may contain values in the range 0-255 yet do not represent any
ASCII, MacRoman, or ISO-Latin1 characters
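
The differences in the list above can be seen in a small sketch. Python
is used here only for illustration (it is not part of the original
discussion), and the codec names "latin-1", "mac_roman", and "shift_jis"
are those of Python's standard library. The byte values were picked by
hand as examples.

```python
# The same byte means different things under different assumed encodings.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")      # U+00E9, e with acute accent
as_macroman = raw.decode("mac_roman")  # U+00C8, E with grave accent

# ASCII only defines 0x00-0x7F, so this byte is simply not ASCII.
try:
    raw.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False

# A two-byte Shift-JIS character whose second byte is 0x41, the value
# ASCII uses for "A". Read as Shift-JIS it is a single katakana character.
sjis = b"\x83\x41"
as_sjis = sjis.decode("shift_jis")           # U+30A2, one character
sjis_as_macroman = sjis.decode("mac_roman")  # U+00C9 followed by "A"

# One Unicode code point outside the first 65,536, stored three ways.
clef = "\U0001D11E"
utf8_len = len(clef.encode("utf-8"))       # four one-byte code units
utf16_len = len(clef.encode("utf-16-be"))  # two two-byte code units
utf32_len = len(clef.encode("utf-32-be"))  # one four-byte code unit
```

Nothing in the bytes themselves says which reading is right, which is
why the encoding has to be named or tracked separately.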
I'm hoping all this detail will convince everyone that being vague
about encodings -- and in particular, using the
not-as-neutral-as-you-might-think, assumption-laden "high ASCII" -- is
fraught with peril. Not, perhaps, as perilous as The Dreaded
Three-headed Knight, the fiercest creature for yards around, but much
too perilous, nonetheless.
--
Chris Page - Software Wrangler - Dylan Pundit
An ASCII character walks into a bar. Bartender asks, “What’ll you
have?” ASCII character says, “Give me a double.” Bartender asks,
“Having a bad day?” ASCII character says, “Yeah, I have a parity
error.” Bartender says, “Hmmm. I thought you looked a bit off.”