Re: "read from" and non-lo-ascii characters

Subject: Re: "read from" and non-lo-ascii characters
From: Chris Page <email@hidden>
Date: Wed, 22 Jun 2005 01:32:33 -0700

On Jun 21, 2005, at 16:48, Christopher Nebel wrote:

On Jun 21, 2005, at 3:07 PM, Matt Neuburg wrote:
P.S.: Obligatory pedantry: there is no such thing as "high" or "low" ASCII. ASCII defines interpretations for bytes in the range 0...0x7F. If it's not in that range, it's not ASCII.
PPS: Obligatory pedantry from a professional linguist: Language is usage. The terms lo-ascii and hi-ascii were used consistently and meaningfully throughout the 80s and 90s. Furthermore, you knew *exactly* what my terminology meant, thus bearing witness against yourself. The defense rests.
PPPS: I may be a descriptive grammarian at heart, but most words do not have their meanings formally specified by a national standards body. ASCII does. I may have known what you meant, but it's still technically wrong. =) "Non-ASCII", however, would be correct.

Since this is one of my hot-button items, I'll chime in, too: "hi" or "low" ASCII are misleading terms for "not ASCII". When someone said "high ASCII", what they most often meant historically was some specific character set, such as "the Commodore 64 character set", "the MS-DOS character set", "the Atari character set", or "the MacRoman character set"; i.e., it is an ambiguous term that can lead to misunderstanding.

Worse, in the case of Mac programming, the speaker often really meant "I'm assuming it's MacRoman" when they in fact were wrong.

Furthermore, most of the interesting modern character sets / encodings do not store all characters in a single byte, yet the term "high ASCII" at least historically almost always implied "the values 128-255, and stored in a single 8-bit byte, which, conveniently, could also store all of the ASCII values".
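
To make the storage-size point concrete, here is a minimal Python sketch. Python and its codec names (mac_roman, latin_1, shift_jis, and so on) are an assumption chosen for illustration, not something this thread depends on, and the helper name encoded_sizes is hypothetical. It reports how many bytes a single character occupies in each encoding, or None when that encoding cannot represent the character at all.

```python
# Hypothetical helper: byte count of one character in each encoding,
# or None when the encoding has no representation for it.
def encoded_sizes(ch, encodings):
    sizes = {}
    for enc in encodings:
        try:
            sizes[enc] = len(ch.encode(enc))
        except UnicodeEncodeError:
            sizes[enc] = None
    return sizes


ENCODINGS = ("ascii", "mac_roman", "latin_1", "shift_jis",
             "utf_8", "utf_16_be", "utf_32_be")

# "A" (ASCII), U+00E9 (e with acute accent), U+65E5 (a kanji)
for ch in ("A", "\u00e9", "\u65e5"):
    print(repr(ch), encoded_sizes(ch, ENCODINGS))
```

Only the plain "A" is one byte nearly everywhere. The accented letter is one byte in MacRoman and ISO-Latin1 but two in UTF-8, and the kanji takes two bytes in Shift-JIS and three in UTF-8, so "one character, one byte, value 128-255" holds only for the legacy 8-bit sets.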

In this day and age we have some really useful and commonly available character sets and encodings with specific names, and it's more useful to know exactly which one is under discussion and to remove room for erroneous assumptions about the encoding or the storage size. Or, at the very least, it's important to say "non-ASCII" when you mean "anything other than ASCII".

This precision is important when discussing AppleScript and Mac OS programming in particular, where text was often -- but not always -- stored as MacRoman or Shift-JIS, but the encoding information was rarely explicit and so you had to keep track of (or assume) a particular encoding if you cared to interpret the characters correctly. The modern Mac OS, including AppleScript, supports Unicode, which fixes some of the ills of the past, and now it's even more important to be clear about which kind of text and characters you're talking about.

Just to paint a clear picture of how commonly used encodings differ and why it's important to be specific, the most common character sets / encodings in Mac programming are:

- ASCII: 7 bits per character, 128 characters

- MacRoman: 8 bits per character, the first 128 values are the same as ASCII; even when restricting the discussion to Mac OS you must remain aware that MacRoman text is indistinguishable, byte for byte, from text in the other 8-bit character sets long supported on the Mac, where even the values 0-127 aren't always the same as ASCII, so assuming MacRoman isn't safe even when you are correct that you are dealing with 8-bit characters

- Shift-JIS: one or two 8-bit bytes per character, the first 128 values are almost the same as ASCII (0x5C and 0x7E may differ), values 128-255 are different from MacRoman, the second byte of a two-byte character may fall in either the ASCII range or the 128-255 range yet does not represent any ASCII or MacRoman character

- ISO-Latin1: 8 bits per character, the first 128 values are the same as ASCII, 128-255 are not the same as in MacRoman or Shift-JIS

- Unicode: 21 bits per code point (U+0000 through U+10FFFF), characters consist of one or more code points, code points can be stored as one or more one-byte, two-byte, or four-byte "code units" (UTF-8/-16/-32), the first 256 code points are the same as ISO-Latin1, additional bytes of a multi-byte sequence may contain values in the range 0-255 yet do not represent any ASCII, MacRoman, or ISO-Latin1 characters
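
The differences in the list above can be seen by taking one and the same byte and reading it under several encodings. The following is only a sketch, again assuming Python and its built-in codec names; decode_or_none is a hypothetical helper that returns None when the bytes are not valid in the given encoding.

```python
# Hypothetical helper: decoded text, or None if the bytes are invalid
# (or incomplete) in that encoding.
def decode_or_none(data, encoding):
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


ENCODINGS = ("ascii", "mac_roman", "latin_1", "shift_jis", "utf_8")

# One so-called "high ASCII" byte: 0xE9 (233).
raw = bytes([0xE9])
readings = {enc: decode_or_none(raw, enc) for enc in ENCODINGS}
for enc, text in readings.items():
    print(f"{enc:10} {text!r}")

# The second byte of a Shift-JIS character can land in the ASCII range:
# U+672C encodes as 0x96 0x7B, and 0x7B on its own would be "{" in ASCII.
hon = "\u672c".encode("shift_jis")
print(" ".join(f"{b:02x}" for b in hon))

# The largest Unicode code point, U+10FFFF, needs 21 bits.
print((0x10FFFF).bit_length())
```

The single byte 0xE9 is an accented capital E (U+00C8) in MacRoman and an accented lowercase e (U+00E9) in ISO-Latin1; it is not valid ASCII at all, and on its own it is an incomplete sequence in both Shift-JIS and UTF-8. Knowing only that a byte is "high" tells you none of this.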

I'm hoping all this detail will convince everyone that being vague about encodings -- and in particular, using the not-as-neutral-as-you-might-think, assumption-laden "high ASCII" -- is fraught with peril. Not, perhaps, as perilous as The Dreaded Three-headed Knight, the fiercest creature for yards around, but much too perilous, nonetheless.

--
Chris Page - Software Wrangler - Dylan Pundit

An ASCII character walks into a bar. Bartender asks, “What’ll you have?” ASCII character says, “Give me a double.” Bartender asks, “Having a bad day?” ASCII character says, “Yeah, I have a parity error.” Bartender says, “Hmmm. I thought you looked a bit off.”

Applescript-users mailing list      (email@hidden)