Character sets, applescript, and the web [almost ASCII]
- Subject: Character sets, applescript, and the web [almost ASCII]
- From: Brian Johnson <email@hidden>
- Date: Mon, 3 Dec 2001 09:23:03 -0800 (PST)
The recent "ASCII" discussion has pushed me over the edge, so I'm going to
try and fix a problem I've had for some time related to this. I'm hoping
that some of you have a better grasp on this than I do! I need to know how
to properly support different system character sets in CGIs. Here's what I
think so far ....
1. A Mac web server sends out an HTML file to a browser. Interpretation of
the characters in this file depends on the presence of a "Meta charset"
tag, or falls back to the browser's default.
2. The user types text into an INPUT or TEXTAREA field, using their
native language and all appropriate characters.
3. (Does the browser convert the text to the charset of the HTML?) The
browser then sends this text, after URL-encoding, to the server.
4. The server simply passes the form arguments through to the CGI, right?
(Do W*, Apache, WebTen, QPQ, PWS, etc. all do the same thing here?)
5. The CGI decodes the URL and gets ... (what? MacRoman? Latin-1? Does it
depend on the language config of the host? English vs. whatever? See the
sketch after this list.)
6. The CGI may optionally use various OSAX to get date strings, etc. What
charset do these come in?
7. The CGI (in my case) writes the user's input to the same HTML file that
started it all, presumably honoring the local character set.... (Ha!)
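To make the byte-level question concrete, here's a tiny sketch (in Python
rather than AppleScript, purely to illustrate; the codec names are Python's
own). The browser URL-encodes the raw bytes of the form text; which
characters those bytes "mean" depends entirely on the charset the page was
served in, because the encoded data itself doesn't say:

  from urllib.parse import unquote_to_bytes

  # "%E9" is a single byte, 0xE9. From a Latin-1 page it was e-acute;
  # read the same byte as MacRoman and you get E-grave instead.
  raw = unquote_to_bytes("name=Ren%E9")
  print(raw)                       # b'name=Ren\xe9'
  print(raw.decode("latin-1"))     # interpreted as Latin-1
  print(raw.decode("mac_roman"))   # same bytes interpreted as MacRoman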
I've got clients who use the CGI (ConferWeb) with non-Roman scripts like
Chinese and say it works just fine. I've got others who just want accents
and umlauts who can't seem to get what they need. I'd like to understand
all this. Is there an easy way to reliably simulate (on a US system) input
from a non-US system? Does the server's language setting play into this?
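One trick I've been thinking about for faking foreign input on a US system
(again in Python, just as a sketch; the URL and field name below are made
up): percent-encode the bytes yourself in whatever charset you want to
pretend the browser used, then POST them straight at the CGI and see what
comes back:

  from urllib.parse import quote_from_bytes
  from urllib.request import Request, urlopen

  text = "Füße"                                        # umlaut and sharp-s
  body = "comment=" + quote_from_bytes(text.encode("mac_roman"))
  req = Request("http://localhost/cgi-bin/test.cgi",   # made-up test URL
                data=body.encode("ascii"),
                headers={"Content-Type": "application/x-www-form-urlencoded"})
  print(urlopen(req).read())
  # Swap "mac_roman" for "latin-1" to compare how the CGI handles each.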
Any and all input is most welcome!
Brian Johnson, Dept of Architecture, University of Washington, Seattle
ConferWeb:
http://www.caup.washington.edu/software/
p.s. I recall reading that the 8th ASCII bit was originally meant as a
parity bit in the RS-232 standard. It (and its 128 hi-ASCII characters)
got hijacked in the '80s. And, of course, different folks had different
ideas on how to use them. We had a charset ROM for an early PC (iirc) that
had fragments of integral signs and sigmas in the upper-ASCII area, along
with a word-processor for doing equations, etc.