Re: Bad Characters from Unicode…

Subject: Re: Bad Characters from Unicode…
From: "Mark J. Reed" <email@hidden>
Date: Tue, 2 Oct 2007 19:15:59 -0400

> Is WINDOWS-1252 an alternate way to name UTF-8 ?

Not even close.

As I said earlier, it appears that Mail.app has a list of possible
character sets, and it uses the smallest one that has all the
characters in the message.  The character set you use to *type* the
message doesn't enter into it.  If Mail has a "use this encoding for
this message" option and isn't honoring it, well, that's a bug.

The character sets it uses seem to be these:

1. US-ASCII.   128 characters (95 printable).
2. ISO-8859-1 ("Latin-1").  256 characters (191 printable).
3. Windows-1252.  251 characters (218 printable); 5 byte values are unassigned.
4. Unicode (via UTF-8).  Up to 1114112 characters.

Which means that:

If you send a message that says just "Hello!", it will go in US-ASCII.

If you send a message that says "¡Hola!" it will go in ISO-8859-1.

If you send a message that says "Wait…" it will go in Windows-1252.

If you send a message that says "Здравствуйте!", it will go in Unicode.
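The selection rule can be sketched in a few lines of Python. This only illustrates the "smallest charset that fits" idea and is not Mail.app's actual code: the function name pick_charset and the candidate table are assumptions, and Python's cp1252 codec stands in for Windows-1252.

```python
# Candidate charsets, smallest first, paired with the Python codec name.
# The table mirrors the four charsets listed above; it is an assumption,
# not Mail.app's real list.
CANDIDATES = [
    ("US-ASCII", "ascii"),
    ("ISO-8859-1", "latin-1"),
    ("Windows-1252", "cp1252"),
    ("UTF-8", "utf-8"),
]


def pick_charset(message):
    """Return the label of the first charset that can encode every character."""
    for label, codec in CANDIDATES:
        try:
            message.encode(codec)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this charset
        return label
    raise ValueError("no candidate charset can encode this message")


for text in ["Hello!", "¡Hola!", "Wait…", "Здравствуйте!"]:
    print(text, "->", pick_charset(text))
```

Note that the function looks only at the characters in the message, never at how it was typed, which matches the behavior described above.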

UTF-8 is one of many ways of taking a Unicode message, which consists
of characters, and turning it into a sequence of actual data bytes you
can transmit.
For single-byte character sets like the first three above, this is
simple, since they have at most 256 characters: just match up bytes
to characters one-to-one.
But since there are only 256 different bytes, and Unicode has many
more than 256 characters, at least some of them have to be represented
by multiple bytes each.  That fact introduces a tradeoff between
simplicity (e.g. every character takes up 4 bytes no matter what) and
efficiency (e.g. common characters take up one byte, slightly rarer
characters two bytes, etc.).
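To make the tradeoff concrete, here is a rough Python comparison. It assumes UTF-32 as the "4 bytes no matter what" scheme and UTF-8 as the variable-length one; the helper name encoded_sizes is made up for this sketch.

```python
def encoded_sizes(text):
    """Return (character count, UTF-32 byte count, UTF-8 byte count)."""
    # The "-be" variant is used so no byte-order mark is added to the count.
    fixed = text.encode("utf-32-be")   # always 4 bytes per character
    variable = text.encode("utf-8")    # 1 to 4 bytes per character
    return len(text), len(fixed), len(variable)


for text in ["Hello!", "Здравствуйте!"]:
    chars, utf32_bytes, utf8_bytes = encoded_sizes(text)
    print(f"{text}: {chars} characters, "
          f"{utf32_bytes} bytes in UTF-32, {utf8_bytes} bytes in UTF-8")

# In a single-byte charset, by contrast, one byte is exactly one character.
print(bytes([0xA1]).decode("latin-1"))
```

The fixed-width form is trivial to index into, but it quadruples the size of plain ASCII text; the variable-width form is compact but needs a little decoding logic to find character boundaries.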

So there are several different ways of turning a sequence of Unicode
characters into a sequence of data bytes; these are called "Unicode
Transformation Formats", or UTFs.  And the most common, at least in
email, is UTF-8, which uses 8 bits (one byte) for the characters that
match US-ASCII, and two to four bytes for other characters.
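A quick way to see those byte counts is to encode one character at a time. This is just a sketch; utf8_width is a name invented here, and the sample characters are arbitrary picks (U+1D11E is a musical symbol outside the 16-bit range).

```python
def utf8_width(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample each for the 1-, 2-, 3- and 4-byte cases.
samples = ["H", "\u00a1", "\u2026", "\U0001D11E"]
for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_width(ch)} byte(s) in UTF-8")

# Text that is pure US-ASCII comes out byte-for-byte identical in UTF-8.
print("Hello!".encode("utf-8") == "Hello!".encode("ascii"))
```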

Unicode text files as stored on disk in Mac OS X are usually in
UTF-16, which uses two bytes for most characters and four bytes for
the rest.
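The same kind of check works for UTF-16. Again this is only a sketch with an invented helper name (utf16_width); it says nothing about how any particular Mac OS X application writes its files.

```python
def utf16_width(ch):
    """Number of bytes a single character occupies in UTF-16."""
    # "utf-16-be" is used because plain "utf-16" prepends a 2-byte
    # byte-order mark, which would inflate the count.
    return len(ch.encode("utf-16-be"))


# Characters up to U+FFFF take one 16-bit unit; anything above takes two.
for ch in ["H", "\u2026", "\u0417", "\U0001D11E"]:
    print(f"U+{ord(ch):04X} takes {utf16_width(ch)} bytes in UTF-16")
```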

--
Mark J. Reed <email@hidden>

AppleScript-Users mailing list      (email@hidden)
Archives: http://lists.apple.com/archives/applescript-users
