Re: Posix path and High Ascii Characters
- Subject: Re: Posix path and High Ascii Characters
- From: Ron Hunsinger <email@hidden>
- Date: Mon, 09 Sep 2002 18:35:29 -0700
At 1:46 AM +0200 9/10/02, alain content wrote:
> 4. You wrote:
> -- the proper UTF-8 sequence for an e-acute is {101, 204, 129}.
> Now, this is just plain curiosity, but
> what's the relation between that and what I'm seeing -- e\314\201 --
> (except that perhaps 101 is 0065, hence the "e" ?)
It's just the difference between decimal and octal:
\314 (octal) = 204 (decimal) = 11 001 100 (binary)
\201 (octal) = 129 (decimal) = 10 000 001 (binary)
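The conversions above are easy to check mechanically. A minimal Python sketch follows; the variable names are illustrative and not from the thread:

```python
# Treat the two escaped bytes from the listing as octal numbers.
octal_digits = ["314", "201"]

# int(text, 8) parses a string as base-8.
byte_values = [int(digits, 8) for digits in octal_digits]

# Render each byte as eight binary digits.
bit_patterns = [format(value, "08b") for value in byte_values]

print(byte_values)   # [204, 129]
print(bit_patterns)  # ['11001100', '10000001']
```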
Put them together, and notice they have the pattern UTF-8 uses for
values that need at least 8 but not more than 11 bits: 110xxxxx
10xxxxxx, where the xes stand for the 11 data bits. The particular
value being represented is thus:
01100 000001 (binary) = \u0301 (hexadecimal)
This is the Unicode codepoint whose name is "COMBINING ACUTE ACCENT"
and whose meaning is "put an acute accent on the previous character".
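The bit extraction described above can be sketched in Python. The helper name `decode_two_byte_utf8` is hypothetical, and the sketch handles only the two-byte pattern; it does not reject overlong encodings or cover the other UTF-8 lengths:

```python
import unicodedata

def decode_two_byte_utf8(lead: int, trail: int) -> int:
    """Decode a 110xxxxx 10xxxxxx byte pair into its code point."""
    if lead & 0b11100000 != 0b11000000:
        raise ValueError("lead byte does not match 110xxxxx")
    if trail & 0b11000000 != 0b10000000:
        raise ValueError("trailing byte does not match 10xxxxxx")
    # Keep the 5 data bits of the lead byte and the 6 of the trailing byte.
    return ((lead & 0b00011111) << 6) | (trail & 0b00111111)

code_point = decode_two_byte_utf8(204, 129)
print(f"U+{code_point:04X}")              # U+0301
print(unicodedata.name(chr(code_point)))  # COMBINING ACUTE ACCENT
```

Python's built-in decoder agrees: `bytes([204, 129]).decode("utf-8")` yields the same single character.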
And before you ask, the reason for having such a codepoint is so you
can put an accent on anything, even if no natural language needs that
particular accented character. Mathematics needs to be typeset too,
and mathematicians like to put all kinds of marks on all kinds of
characters.
The drawback of having such a codepoint is that it gives you more
than one way to "spell" the same character in Unicode. In the
particular case you stumbled across, an e with an acute accent can be
written either as:
"LATIN SMALL LETTER E" (\u0065) followed by
"COMBINING ACUTE ACCENT" (\u0301)
or as
"LATIN SMALL LETTER E WITH ACUTE" (\u00E9)
Think of it as having more than one way to spell the same character.
[I believe it was Thomas Jefferson who said "I have nothing but
contempt for a man who knows only one way to spell a word."]
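To see the two spellings side by side, here is a short Python sketch using the standard `unicodedata` module. NFC and NFD are the standard Unicode composed and decomposed normalization forms; the variable names are illustrative:

```python
import unicodedata

decomposed = "\u0065\u0301"  # e followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE

# As raw strings, the two spellings are different sequences.
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 2 1

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```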
Normally, having more than one way to do something is a feature, but
in the case of filenames it can be downright confusing to the user
(and a royal PITA for the filesystem). Imagine trying to find a file
with several such characters in its name when you don't know which
spelling was used for each character.
HFS+ solves the problem by standardizing on a particular "spelling"
for each character that can be represented in more than one way. In
particular, it always prefers the form that uses a "COMBINING..."
over the one that doesn't. (It has to prefer one or the other, and
these fully decomposed forms have greater generality.)
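As a rough illustration of what that standardization does to a name, the Python sketch below decomposes a string and shows the resulting UTF-8 bytes. It assumes that standard NFD is a close enough stand-in for the filesystem's own decomposition rules, which may differ in details; both helper names are hypothetical:

```python
import unicodedata

def decomposed_utf8(name: str) -> bytes:
    # Assumption: standard NFD approximates the filesystem's preferred spelling.
    return unicodedata.normalize("NFD", name).encode("utf-8")

def octal_escape(data: bytes) -> str:
    # Show non-ASCII bytes as \ooo, the way the listing in this thread does.
    return "".join(chr(b) if b < 0x80 else f"\\{b:03o}" for b in data)

raw = decomposed_utf8("\u00e9")  # start from the single-codepoint spelling
print(list(raw))                 # [101, 204, 129]
print(octal_escape(raw))         # e\314\201
```

Either spelling of the character goes in, and the same byte sequence comes out, which is what makes lookups by name predictable.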
But notice that it's only the filesystem that cares. All the other
unicode-aware software layers are perfectly happy with the
single-codepoint "LATIN SMALL LETTER E WITH ACUTE". Looking at that
again, you can see:
\u00E9 (hexadecimal) = 0000 0000 1110 1001 (binary)
= 00011 101001 (still binary, but grouped differently)
UTF-8 uses the same 110xxxxx 10xxxxxx pattern as before to encode that as:
11000011 10101001 (binary) = \303\251 (octal)
= é (hexadecimal) = {195,169} (decimal)
some of which you may have seen earlier in this thread.
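The encoding direction can be sketched the same way. The function name `encode_two_byte_utf8` is hypothetical, and the sketch covers only code points that need exactly two bytes:

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not need exactly two bytes")
    lead = 0b11000000 | (code_point >> 6)           # top 5 data bits
    trail = 0b10000000 | (code_point & 0b00111111)  # low 6 data bits
    return bytes([lead, trail])

encoded = encode_two_byte_utf8(0x00E9)
print(" ".join(f"{b:08b}" for b in encoded))   # 11000011 10101001
print("".join(f"\\{b:03o}" for b in encoded))  # \303\251
print(" ".join(f"{b:02X}" for b in encoded))   # C3 A9
print(list(encoded))                           # [195, 169]
```

The result matches what Python's own encoder produces for `"\u00e9".encode("utf-8")`.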
-Ron Hunsinger