Re: Posix path and High Ascii Characters

Subject: Re: Posix path and High Ascii Characters
From: Ron Hunsinger <email@hidden>
Date: Mon, 09 Sep 2002 18:35:29 -0700

At 1:46 AM +0200 9/10/02, alain content wrote:

> 4. You wrote:
>
>> -- the proper UTF-8 sequence for an e-acute is {101, 204, 129}.
>
> Now, this is just plain curiosity, but
> what's the relation between that and what I'm seeing -- e\314\201 --
> (except that perhaps 101 is 0065, hence the "e" ?)

It's just the difference between decimal and octal:

\314 (octal) = 204 (decimal) = 11 001 100 (binary)
\201 (octal) = 129 (decimal) = 10 000 001 (binary)
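
These conversions are easy to check with Python's octal literals and binary formatting. This is only a sketch for verifying the arithmetic; the variable names are made up for illustration.

```python
# The two bytes after the "e" in the escaped path, written as octal literals.
first = 0o314
second = 0o201

# The same values in decimal and as 8-bit binary strings.
print(first, format(first, "08b"))    # 204 11001100
print(second, format(second, "08b"))  # 129 10000001
```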

Put them together, and notice they have the pattern UTF-8 uses for values that need at least 8 but not more than 11 bits: 110xxxxx 10xxxxxx, where the xes stand for the 11 data bits. The particular value being represented is thus:

01100 000001 (binary) = \u0301 (hexadecimal)

This is the Unicode codepoint whose name is "COMBINING ACUTE ACCENT" and whose meaning is "put an acute accent on the previous character".
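
The bit extraction above can be sketched in Python. The helper name decode_two_byte_utf8 is hypothetical, and it handles only the two-byte pattern discussed here (it does not reject overlong forms the way a full decoder would); the standard unicodedata module is used to look up the codepoint's name.

```python
import unicodedata

def decode_two_byte_utf8(b1: int, b2: int) -> int:
    """Decode a 110xxxxx 10xxxxxx byte pair into a codepoint (illustrative helper)."""
    if b1 >> 5 != 0b110 or b2 >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the low 5 bits of the first byte and the low 6 bits of the second.
    return ((b1 & 0b11111) << 6) | (b2 & 0b111111)

cp = decode_two_byte_utf8(0o314, 0o201)
print(f"U+{cp:04X}", unicodedata.name(chr(cp)))  # U+0301 COMBINING ACUTE ACCENT

# Cross-check against Python's own UTF-8 decoder.
print(bytes([0o314, 0o201]).decode("utf-8") == chr(cp))  # True
```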

And before you ask, the reason for having such a codepoint is so you can put an accent on anything, even if no natural language needs that particular accented character. Mathematics needs to be typeset too, and mathematicians like to put all kinds of marks on all kinds of characters.

The drawback of having such a codepoint is that it gives you more than one way to "spell" the same character in Unicode. In the particular case you stumbled across, an e with an acute accent can be written either as:

"LATIN SMALL LETTER E" (\u0065) followed by
"COMBINING ACUTE ACCENT" (\u0301)

or as

"LATIN SMALL LETTER E WITH ACUTE" (\u00E9)

Think of it as having more than one way to spell the same character. [I believe it was Thomas Jefferson who said "I have nothing but contempt for a man who knows only one way to spell a word."]
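
The two spellings of e-acute can be compared directly in Python. This sketch uses the standard unicodedata.normalize function with the Unicode normalization forms NFC (composed) and NFD (decomposed); the variable names are illustrative only.

```python
import unicodedata

decomposed = "\u0065\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE

# As raw codepoint sequences the two strings are different...
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 2 1

# ...but converting either one to the other's form makes them equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```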

Normally, having more than one way to do something is a feature, but in the case of filenames it can be downright confusing to the user (and a royal PITA for the filesystem). Imagine trying to find a file with several such characters in its name when you don't know which spelling was used for each character.

HFS+ solves the problem by standardizing on a particular "spelling" for each character that can be represented in more than one way. In particular, it always prefers the form that uses a "COMBINING..." over the one that doesn't. (It has to prefer one or the other, and these fully decomposed forms have greater generality.)
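
The idea of standardizing on one spelling before comparing names can be sketched as follows. This is not the HFS+ implementation: it assumes Python's standard NFD normalization as a stand-in for the filesystem's own decomposition rules, which may differ in detail. The function names and sample filenames are invented for illustration.

```python
import unicodedata

def canonical_name(name: str) -> str:
    """Pick one spelling: fully decomposed (a stand-in for the filesystem's rule)."""
    return unicodedata.normalize("NFD", name)

def find_file(wanted: str, names: list[str]) -> list[str]:
    """Return the stored names that match `wanted`, whichever spelling was typed."""
    key = canonical_name(wanted)
    return [n for n in names if canonical_name(n) == key]

# Stored names use the decomposed spelling; the query uses the precomposed one.
stored = ["re\u0301sume\u0301.txt", "notes.txt"]
print(find_file("r\u00e9sum\u00e9.txt", stored) == [stored[0]])  # True
```

Because both sides pass through the same canonical_name step, the caller never needs to know which spelling was used for each character.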

But notice that it's only the filesystem that cares. All the other Unicode-aware software layers are perfectly happy with the single-codepoint "LATIN SMALL LETTER E WITH ACUTE". Looking at that again, you can see:

\u00E9 (hexadecimal) = 0000 0000 1110 1001 (binary)
= 00011 101001 (still binary, but grouped differently)

UTF-8 uses the same 110xxxxx 10xxxxxx pattern as before to encode that as:

11000011 10101001 (binary) = \303\251 (octal)
= C3 A9 (hexadecimal) = {195, 169} (decimal)

some of which you may have seen earlier in this thread.
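
The encoding direction can be sketched the same way. The helper name encode_two_byte_utf8 is hypothetical and covers only codepoints that fit the two-byte pattern; the result is checked against Python's built-in UTF-8 encoder.

```python
def encode_two_byte_utf8(cp: int) -> bytes:
    """Encode a codepoint in U+0080..U+07FF as 110xxxxx 10xxxxxx (illustrative helper)."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("codepoint does not fit the two-byte pattern")
    first = 0b11000000 | (cp >> 6)         # top 5 data bits
    second = 0b10000000 | (cp & 0b111111)  # low 6 data bits
    return bytes([first, second])

encoded = encode_two_byte_utf8(0x00E9)
print(list(encoded))                        # [195, 169]
print([oct(b) for b in encoded])            # ['0o303', '0o251']
print(encoded.hex())                        # c3a9
print(encoded == "\u00e9".encode("utf-8"))  # True
```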

-Ron Hunsinger
