Re: Automator bug in "run shell script" ?
- Subject: Re: Automator bug in "run shell script" ?
- From: Ron Hunsinger <email@hidden>
- Date: Tue, 14 Jun 2011 12:53:16 -0700
On Jun 14, 2011, at 7:30 AM, Jean-Christophe Helary wrote:
> I get that with all accented Latin characters (including ç) but also with Japanese characters that contain a voicing mark like "が" where instead of getting a U+304C, Automator will produce a U+304B followed by a U+3099 [か ゙].
>
> That bug practically means that no international text processing is possible with the "run shell script" action of Automator.
Bugs are in the eye of the beholder.
What you're seeing is that Apple sometimes converts Unicode characters to their fully decomposed form. And to understand what "fully decomposed form" means, you have to understand the difference between characters and codepoints.
Unicode defines a number of codepoints, which are numeric codes to represent graphical entities. Some of those codepoints are defined as "COMBINING", meaning that they combine with the preceding codepoint(s) to form a single character.
Thus, there is not a one-to-one correspondence between characters and codepoints. Some characters can be "spelled" in more than one way. As in your example, the Hiragana character が can be encoded in Unicode either as the single codepoint U+304C (HIRAGANA LETTER GA) or as the pair of codepoints U+304B, U+3099 (HIRAGANA LETTER KA, COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).
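To make the two spellings concrete, here is a minimal sketch in Python (my choice for illustration; nothing in Automator requires it). It uses the standard unicodedata module, and the helper names codepoints and describe are made up for this example.

```python
import unicodedata

# The two spellings of the same character, written as explicit escapes
# so that no editor or mail client can silently convert one to the other.
composed = "\u304c"          # HIRAGANA LETTER GA
decomposed = "\u304b\u3099"  # HIRAGANA LETTER KA + COMBINING VOICED SOUND MARK

def codepoints(text):
    """Return the codepoints of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

def describe(text):
    """Print each codepoint with its Unicode name and whether it combines."""
    for ch in text:
        # combining() returns a nonzero class for combining marks.
        kind = "combining" if unicodedata.combining(ch) else "base"
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({kind})")

print(codepoints(composed))
print(codepoints(decomposed))
describe(decomposed)

# A plain comparison works on codepoints, so it reports a difference.
print(composed == decomposed)
```

The last line prints False: one character to the reader, two different codepoint sequences to the program.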
The thing to understand is that even though the Unicode codepoints are different, THEY'RE STILL THE SAME CHARACTER. Supporting international text isn't easy, which is why it took so long. Unicode makes things a lot easier, but it's no panacea. There are subtleties that even Unicode can't step around. Any international text processing worthy of the name must take these subtleties into account.
AppleScript, for example, recognizes these sequences as the same character. If you drag them from the Character Viewer into two strings in AppleScript, AS will display them the same and say they're equal. The only way to tell they're still different is to try dragging them back into the Character Viewer window. Mail does likewise: が and が look the same in this message, but the former is decomposed and the latter is not. Drag the latter to Character Viewer, and it'll show as HIRAGANA LETTER GA. Drag the former to Character Viewer, and it's confused because you dragged in two codepoints, and Character Viewer, despite its name, is for viewing codepoints, not characters.
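A shell script gets no such help, so the comparison has to be done by hand. The following is a rough Python sketch of a character-level comparison; canonically_equal is a hypothetical helper name, and I am not claiming this is how AppleScript does it internally. It simply converts both strings to the same normalization form before comparing.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings as characters rather than as raw codepoints.

    Both sides are converted to fully decomposed form (NFD) first, so
    different spellings of the same character compare equal.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Two spellings of HIRAGANA GA: same character, different codepoints.
print(canonically_equal("\u304c", "\u304b\u3099"))

# Two spellings of c with cedilla, the accented Latin case.
print(canonically_equal("\u00e7", "c\u0327"))

# KA and GA really are different characters.
print(canonically_equal("\u304b", "\u304c"))
```

The first two comparisons print True and the third prints False.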
Apple is forced to expand characters in filenames into a canonical form, so that filenames with the same characters don't wind up on disk with different codepoint sequences. The canonical form they chose was to expand all characters into their fully decomposed form, in which any combining code that can be separated from the base character is so separated, and the combining codes (there may be more than one) are in a standard order.
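The "standard order" part can be illustrated with Python's NFD normalization. This is only a sketch: fully_decompose is a name invented here, and standard NFD may not match what the file system does in every detail. The example puts a dot above and a dot below on the letter q, a combination with no single-codepoint form.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

def fully_decompose(text):
    """Return text in NFD: marks separated from bases and put in standard order."""
    return unicodedata.normalize("NFD", text)

# The same character typed with its two marks in either order.
one_order = "q" + DOT_ABOVE + DOT_BELOW
other_order = "q" + DOT_BELOW + DOT_ABOVE

# As raw codepoint sequences the two spellings differ.
print(one_order == other_order)

# After decomposition both have the marks in the same order
# (lower combining class first), so they compare equal.
print(fully_decompose(one_order) == fully_decompose(other_order))
print([f"U+{ord(ch):04X}" for ch in fully_decompose(one_order)])
```

Once both strings are in that form, an ordinary codepoint comparison gives the right answer, which is exactly what you want for filenames.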
Automator is probably expanding all text to fully decomposed form. They have to do that for filenames anyway, and getting all text into a canonical form simplifies international text processing.
You might be able to work around your problem by passing the output from "run shell script" from one Automator step to the next, letting Automator do the work of fully decomposing the characters. Or just take advantage of the fact that AppleScript knows what it's doing.
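A different approach from the two above, offered only as an untested-in-Automator sketch: if a tool in your script insists on precomposed characters, the script can recompose its own input before using it. The helper name recompose is hypothetical, and I am assuming the script language is Python and the text arrives as UTF-8.

```python
import unicodedata

def recompose(text):
    """Return text in NFC, the precomposed form where one exists."""
    return unicodedata.normalize("NFC", text)

# What the script might receive: KA followed by the combining voicing mark.
received = "\u304b\u3099"
fixed = recompose(received)
print([f"U+{ord(ch):04X}" for ch in fixed])

# Accented Latin text is handled the same way: c + COMBINING CEDILLA.
print(recompose("c\u0327") == "\u00e7")
```

In a script that reads its input from stdin, the same function could be applied to sys.stdin.read() before any further processing.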
But it's all part and parcel of "international text processing". It's not easy. You're going to have to do some of the work.