Re: Automator bug in "run shell script" ?
- Subject: Re: Automator bug in "run shell script" ?
- From: Ron Hunsinger <email@hidden>
- Date: Wed, 15 Jun 2011 02:38:46 -0700
On Jun 14, 2011, at 7:26 PM, Jean-Christophe Helary wrote:
> Sorry Ron, I must be really thick.
>
> What I am seeing is that grep works fine with any sort of string I feed it when I run it in the Terminal and that Automator does something to that string before the command is passed to the shell so that grep is not grepping the same thing at all anymore when run in Automator.
The problem isn't in Automator, AppleScript, or "do shell script".
As I said, the problem is that:
a) HFS+ forces all filenames to their canonical fully decomposed form
b) grep et al. know nothing of Unicode, nor of composed versus decomposed forms. They think the whole world is still ASCII.
So, in Terminal, try the following. (The number in the prompt is the exit code from the previous command.)
(0)$ # Try to create a file with a composed character in its name
(0)$ # I'll use が, the Japanese Hiragana ga character (voiced ka)
(0)$ date > が
(0)$
(0)$ # The file seems to be there
(0)$ ls -l が
-rw-r--r-- 1 ronk _ronk 29 Jun 15 01:31 が
(0)$ cat が
Wed Jun 15 01:31:10 PDT 2011
(0)$
(0)$ # But wait. What if, instead of specifying the name literally, we grep for it
(0)$ ls -l | grep が
(1)$ # grep returns an exit code of 1, signifying no match
(1)$
(1)$ # Let's try a wildcard
(1)$ cat が*
cat: が*: No such file or directory
(1)$ ls -l が*
ls: が*: No such file or directory
(1)$ # How odd. The asterisk should have been able to match an empty string.
(1)$ # Let's try again, using ka instead of ga
(1)$ cat か*
Wed Jun 15 01:31:10 PDT 2011
(0)$ ls -l か*
-rw-r--r-- 1 ronk _ronk 29 Jun 15 01:31 が
(0)$ #
(0)$ # Or, we could decompose ga as we type it, entering ka + voice
(0)$ cat が*
Wed Jun 15 01:31:10 PDT 2011
(0)$ # Looks exactly the same on the screen, but now works because we typed it the way it appears on disk
What happened was that when the shell passed が (the composed character) to the filesystem to create the file in the first step, HFS+ automagically expanded it to its fully decomposed form, because all HFS+ filenames MUST be fully decomposed.
When we try to access the file using its composed form, the filesystem again expands it to fully decomposed form before looking it up, and finds it.
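
The decomposition itself can be seen without touching the filesystem. Here is a minimal Python sketch, assuming that the standard NFD form from the `unicodedata` module is a fair stand-in for what HFS+ does (HFS+ uses its own variant of the decomposition rules, but for this character the result should be the same). The variable names are mine, for illustration only.

```python
import unicodedata

# が as typed: a single composed code point (HIRAGANA LETTER GA).
composed = "\u304c"

# NFD splits it into a base letter followed by a combining mark.
# Assumption: standard NFD approximates the HFS+ decomposition here.
decomposed = unicodedata.normalize("NFD", composed)

for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The two spellings look the same on screen but are different strings.
print(composed == decomposed)

# Normalizing both sides to one form makes them comparable again.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A lookup by exact name works because the filesystem applies that same kind of normalization to the name it is handed before comparing.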
When we use a wildcard, the shell gets a list of all names in the directory, and does a pattern match against the glob. The glob contains the character in its composed form, and therefore does not match any of the returned filenames, none of which contain any composed characters.
Unless, that is, the glob contains a piece of the decomposed character. The decomposed character consists of か (U+304B=HIRAGANA LETTER KA) followed by the combining voiced sound mark (U+3099=COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK). If we search for か*, the か matches the first half, and the * matches the second half.
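
The wildcard behaviour can be imitated with Python's `fnmatch` module, which, like the shell, matches a pattern against names without normalizing anything. This is only a sketch: `fnmatch` is standing in for the shell's own matcher, and `on_disk` and `glob_names` are hypothetical names for the directory listing and the matching step.

```python
import fnmatch

# The one filename in the directory, as the filesystem returns it:
# か (U+304B) followed by the combining voiced sound mark (U+3099).
on_disk = ["\u304b\u3099"]

def glob_names(pattern, names):
    """Return the names that match a shell-style pattern, with no normalization."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]

# Composed が followed by *: the first character never matches.
print(len(glob_names("\u304c*", on_disk)))

# か followed by *: か matches the base letter, * swallows the combining mark.
print(len(glob_names("\u304b*", on_disk)))

# Decomposed が followed by *: matches, because it is spelled as on disk.
print(len(glob_names("\u304b\u3099*", on_disk)))
```

The three calls correspond to the three wildcard attempts in the Terminal session: the first finds nothing, the other two find the file.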
Bear in mind that none of the Unix tools are aware that there's any Unicode in the vicinity. They think the input, output and filenames are ASCII, and the C library routines accommodate by converting what you type, and the UTF-16 that's actually on disk, to UTF-8. UTF-8, by design, is close enough to ASCII to pass for ASCII if you don't look too closely.
The filename on disk appears, after conversion to UTF-8, as the byte sequence E3 81 8B E3 82 99. When you type が, Terminal (actually readline) pretends you typed the three bytes E3 81 8C. The glob が* thus appears to be E3 81 8C 2A. The shell recognizes the 2A as an *, and searches for filenames beginning with E3 81 8C, but none of them do. If we glob for か* instead, it looks for filenames beginning with E3 81 8B and finds one.
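
The byte sequences are easy to check. A short Python sketch follows; `hex_bytes` is a hypothetical helper, and the prefix test at the end is only a rough imitation of what a byte-oriented tool ends up doing.

```python
def hex_bytes(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

name_on_disk = "\u304b\u3099"   # decomposed: か + combining voiced sound mark
composed_glob = "\u304c*"       # が* as typed, with composed が
ka_glob = "\u304b*"             # か*

print(hex_bytes(name_on_disk))   # E3 81 8B E3 82 99
print(hex_bytes(composed_glob))  # E3 81 8C 2A
print(hex_bytes(ka_glob))        # E3 81 8B 2A

def starts_with(name, prefix):
    """Byte-wise prefix test, with no knowledge of Unicode."""
    return name.encode("utf-8").startswith(prefix.encode("utf-8"))

print(starts_with(name_on_disk, "\u304c"))  # False: third byte is 8B, not 8C
print(starts_with(name_on_disk, "\u304b"))  # True
```

The composed and decomposed spellings differ from the third byte onward, which is all a byte-oriented comparison ever sees.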