I managed to build a 'quick'n dirty' routine for Latin (ASCII)
text using the table you supplied, after removing the Unicode stuff.
As I know nothing about Perl, is there any place where I can find a
how-to for the Perl module "Unicode::Normalize"? I googled a little
but didn't find anything helpful.
I will have a look at the iconv man page as well for the other standards.
Chris, I am ambitious, but this 'you could try adding ISO 6937 to
the system' is beyond my scope.
Again thanks to all for the tremendous help.
I tried to write a model of Perl code to convert a string in "ISO
6937/2-1983, Addendum 1-1989" to UTF-8; this is laborious, and I
could only write a model, not the whole code...
Anyway...:
Preparation:
I think you will need UnicodeChecker
(<http://earthlingsoft.net/UnicodeChecker/>) to get started.
Next, you will need to prepare two Perl hashes. A Perl hash looks
like this: my %hash = ("a" => "z", "b" => "y");
Hereafter, I will use the form "\x20" for a one-byte encoding, and
"\x{0020}" for the Unicode equivalent.
1. For the one byte part, you will have to make a Perl hash like this:
my %others = ("\x24" => "\x{00A4}", "\xA4" => "\x{0024}", ...);
Explanation: in ISO 6937/2, the character 0x24 (normally "$") is used
for the "general currency sign" (¤) (see
<http://en.wikipedia.org/wiki/ISO_6937>);
the character "\xA4" ("§" in MacRoman) is used for "$" = \x{0024},
etc...
I didn't make the complete hash, because there are many code points.
For code points like "\xE4", you will probably need to use
UnicodeChecker to find out the corresponding character in Unicode:
you will open the find panel (CMD + F) in UnicodeChecker, and type
there "Latin Capital Letter H"; many candidates will be listed in the
list below; you will find there "LATIN CAPITAL LETTER H WITH STROKE
[U+0126]": this is the code you need: "\xE4" => "\x{0126}", and so
on... This is probably the most cumbersome part.
2. For the range \x80-\x8A: you will have to make a Perl hash like
the following:
my %controls = ("\x80" => "<i>", "\x81" => "</i>", ..., "\x8A" => "\n");
Explanation: these are the control characters described in the table
on page 14 of the document "tec_doc_t3264_tcm6-10528_EBU_STL.pdf". I
didn't make the complete hash; you will have to complete it. And of
course you can use any other tags for "Italic ON", "Italic OFF", etc.
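As a rough sketch of how such hashes might be applied (the sub name "map_chars" is made up for the illustration, and the sample string is invented), a single substitution with the /e modifier can look up every character in a hash and leave the unknown ones alone. Because each character is looked at only once, the "\x24" <-> "\xA4" swap does not undo itself:

```perl
use strict;
use warnings;

# Incomplete sample hashes, as in points 1 and 2 above.
my %others   = ("\x24" => "\x{00A4}", "\xA4" => "\x{0024}");
my %controls = ("\x80" => "<i>", "\x81" => "</i>", "\x8A" => "\n");

# map_chars is a hypothetical helper: it replaces every character that
# is a key of the given hash and keeps all other characters unchanged.
sub map_chars {
    my ($text, $map) = @_;
    $text =~ s/(.)/exists $map->{$1} ? $map->{$1} : $1/egs;
    return $text;
}

# Invented sample: italic on, "cost", italic off, both currency bytes, newline byte.
my $sample = "\x80cost\x81: 5\xA4 or 5\x24\x8A";
my $result = map_chars(map_chars($sample, \%others), \%controls);
# $result is now "<i>cost</i>: 5$ or 5¤" followed by a linefeed
```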
3. For the range \xC1-\xCF: you will have to prepare code lines like
this:
my @aeiou = qw (A E I O U a e i o u);
my $uni_grave = "\x{0300}";
# Grave 0xC1 0x0300 AEIOUaeiou
Explanation: each of the code points in this range accepts only a
defined series of characters with which it will be combined. For
example, "\xC1", grave accent, accepts only one of the series
"AEIOUaeiou". The line "my @aeiou = qw (A E I O U a e i o u);" makes
a Perl list variable named "@aeiou", storing each of the character
series in question. The line "my $uni_grave = "\x{0300}";" makes a
Perl scalar variable named "$uni_grave", storing Unicode COMBINING
GRAVE ACCENT, which is "\x{0300}". You can find this using
UnicodeChecker, entering "combining grave" in the Find field. The
third line "# Grave 0xC1 0x0300 AEIOUaeiou" is a simple comment, to
not forget what we are doing...
You will have to make similar lines, for example:
my @aceilnorsuyz = qw (A C E I L N O R S U Y Z a c e i l n o r s u y z);
my $uni_acute = "\x{0301}";
# Acute 0xC2 0x0301 ACEILNORSUYZaceilnorsuyz
my @aceghijosuwy = qw (A C E G H I J O S U W Y a c e g h i j o s u w y);
my $uni_circumflex = "\x{0302}";
# Circumflex 0xC3 0x0302 ACEGHIJOSUWYaceghijosuwy
...
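The pairs prepared above could also be kept in one table, so that adding the code points up to "\xCF" only means adding rows. This is a minimal sketch, assuming a hypothetical hash "%diacritics" and a hypothetical sub "combine_diacritics", and using a character class instead of one variable per letter list:

```perl
use strict;
use warnings;

# One row per ISO 6937 diacritic byte: the Unicode combining mark and
# the letters that the byte may be combined with (only three rows here).
my %diacritics = (
    "\xC1" => { mark => "\x{0300}", letters => "AEIOUaeiou" },               # Grave
    "\xC2" => { mark => "\x{0301}", letters => "ACEILNORSUYZaceilnorsuyz" }, # Acute
    "\xC3" => { mark => "\x{0302}", letters => "ACEGHIJOSUWYaceghijosuwy" }, # Circumflex
);

# combine_diacritics is a hypothetical helper: ISO 6937 puts the diacritic
# byte BEFORE the letter, Unicode puts the combining mark AFTER it.
sub combine_diacritics {
    my ($text) = @_;
    for my $byte (sort keys %diacritics) {
        my ($mark, $letters) = @{ $diacritics{$byte} }{qw(mark letters)};
        $text =~ s/\Q$byte\E([$letters])/$1$mark/g;
    }
    return $text;
}

my $converted = combine_diacritics("\xC1a toa. S\xC2o");
# $converted is now "a\x{0300} toa. So\x{0301}"
```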
Now, here is the code I wrote:
#!/usr/bin/perl
use strict;
use warnings;
local undef $/; # slurp mode: read the whole file in one go
# $infile is the path of the file storing data like this:
# "Fiquei aqui ¡a toa. S¬o paguei pelo jatoäpara ganharmos tempo."
my $infile = "/Users/[my_account]/Desktop/test.txt";
open (IN, $infile) or die "Cannot open $infile: $!";
$_ = <IN>;
close (IN);
my @aeiou = qw (A E I O U a e i o u); # see above, "Preparation (3)"
my $uni_grave = "\x{0300}";
# Grave 0xC1 0x0300 AEIOUaeiou
my @aceilnorsuyz = qw (A C E I L N O R S U Y Z a c e i l n o r s u y z);
my $uni_acute = "\x{0301}";
# Acute 0xC2 0x0301 ACEILNORSUYZaceilnorsuyz
my @aceghijosuwy = qw (A C E G H I J O S U W Y a c e g h i j o s u w y);
my $uni_circumflex = "\x{0302}";
# Circumflex 0xC3 0x0302 ACEGHIJOSUWYaceghijosuwy
# see above, "Preparation (1)"
my %others = ("\x24" => "\x{00A4}", "\xA4" => "\x{0024}");
# see above, "Preparation (2)"
my %controls = ("\x80" => "<i>", "\x81" => "</i>", "\x8A" => "\n");
foreach my $key (@aeiou) { # see explanation (1) below
s/\xC1$key/$key$uni_grave/gs;
}
s/\xC1\x20/\x{0060}/sg; # see explanation (2) below
foreach my $key (@aceilnorsuyz) { # for acute accent
s/\xC2$key/$key$uni_acute/gs;
}
s/\xC2\x20/\x{00B4}/gs;
foreach my $key (@aceghijosuwy) { # for circumflex accent
s/\xC3$key/$key$uni_circumflex/gs;
}
s/\xC3\x20/\x{02C6}/gs;
s/(.)/my $temp = $1; exists $others{$temp} ? $others{$temp} : $temp/eg; # see explanation (3) below
s/(.)/my $temp = $1; exists $controls{$temp} ? $controls{$temp} : $temp/eg; # see explanation (4) below
use Unicode::Normalize; # see explanation (5) below
$_ = NFC($_);
binmode (STDOUT, ":utf8"); # see explanation (6) below
print;
##### end of script ######
In this script, we read the data in the file "test.txt"; it is stored
in the special variable "$_": the "default scalar" variable.
I pasted the prepared lines for the range "\xC1-\xC3" (you will have
to add from "\xC4" to "\xCF"); I also pasted the prepared hashes
"%others" and "%controls". Then, the conversion begins:
Explanation:
1. The lines:
foreach my $key (@aeiou) {
s/\xC1$key/$key$uni_grave/gs;
}
are a loop over the list variable @aeiou. Each item of the list is
assigned in turn to the variable "$key"; the second line does a global
find and replace: it searches for "\xC1a", "\xC1e", etc., and if it
finds a matching string, it replaces it with "a\x{0300}", "e\x{0300}",
etc., that is, an "a" followed by a "COMBINING GRAVE ACCENT".
2. The next line:
s/\xC1\x20/\x{0060}/sg;
also does a global find and replace: it searches for "\xC1" + space,
and replaces it with "`" (= "\x{0060}"): this is because
<http://blogs.msdn.com/michkap/archive/2005/01/22/358675.aspx> says
"A diacritic as a free-standing character is created by coding a
space behind the byte that represents the "diacritical mark"." For
some of the replacing characters, you will have to search in the
Unicode range "Spacing Modifier Letters" (\x{02B0}-\x{02FF}): for
example, "Double Acute" is "\x{02DD}"...
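For the free-standing case, the per-diacritic lines could be folded into one lookup. This is a minimal sketch, assuming a hypothetical hash "%spacing" that holds only the three replacements used in the script, and a made-up sub name "free_standing":

```perl
use strict;
use warnings;

# Diacritic byte followed by a space => free-standing (spacing) character.
my %spacing = (
    "\xC1" => "\x{0060}",  # GRAVE ACCENT
    "\xC2" => "\x{00B4}",  # ACUTE ACCENT
    "\xC3" => "\x{02C6}",  # MODIFIER LETTER CIRCUMFLEX ACCENT
);

# free_standing is a hypothetical helper name.
sub free_standing {
    my ($text) = @_;
    # A diacritic byte directly followed by a space becomes one character.
    $text =~ s/([\xC1-\xC3])\x20/$spacing{$1}/g;
    return $text;
}

my $out = free_standing("x\xC1 y\xC3 z");
# $out is now "x`y\x{02C6}z"
```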
3. The line "s/(.)/my $temp = $1; exists $others{$temp} ?
$others{$temp} : $temp/eg;" does a global find and replace for
characters that are stored in the hash variable "%others".
4. The line "s/(.)/my $temp = $1; exists $controls{$temp} ?
$controls{$temp} : $temp/eg;" does a global find and replace for
characters that are stored in the hash variable "%controls" (for
example, "\x8A" will be replaced by a linefeed...).
5. The two lines:
use Unicode::Normalize; # see explanation (5) below
$_ = NFC($_);
"normalize" the combined diacritical characters (for example
"a\x{0300}") into the precomposed equivalents (for example "à", which
is "\x{00E0}") whenever this is possible.
6. Finally, the line "binmode (STDOUT, ":utf8");" makes sure that
everything "printed" to STDOUT is encoded as UTF-8.
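The last two steps can be checked in isolation. This small sketch uses Unicode::Normalize and the core Encode module (Encode is an addition for the illustration; the script itself uses binmode instead). It shows that NFC folds the letter and its combining mark into one character, and how many bytes that character takes in UTF-8:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);
use Encode qw(encode);

my $decomposed = "a\x{0300}";        # "a" + COMBINING GRAVE ACCENT
my $composed   = NFC($decomposed);   # one character: "\x{00E0}" ("à")

# encode() returns the UTF-8 bytes for the character string.
my $bytes = encode("UTF-8", $composed);

printf "%d character(s), %d byte(s)\n", length($composed), length($bytes);
# prints "1 character(s), 2 byte(s)"
```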
------------
I have tested this code only on a very short sample of data. I am not
at all sure that it will work in every case...
If you want to call this kind of Perl script from within your
AppleScript script, you would do something like the following:
First, the lines that read the input file (from "local undef $/;"
down to "close (IN);") must be commented out (with a "#" at the
beginning of each line), and a new line must be added after them:
$_ = shift;
Now, save all this in a script file, for example
"iso6937toUnicode.pl". In your AppleScript code, you will make a
variable in which you will store your data:
set my_str to ... -- for example "Fiquei aqui ¡a toa. S¬o paguei pelo jatoäpara ganharmos tempo."
now, you will do:
set perl_path to POSIX path of ... -- the path to your "iso6937toUnicode.pl"
try
set res to do shell script "/usr/bin/perl " & quoted form of perl_path & space & quoted form of my_str
on error errMsg
display dialog errMsg
return
end try
Now, in the variable "res", you *should* have the resulting converted
string in UTF-8...
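If you want one script file that works both ways, the input could be taken from the first argument when there is one, and from the file otherwise. This is only a sketch, and the sub name "get_input" is made up:

```perl
use strict;
use warnings;

# get_input is a hypothetical helper: it returns the first command-line
# argument if one was given, and otherwise the whole content of $path.
sub get_input {
    my ($args, $path) = @_;
    return shift @$args if @$args;
    open(my $in, "<", $path) or die "Cannot open $path: $!";
    local $/;            # slurp mode: read the whole file in one go
    my $data = <$in>;
    close($in);
    return $data;
}

# In the script itself this would be: $_ = get_input(\@ARGV, $infile);
my $from_arg = get_input(["S\xC2o"], "/no/such/file");
```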
-------
By the way, for iconv, you would do something like the following
(from within your AppleScript code); iconv reads its input from a
file or from standard input, so the string is piped to it:
set my_str to ... -- your data in ISO 8859/5, for example
try
set res to do shell script "printf %s " & quoted form of my_str & " | iconv -f ISO-8859-5 -t UTF-8"
on error errMsg
display dialog errMsg
return
end try
Note that I didn't do any testing... Please use this with caution!!
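If you prefer to stay in Perl, the core Encode module can do the same kind of conversion for the encodings it knows (ISO 8859-5 is one of them). This is a minimal sketch with made-up sample bytes:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two bytes in ISO 8859-5: 0xB0 and 0xD0 (CYRILLIC CAPITAL and SMALL LETTER A).
my $iso_bytes = "\xB0\xD0";

my $text = decode("ISO-8859-5", $iso_bytes);  # Perl character string
my $utf8 = encode("UTF-8", $text);            # the same text as UTF-8 bytes

printf "U+%04X U+%04X\n", map { ord($_) } split //, $text;
# prints "U+0410 U+0430"
```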
-------
This is all for now...
I hope all this is of some help to you. If you have problems,
please write to me (on or off-list).