Mailing Lists: Apple Mailing Lists


Re: Text interpretation/reading question



Hello Andreas,

On Feb 23, 2007, at 11:56 PM, Andreas Kiel wrote:

Many thanks Chris and Nobumi,

I succeeded in building a 'quick'n dirty' routine for Latin (ASCII) text using the table you supplied, by removing the Unicode stuff.
As I know nothing about Perl, is there any place where I can find a 'how to use the Perl module "Unicode::Normalize"'? I googled a little but didn't find anything helpful.
I will have a look at the iconv man page as well for the other standards.


Chris I am ambitious, but this 'you could try adding ISO 6937 to the system' is above my scope.

Again thanks to all for the tremendous help.

I tried to write a model of Perl code to convert a string in "ISO 6937/2-1983, Addendum 1-1989" to UTF-8; this is laborious, and I could only write a model, not the whole code...
Anyway...:


Preparation:
I think you will need UnicodeChecker (<http://earthlingsoft.net/UnicodeChecker/>) to get started.


Next, you will need to prepare two Perl hashes. A Perl hash looks like this: my %hash = ("a" => "z", "b" => "y");
Hereafter, I will use the form "\x20" for the one-byte encoding, and "\x{0020}" for the Unicode equivalent.
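
As a small illustration of the hash and of the two notations, here is a minimal sketch. The names "%sample" and "lookup" are made up for this example only; they are not part of the script further down.

```perl
use strict;
use warnings;

# A hash maps keys to values; here, single characters to their replacements.
my %sample = ("a" => "z", "b" => "y");

# Return the replacement for a character, or the character itself
# when the hash has no entry for it.
sub lookup {
    my ($char) = @_;
    return exists $sample{$char} ? $sample{$char} : $char;
}

# "\x20" and "\x{0020}" denote the same character (a space); the braced
# form is the one that can also express code points above 0xFF.
print "same character\n" if "\x20" eq "\x{0020}";

print lookup("a"), lookup("c"), "\n";
```

The same "look it up, otherwise keep it" idea is what the two hashes are used for later.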


1. For the one byte part, you will have to make a Perl hash like this:
my %others = ("\x24" => "\x{00A4}", "\xA4" => "\x{0024}", ...);
Explanation: in ISO 6937/2, the code 0x24 (normally "$") is used for the "general currency sign" (¤) (see <http://en.wikipedia.org/wiki/ISO_6937>);
The code "\xA4" ("§" in MacRoman) is used for "$" = \x{0024}, etc.
I didn't make the complete hash, because there are many code points. For code points like "\xE4", you will probably need UnicodeChecker to find the corresponding Unicode character: open the Find panel (CMD + F) in UnicodeChecker and type "Latin Capital Letter H"; many candidates will appear in the list below, among them "LATIN CAPITAL LETTER H WITH STROKE [U+0126]". This is the code you need: "\xE4" => "\x{0126}", and so on. This is probably the most cumbersome part.


2. For the range \x80-\x8A: you will have to make a Perl hash like the following:
my %controls = ("\x80" => "<i>", "\x81" => "</i>", ..., "\x8A" => "\n");
Explanation: these are the control characters described in the table on page 14 of the document "tec_doc_t3264_tcm6-10528_EBU_STL.pdf". I didn't make the complete hash; you will have to complete it. And of course you can use any other tags for "Italic ON", "Italic OFF", etc.


3. For the range \xC1-\xCF: you will have to prepare code lines like this:
my @aeiou = qw (A E I O U a e i o u);
my $uni_grave = "\x{0300}";
# Grave 0xC1 0x0300 AEIOUaeiou
Explanation: each of the code points in this range accepts only a defined series of characters with which it can be combined. For example, "\xC1", the grave accent, accepts only one of the series "AEIOUaeiou". The line "my @aeiou = qw (A E I O U a e i o u);" makes a Perl list variable named "@aeiou", storing each character of the series in question. The line "my $uni_grave = "\x{0300}";" makes a Perl scalar variable named "$uni_grave", storing the Unicode COMBINING GRAVE ACCENT, which is "\x{0300}". You can find this using UnicodeChecker, by entering "combining grave" in the Find field. The third line "# Grave 0xC1 0x0300 AEIOUaeiou" is a simple comment, so that we do not forget what we are doing...
You will have to make similar lines, for example:
my @aceilnorsuyz = qw (A C E I L N O R S U Y Z a c e i l n o r s u y z);
my $uni_acute = "\x{0301}";
# Acute 0xC2 0x0301 ACEILNORSUYZaceilnorsuyz


my @aceghijosuwy = qw (A C E G H I J O S U W Y a c e g h i j o s u w y);
my $uni_circumflex = "\x{0302}";
# Circumflex	0xC3	0x0302	ACEGHIJOSUWYaceghijosuwy

...
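
Since every accent in the range \xC1-\xCF follows the same pattern, the prepared lines could perhaps also be kept in one table instead of one pair of variables per accent. This is only a sketch of that idea: the names "%diacritics" and "combine_diacritics" are invented for this example, and only the three accents listed above are filled in.

```perl
use strict;
use warnings;

# Accent byte => [Unicode combining character, letters allowed after the byte].
# Only grave, acute and circumflex are listed; the rest remains to be added.
my %diacritics = (
    "\xC1" => ["\x{0300}", "AEIOUaeiou"],                # grave
    "\xC2" => ["\x{0301}", "ACEILNORSUYZaceilnorsuyz"],  # acute
    "\xC3" => ["\x{0302}", "ACEGHIJOSUWYaceghijosuwy"],  # circumflex
);

sub combine_diacritics {
    my ($text) = @_;
    for my $lead (keys %diacritics) {
        my ($combining, $letters) = @{ $diacritics{$lead} };
        # ISO 6937 writes the accent before the letter;
        # Unicode writes the combining character after it.
        $text =~ s/\Q$lead\E([$letters])/$1$combining/g;
    }
    return $text;
}

print combine_diacritics("\xC1a") eq "a\x{0300}" ? "ok\n" : "not ok\n";
```

With a table like this, adding an accent means adding one line rather than a new list, a new scalar and a new loop.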

Now, here is the code I wrote:


#!/usr/bin/perl

use strict;
use warnings;

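local $/; # undefine the input record separator, so that <IN> reads the whole file at once
my $infile = "/Users/[my_account]/Desktop/test.txt"; # this is the path of the file storing data like this:
# "Fiquei aqui ¡a toa. S¬o paguei pelo jatoäpara ganharmos tempo."
open (IN, "<", $infile) or die "Cannot open $infile: $!";
binmode (IN); # read the file as raw bytes
$_ = <IN>;
close (IN);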


my @aeiou = qw (A E I O U a e i o u); # see above, "Preparation (3)"
my $uni_grave = "\x{0300}";
# Grave	0xC1	0x0300	AEIOUaeiou

my @aceilnorsuyz = qw (A C E I L N O R S U Y Z a c e i l n o r s u y z);
my $uni_acute = "\x{0301}";
# Acute	0xC2	0x0301	ACEILNORSUYZaceilnorsuyz

my @aceghijosuwy = qw (A C E G H I J O S U W Y a c e g h i j o s u w y);
my $uni_circumflex = "\x{0302}";
# Circumflex	0xC3	0x0302	ACEGHIJOSUWYaceghijosuwy

my %others = ("\x24" => "\x{00A4}", "\xA4" => "\x{0024}"); # see above, "Preparation (1)"
my %controls = ("\x80" => "<i>", "\x81" => "</i>", "\x8A" => "\n"); # see above, "Preparation (2)"


foreach my $key (@aeiou) { # see explanation (1) below
	s/\xC1$key/$key$uni_grave/gs;
}
s/\xC1\x20/\x{0060}/sg; # see explanation (2) below

foreach my $key (@aceilnorsuyz) { # for acute accent
	s/\xC2$key/$key$uni_acute/gs;
}
s/\xC2\x20/\x{00B4}/gs;

foreach my $key (@aceghijosuwy) { # for circumflex accent
	s/\xC3$key/$key$uni_circumflex/gs;
}
s/\xC3\x20/\x{02C6}/gs;

s/(.)/my $temp = $1; exists $others{$temp} ? $others{$temp}: $temp/eg; # see explanation (3) below

s/(.)/my $temp = $1; exists $controls{$temp} ? $controls{$temp}: $temp/eg; # see explanation (4) below


use Unicode::Normalize; # see explanation (5) below
$_ = NFC($_);

binmode (STDOUT, ":utf8"); # see explanation (6) below
print;
##### end of script ######

In this script, we read the data in the file "test.txt"; it is stored in the special variable "$_": the "default scalar" variable.

I pasted the prepared lines for the range "\xC1-\xC3" (you will have to add those from "\xC4" to "\xCF"); I also pasted the prepared hashes "%others" and "%controls". Then, the conversion begins:

Explanation:
1. The lines:
foreach my $key (@aeiou) {
s/\xC1$key/$key$uni_grave/gs;
}
form a loop over the list variable @aeiou. Each item of the list is assigned in turn to the variable "$key"; the second line does a global find and replace: it searches for "\xC1a", "\xC1e", etc., and if it finds a matching string, it replaces it with "a\x{0300}", "e\x{0300}", etc., that is, an "a" followed by a "COMBINING GRAVE ACCENT".


2. The next line:
s/\xC1\x20/\x{0060}/sg;
also does a global find and replace: it searches for "\xC1" + space, and replaces it with "`" (= "\x{0060}"). This is because <http://blogs.msdn.com/michkap/archive/2005/01/22/358675.aspx> says: "A diacritic as a free-standing character is created by coding a space behind the byte that represents the "diacritical mark"." For some of the replacing characters, you will have to search in the Unicode range "Spacing Modifier Letters" (\x{02B0}-\x{02FF}): for example, "Double Acute" is "\x{02DD}"...


3. The line "s/(.)/my $temp = $1; exists $others{$temp} ? $others{$temp}: $temp/eg;" does a global find and replace for characters that are stored in the hash variable "%others".

4. The line "s/(.)/my $temp = $1; exists $controls{$temp} ? $controls{$temp}: $temp/eg;" does a global find and replace for characters that are stored in the hash variable "%controls" (for example, "\x8A" will be replaced by a linefeed...).

5. The two lines:
use Unicode::Normalize; # see explanation (5) below
$_ = NFC($_);
"normalize" the combined diacritical characters (for example "a\x{0300}") into the precomposed equivalents (for example "à", which is "\x{00E0}") whenever this is possible.


6. Finally, the line "binmode (STDOUT, ":utf8");" tells Perl to encode as UTF-8 everything that is "printed" to the standard output.
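
To see what steps 5 and 6 do to a single character, here is a minimal sketch. It uses Encode::encode instead of binmode, only so that the resulting UTF-8 bytes can be inspected inside the script; the name "compose_and_encode" is invented for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);
use Encode qw(encode);

sub compose_and_encode {
    my ($text) = @_;
    # "a" followed by U+0300 becomes the single character U+00E0 where possible.
    my $composed = NFC($text);
    # Turn the character string into a string of UTF-8 bytes.
    return encode("UTF-8", $composed);
}

my $bytes = compose_and_encode("a\x{0300}");
printf "%d byte(s): %s\n", length($bytes), join(" ", map { sprintf "%02X", ord } split //, $bytes);
```

When the result is simply printed, the binmode line of the script should have the same effect on the output.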

------------

I have tested this code only on a very short sample of data. I am not sure at all if it will work for every case...
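
One case that may need attention: the script converts in several passes over the same string, so a character produced by an earlier pass (for example "\x{00B4}" for a free-standing acute) could be converted a second time by a later pass, if the completed "%others" hash happens to contain an entry for the byte with the same number ("\xB4"). A possible way around this is to convert each character exactly once, in a single pass. The following is only a sketch of that idea, not a tested replacement: the names "%accents" and "convert_once" are invented here, only three accents are handled, and it does not check which letters an accent is allowed to combine with.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Same partial tables as in the script above; they remain to be completed.
my %others   = ("\x24" => "\x{00A4}", "\xA4" => "\x{0024}");
my %controls = ("\x80" => "<i>", "\x81" => "</i>", "\x8A" => "\n");

# Accent byte => [combining form, free-standing form].
my %accents = (
    "\xC1" => ["\x{0300}", "\x{0060}"],  # grave
    "\xC2" => ["\x{0301}", "\x{00B4}"],  # acute
    "\xC3" => ["\x{0302}", "\x{02C6}"],  # circumflex
);

sub convert_once {
    my ($text) = @_;
    # Match either an accent byte plus the following character,
    # or any single character; each input character is consumed once.
    $text =~ s{([\xC1-\xC3])(.)|(.)}{
        defined($1)
            ? ($2 eq " " ? $accents{$1}[1] : $2 . $accents{$1}[0])
            : exists $others{$3}   ? $others{$3}
            : exists $controls{$3} ? $controls{$3}
            :                        $3
    }sge;
    return NFC($text);
}

binmode(STDOUT, ":utf8");
print convert_once("\x80\xC1a\x81 \xC2 \xA4"), "\n";
```

Because the replacement text is never scanned again, the output of one rule cannot be mistaken for the input of another.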

If you want to call this kind of Perl script from within your AppleScript script, you would do something like the following:

First, the lines at the top of the script that read the input file (from the line beginning with "local" through "close (IN);") must be commented out (with a "#" at the beginning of each line), and a new line must be added after them:

$_ = shift;

Now, save all this in a script file, for example "iso6937toUnicode.pl". In your AppleScript code, you will make a variable in which you will store your data:

set my_str to ... -- for example "Fiquei aqui ¡a toa. S¬o paguei pelo jatoäpara ganharmos tempo."

now, you will do:

set perl_path to POSIX path of ... -- the path to your "iso6937toUnicode.pl"

try
set res to do shell script "/usr/bin/perl " & quoted form of perl_path & space & quoted form of my_str
on error errMsg
display dialog errMsg
return
end try


Now, in the variable "res", you *should* have the resulting converted string in UTF-8...

-------

By the way, for iconv, you would do something like the following (from within your AppleScript code):

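set my_file to ... -- a file whose contents are in ISO 8859/5, for example
-- iconv expects a file name (or standard input), not the text itself
try
set res to do shell script "iconv -f ISO-8859-5 -t UTF-8 " & quoted form of POSIX path of my_file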
on error errMsg
display dialog errMsg
return
end try


Note that I didn't do any testing...  Please use this with caution!!
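
If you prefer to stay in Perl, the same kind of conversion can probably be done with the standard Encode module instead of iconv. This is a minimal sketch under that assumption; the function name "iso8859_5_to_utf8" is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

sub iso8859_5_to_utf8 {
    my ($bytes) = @_;
    # Interpret the input bytes as ISO 8859-5, giving a string of characters...
    my $chars = decode("ISO-8859-5", $bytes);
    # ...then write those characters out as UTF-8 bytes.
    return encode("UTF-8", $chars);
}

# Byte 0xB0 is CYRILLIC CAPITAL LETTER A (U+0410) in ISO 8859-5.
my $utf8 = iso8859_5_to_utf8("\xB0");
print join(" ", map { sprintf "%02X", ord } split //, $utf8), "\n";
```

Other encodings known to Encode can be used by changing the first argument of decode.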

-------

This is all for now...

I hope all this is of some help for you. If you have problems, please write me (on or off-list).

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

_______________________________________________
Do not post admin requests to the list. They will be ignored.
Applescript-studio mailing list      (email@hidden)
Copyright © 2007 Apple Inc. All rights reserved.