Hello,
On Feb 23, 2007, at 3:41 AM, Christopher Nebel wrote:
On Feb 22, 2007, at 3:32 AM, Andreas Kiel wrote:
The file's text encoding is either ISO 6937/2-1983/Add.1:1989 (the
example I sent) or one of ISO 8859/5 through ISO 8859/8; there might
be more, since those do not cover eastern languages.
I just read the file (in chunks). The first 1024 bytes declare
the language, title etc.
The other blocks are each 128 bytes long, with the text in bytes 17-128.
The technical documentation for this format can be found at:
http://www.ebu.ch/CMSimages/en/tec_doc_t3264_tcm6-10528.pdf
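For what it's worth, the chunked reading described above could be sketched in Python roughly as follows. The constant and function names are made up for illustration, and the offset assumes that "17-128" counts bytes from 1, so the text would occupy the last 112 bytes of each block.

```python
from typing import BinaryIO, Iterator

HEADER_SIZE = 1024  # first block: declares language, title etc.
BLOCK_SIZE = 128    # every block after the header
TEXT_START = 16     # assumption: "17-128" is 1-based, so the text begins at index 16


def read_header(stream: BinaryIO) -> bytes:
    """Read the 1024-byte header block from the start of the stream."""
    header = stream.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE:
        raise ValueError("file is shorter than the 1024-byte header")
    return header


def iter_text_fields(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the raw (still encoded) text bytes of each 128-byte block."""
    while True:
        block = stream.read(BLOCK_SIZE)
        if len(block) < BLOCK_SIZE:
            break  # end of file, or a truncated trailing block
        yield block[TEXT_START:]
```

Call read_header() first and then iterate over iter_text_fields() on the same open binary file; the bytes it yields still have to be decoded from whichever encoding the header declares.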
I should have mentioned this before, but the other tool you may
find useful is iconv(1), which can convert between various text
encodings. It knows the entire ISO 8859 family, as well as the
various MS-DOS CPn encodings which I see the specification also
calls for. Unfortunately, neither it nor its relative piconv(1)
seems to know about ISO 6937. However, ISO 6937 looks pretty
simple; it wouldn't be difficult to write a 6937-to-Unicode
converter, though personally I'd rather not do it in AppleScript.
(If you're really ambitious, you could try adding ISO 6937 to the
system; the iconv library is part of GNU.)
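As a rough illustration of the same kind of conversion, here is a small Python sketch that uses Python's built-in codecs as a stand-in for iconv(1). The function name decode_legacy is hypothetical, and the codec names are Python's, not iconv's.

```python
def decode_legacy(raw: bytes, encoding: str) -> str:
    """Decode bytes from a legacy 8-bit encoding into a Unicode string.

    `encoding` is a Python codec name, for example "iso8859_5",
    "iso8859_6", "iso8859_7", "iso8859_8" or "cp850".
    Bytes that the codec leaves undefined become U+FFFD instead of
    raising an error.
    """
    return raw.decode(encoding, errors="replace")


# The same six bytes mean different things under different encodings,
# so the encoding declared in the file has to be known first.
cyrillic = decode_legacy(b"\xbf\xe0\xd8\xd2\xd5\xe2", "iso8859_5")
```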
I found this web page for ISO 6937:
<http://en.wikipedia.org/wiki/ISO_6937>
Unfortunately, this encoding does not seem all that simple. The page
says (I hope the unusual diacritical characters will pass through the
net correctly):
--------
The characters which are not represented in the primary set are
coded on two bytes. The first byte, the "non-spacing diacritical
mark", is followed by a letter from the base set, e.g.:
small e with acute accent (é) = [Acute]+e
In total 13 diacritical marks can be followed by the selected
characters from the primary set:
Accent       Code  Second character          Result
Grave        0xC1  AEIOUaeiou                ÀÈÌÒÙàèìòù
Acute        0xC2  ACEILNORSUYZaceilnorsuyz  ÁĆÉÍĹŃÓŔŚÚÝŹáćéíĺńóŕśúýź
Circumflex   0xC3  ACEGHIJOSUWYaceghijosuwy  ÂĈÊĜĤÎĴÔŜÛŴŶâĉêĝĥîĵôŝûŵŷ
Tilde        0xC4  AINOUainou                ÃĨÑÕŨãĩñõũ
Macron       0xC5  AEIOUaeiou                ĀĒĪŌŪāēīōū
Breve        0xC6  AGUagu                    ĂĞŬăğŭ
Dot          0xC7  CEGIZcegiz                ĊĖĠİŻċėġıż
Umlaut       0xC8  AEIOUYaeiouy              ÄËÏÖÜŸäëïöüÿ
Ring         0xCA  AUau                      ÅŮåů
Cedilla      0xCB  CGKLNRSTcgklnrst          ÇĢĶĻŅŖŞŢçģķļņŗşţ
DoubleAcute  0xCD  OUou                      ŐŰőű
Ogonek       0xCE  AEIUaeiu                  ĄĘĮŲąęįų
Caron        0xCF  CDELNRSTZcdelnrstz        ČĎĚĽŇŘŠŤŽčďěľňřšťž
-------
So if you have, for example, the byte sequence C1+41, you will have to
convert it to U+0041 U+0300 (that is, LATIN CAPITAL LETTER A
followed by COMBINING GRAVE ACCENT), swapping the order of the
diacritical mark and the character to which it is applied, and
then probably "normalize" the resulting combining sequence
to a precomposed Unicode character (for this part, you can perhaps
use the Perl module "Unicode::Normalize" or something like it).
This is certainly not impossible, but it would take a somewhat
laborious routine...
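To make the idea a bit more concrete, here is a minimal sketch in Python rather than Perl or AppleScript. It is only an illustration under some simplifying assumptions: bytes below 0x80 are treated as plain ASCII, the two-byte sequences follow the accent codes in the table quoted above, and every other byte from 0x80 upward is replaced with U+FFFD because this sketch has no mapping for it. The names COMBINING_MARKS and decode_iso6937_accents are made up for the example.

```python
import unicodedata

# ISO 6937 non-spacing lead byte -> Unicode combining mark
# (lead byte values taken from the table quoted above).
COMBINING_MARKS = {
    0xC1: "\u0300",  # grave
    0xC2: "\u0301",  # acute
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC5: "\u0304",  # macron
    0xC6: "\u0306",  # breve
    0xC7: "\u0307",  # dot above
    0xC8: "\u0308",  # diaeresis (umlaut)
    0xCA: "\u030A",  # ring above
    0xCB: "\u0327",  # cedilla
    0xCD: "\u030B",  # double acute
    0xCE: "\u0328",  # ogonek
    0xCF: "\u030C",  # caron
}


def decode_iso6937_accents(data: bytes) -> str:
    """Convert ASCII plus ISO 6937 accent sequences to precomposed Unicode."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        mark = COMBINING_MARKS.get(byte)
        has_ascii_base = i + 1 < len(data) and data[i + 1] < 0x80
        if mark is not None and has_ascii_base:
            # ISO 6937 puts the accent first; Unicode wants base, then mark.
            out.append(chr(data[i + 1]) + mark)
            i += 2
        elif byte < 0x80:
            out.append(chr(byte))  # assumption: lower half treated as ASCII
            i += 1
        else:
            out.append("\ufffd")   # not handled by this sketch
            i += 1
    # NFC composes base + combining mark into one character where one exists.
    return unicodedata.normalize("NFC", "".join(out))
```

With this, the bytes C1 41 come out as the single character À; a pair that has no precomposed form in Unicode simply stays as base letter plus combining mark.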
Best regards,
Nobumi Iyanaga
Tokyo,
Japan