On Feb 23, 2007, at 3:41 AM, Christopher Nebel wrote:
On Feb 22, 2007, at 3:32 AM, Andreas Kiel wrote:
The file's text encoding is either ISO 6937/2-1983/Add.1:1989 (the
example I sent) or one of ISO 8859/5 through ISO 8859/8. There might
be more, since those do not cover eastern languages.
I just read the file (in chunks). The first 1024 bytes declare the
language, title etc.
The other blocks each have a length of 128 bytes, and the text
occupies bytes 17-128.
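
The layout just described (a 1024-byte header followed by 128-byte blocks with the text in bytes 17-128) could be read along these lines. This is only a sketch in Python: the names HEADER_SIZE, BLOCK_SIZE, TEXT_START and read_text_fields are made up for illustration, and nothing here interprets the header or the first 16 bytes of a block.

```python
import io

HEADER_SIZE = 1024  # first block: language, title etc.
BLOCK_SIZE = 128    # size of every following block
TEXT_START = 16     # byte 17 when counting from 1


def read_text_fields(stream):
    """Return (header, list of raw text fields), one field per 128-byte block."""
    header = stream.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE:
        raise ValueError("file is shorter than the 1024-byte header")
    fields = []
    while True:
        block = stream.read(BLOCK_SIZE)
        if len(block) < BLOCK_SIZE:
            break  # end of file, or a truncated trailing block that is ignored
        fields.append(block[TEXT_START:BLOCK_SIZE])
    return header, fields


# Synthetic data standing in for a real file: empty header, one block of "A"s.
sample = io.BytesIO(bytes(HEADER_SIZE) + bytes(TEXT_START) + b"A" * 112)
header, fields = read_text_fields(sample)
```

For a real file, pass open(path, "rb") instead of the BytesIO object. The text fields are still raw bytes at this point and need decoding according to the encoding declared in the header.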
I should have mentioned this before, but the other tool you may
find useful is iconv(1), which can convert between various text
encodings. It knows the entire ISO 8859 family, as well as the
various MS-DOS CPn encodings which I see the specification also
calls for. Unfortunately, neither it nor its relative piconv(1)
seems to know about ISO 6937. However, ISO 6937 looks pretty
simple; it wouldn't be difficult to write a 6937-to-Unicode
converter, though personally I'd rather not do it in AppleScript.
(If you're really ambitious, you could try adding ISO 6937 to the
system; the iconv library is part of GNU.)
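
For the encodings that are already known, the same kind of conversion iconv(1) performs can also be done from a script. The sketch below uses Python's built-in codecs rather than iconv itself; the codec names are Python's own spellings, and the CANDIDATE_CODECS list and decode_text helper are hypothetical names chosen for this example.

```python
# Python codec names for ISO 8859/5 through 8859/8 and two MS-DOS code pages.
# Which code pages a given file really uses is an assumption; check the
# specification and the file header.
CANDIDATE_CODECS = [
    "iso8859_5",  # Latin/Cyrillic
    "iso8859_6",  # Latin/Arabic
    "iso8859_7",  # Latin/Greek
    "iso8859_8",  # Latin/Hebrew
    "cp437",
    "cp850",
]


def decode_text(raw, codec):
    """Decode raw bytes to a Unicode string, marking undecodable bytes with U+FFFD."""
    return raw.decode(codec, errors="replace")


# Six bytes of ISO 8859/5 Cyrillic text, used as a small example.
cyrillic = decode_text(b"\xbf\xe0\xd8\xd2\xd5\xe2", "iso8859_5")
```

This covers only the encodings that have a ready-made codec; ISO 6937 would still need a converter of its own.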
Unfortunately, this encoding does not seem very simple. It is said (I
hope the strange diacritical characters will pass through the net
correctly):
--------
The characters which are not represented in the primary set are coded
on two bytes. The first byte the "non spacing diacritical mark" is
followed by a letter from the base set e.g.:
small e with acute accent (é) = [Acute]+e
In total 13 diacritical marks can be followed by the selected
characters from the primary set:
So, if you have for example a sequence C1+41, you will have to
convert it to 0x0041+0x0300 (that is: LATIN CAPITAL LETTER A
+COMBINING GRAVE ACCENT), inverting the order of the combining
diacritical character and the character to which it is applied, and
then probably "normalize" the resulting combined Unicode character to
a precomposed Unicode character (for this part, you can perhaps use
the Perl module "Unicode::Normalize" or something like this).
This is certainly not impossible, but it would need a somewhat
laborious routine...
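
The routine described above (swap the diacritical byte and the base letter, then normalize) might look roughly like this in Python, with unicodedata.normalize playing the part suggested for Unicode::Normalize. It is a sketch under assumptions: only the C1 = grave pairing comes from the example above, the remaining entries of the table are my reading of the ISO 6937 code chart and should be checked against the specification, bytes below 0x80 are treated as plain ASCII, and every other high byte is replaced with U+FFFD instead of being mapped.

```python
import unicodedata

# Non-spacing diacritical byte -> Unicode combining character.
# Assumption: taken from my reading of the ISO 6937 chart, not from the thread.
COMBINING = {
    0xC1: "\u0300",  # grave
    0xC2: "\u0301",  # acute
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC5: "\u0304",  # macron
    0xC6: "\u0306",  # breve
    0xC7: "\u0307",  # dot above
    0xC8: "\u0308",  # diaeresis
    0xCA: "\u030a",  # ring above
    0xCB: "\u0327",  # cedilla
    0xCD: "\u030b",  # double acute
    0xCE: "\u0328",  # ogonek
    0xCF: "\u030c",  # caron
}


def iso6937_to_unicode(raw):
    """Convert ISO 6937 bytes to a precomposed (NFC) Unicode string."""
    out = []
    i = 0
    while i < len(raw):
        byte = raw[i]
        mark = COMBINING.get(byte)
        if mark is not None and i + 1 < len(raw) and raw[i + 1] < 0x80:
            # Invert the order: base letter first, combining mark second.
            out.append(chr(raw[i + 1]) + mark)
            i += 2
        elif byte < 0x80:
            out.append(chr(byte))  # simplification: treated as ASCII
            i += 1
        else:
            out.append("\ufffd")   # unmapped high byte or dangling mark
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


converted = iso6937_to_unicode(b"\xc1A")  # the C1+41 sequence from above
```

A complete converter would also need the single-byte special characters from the upper half of the table, which this sketch leaves out.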