Mailing Lists: Apple Mailing Lists


Re: Text interpretation/reading question



Hello,

On Feb 23, 2007, at 3:41 AM, Christopher Nebel wrote:


On Feb 22, 2007, at 3:32 AM, Andreas Kiel wrote:

The file's text encoding is either ISO 6937/2-1983/Add.1:1989 (the example I sent) or one of ISO 8859/5 through ISO 8859/8. There might be more, since those do not cover eastern languages.

I just read the file in chunks. The first 1024 bytes declare the language, title, etc.
The other blocks are 128 bytes long, with the text in bytes 17-128.
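
The chunked read described above could be sketched in Python roughly as follows. This is only an illustration of the layout as stated here (a 1024-byte header, then 128-byte blocks with text in bytes 17-128); the function name and constants are hypothetical, and no other fields of the format are interpreted.

```python
import io

HEADER_SIZE = 1024  # first block: language, title, etc.
BLOCK_SIZE = 128    # size of every following block
TEXT_START = 16     # 0-based offset of byte 17, where the text begins


def read_text_fields(stream):
    """Return (header, list of raw text fields) from a binary stream."""
    header = stream.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE:
        raise ValueError("file is shorter than the 1024-byte header")
    fields = []
    while True:
        block = stream.read(BLOCK_SIZE)
        if len(block) < BLOCK_SIZE:
            break  # end of file (or a truncated trailing block)
        # Bytes 17-128 (1-based) hold the still-undecoded text.
        fields.append(block[TEXT_START:BLOCK_SIZE])
    return header, fields


# Usage with an in-memory stand-in for a real file:
sample = io.BytesIO(b"H" * 1024 + (bytes(16) + b"x" * 112) * 2)
header, fields = read_text_fields(sample)
```

The text fields come back as raw bytes; decoding them depends on which encoding the header declares.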


The tec doc about this format can be found at:
http://www.ebu.ch/CMSimages/en/tec_doc_t3264_tcm6-10528.pdf

I should have mentioned this before, but the other tool you may find useful is iconv(1), which can convert between various text encodings. It knows the entire ISO 8859 family, as well as the various MS-DOS CPn encodings which I see the specification also calls for. Unfortunately, neither it nor its relative piconv(1) seems to know about ISO 6937. However, ISO 6937 looks pretty simple; it wouldn't be difficult to write a 6937-to-Unicode converter, though personally I'd rather not do it in AppleScript. (If you're really ambitious, you could try adding ISO 6937 to the system; the iconv library is part of GNU.)
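
For the encodings that are already supported, the same kind of conversion can also be done without iconv. The sketch below uses Python's built-in codecs rather than iconv itself; the helper name is hypothetical, and the sample bytes are made up for illustration.

```python
def to_unicode(data, encoding):
    """Decode raw bytes with one of Python's standard codecs.

    Undecodable bytes become U+FFFD instead of raising an error.
    """
    return data.decode(encoding, errors="replace")


# ISO 8859-5 (Cyrillic) and an MS-DOS code page, as Python codec names.
cyrillic = to_unicode(b"\xbf\xe0\xd8\xd2\xd5\xe2", "iso8859_5")
dos_text = to_unicode(b"caf\x82", "cp850")

# Re-encode as UTF-8 if bytes are needed for writing to a file.
utf8_bytes = cyrillic.encode("utf-8")
```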

I found this web page for ISO 6937: <http://en.wikipedia.org/wiki/ISO_6937>

Unfortunately, this encoding does not seem to be very simple. The page says (I hope the strange diacritical characters will pass through the net correctly):
--------


The characters which are not represented in the primary set are coded on two bytes. The first byte, the "non-spacing diacritical mark", is followed by a letter from the base set, e.g.:

small e with acute accent (é) = [Acute]+e

In total 13 diacritical marks can be followed by the selected characters from the primary set:

Accent       Code  Second character          Result
Grave        0xC1  AEIOUaeiou                ÀÈÌÒÙàèìòù
Acute        0xC2  ACEILNORSUYZaceilnorsuyz  ÁĆÉÍĹŃÓŔŚÚÝŹáćéíĺńóŕśúýź
Circumflex   0xC3  ACEGHIJOSUWYaceghijosuwy  ÂĈÊĜĤÎĴÔŜÛŴŶâĉêĝĥîĵôŝûŵŷ
Tilde        0xC4  AINOUainou                ÃĨÑÕŨãĩñõũ
Macron       0xC5  AEIOUaeiou                ĀĒĪŌŪāēīōū
Breve        0xC6  AGUagu                    ĂĞŬăğŭ
Dot          0xC7  CEGIZcegiz                ĊĖĠİŻċėġıż
Umlaut       0xC8  AEIOUYaeiouy              ÄËÏÖÜŸäëïöüÿ
Ring         0xCA  AUau                      ÅŮåů
Cedilla      0xCB  CGKLNRSTcgklnrst          ÇĢĶĻŅŖŞŢçģķļņŗşţ
DoubleAcute  0xCD  OUou                      ŐŰőű
Ogonek       0xCE  AEIUaeiu                  ĄĘĮŲąęįų
Caron        0xCF  CDELNRSTZcdelnrstz        ČĎĚĽŇŘŠŤŽčďěľňřšťž


-------

So if you have, for example, the byte sequence 0xC1 0x41, you will have to convert it to U+0041 U+0300 (that is, LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT), swapping the order of the diacritical mark and the letter to which it applies, because Unicode puts combining marks after the base character. You would then probably want to "normalize" the resulting combining sequence to a precomposed Unicode character (here U+00C0, LATIN CAPITAL LETTER A WITH GRAVE); for this part, you can perhaps use the Perl module "Unicode::Normalize" or something similar.
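
To make the idea concrete, here is a minimal sketch of the swap-and-normalize step in Python, using the standard unicodedata module in place of Perl's Unicode::Normalize. It covers only the 13 diacritical lead bytes from the table above. The function name is hypothetical, bytes below 0x80 are assumed to be plain ASCII, and every other high byte is simply replaced with U+FFFD, so this is not a complete ISO 6937 decoder.

```python
import unicodedata

# ISO 6937 non-spacing diacritic byte -> Unicode combining character.
COMBINING = {
    0xC1: "\u0300",  # grave
    0xC2: "\u0301",  # acute
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC5: "\u0304",  # macron
    0xC6: "\u0306",  # breve
    0xC7: "\u0307",  # dot above
    0xC8: "\u0308",  # umlaut (diaeresis)
    0xCA: "\u030A",  # ring above
    0xCB: "\u0327",  # cedilla
    0xCD: "\u030B",  # double acute
    0xCE: "\u0328",  # ogonek
    0xCF: "\u030C",  # caron
}


def decode_iso6937(data):
    """Decode bytes, handling only ASCII and diacritic+letter pairs."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        has_base = i + 1 < len(data) and data[i + 1] < 0x80
        if byte in COMBINING and has_base:
            # Swap the order: base letter first, combining mark second.
            out.append(chr(data[i + 1]) + COMBINING[byte])
            i += 2
        elif byte < 0x80:
            out.append(chr(byte))  # assumption: treated as plain ASCII
            i += 1
        else:
            out.append("\ufffd")   # high bytes not handled in this sketch
            i += 1
    # NFC composes base + mark into a precomposed character where one exists.
    return unicodedata.normalize("NFC", "".join(out))


text = decode_iso6937(b"caf\xc2e")  # 0xC2 + "e" becomes U+00E9
```

Note that normalization only composes pairs for which Unicode defines a precomposed character. A pair such as Dot + i stays as "i" followed by U+0307, so entries like that in the table would need special handling.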

This is certainly not impossible, but it would take a somewhat laborious routine...

Best regards,

Nobumi Iyanaga
Tokyo,
Japan
