Re: Problems choosing an encoding for Word generated html
- Subject: Re: Problems choosing an encoding for Word generated html
- From: Michael Ash <email@hidden>
- Date: Mon, 1 Jun 2009 11:28:01 -0400
On Sun, May 31, 2009 at 7:08 PM, Ken Tozier <email@hidden> wrote:
> Hi
>
> I wrote an app that converts Word files into a simpler format by first
> converting from .doc to html using scripting and Word's "Save as Web page"
> command followed by using NSXMLDocument to extract the parts I need. I'm
> finding that there are no good options when it comes to choosing a character
> encoding for the saved html (this is set in Word) because it uses some
> custom tags to embed special characters like bullets, and UTF-8 chokes on
> those.
>
> My basic process is to
> - Use AppleScript to tell Word to convert from .doc to html and save as
> utf-8
> - Read the resultant file into an NSString with NSUTF8StringEncoding
>
> I've tried saving the html from Word as NSLatin1Encoding but many important
> characters like double-quotes, apostrophes, dashes etc are translated to cap
> "O's" with various diacritical marks.
>
> Not really sure how to proceed as there doesn't seem to be a single encoding
> useable by NSString that will both translate the quotes and allow me to
> access Word's "special" characters. Anyone have any ideas how I can read the
> html and treat it as a mostly normal character string without resorting to a
> custom binary character translation class?

UTF-8 shouldn't choke on anything. It is a universal character
encoding. It's vaguely possible that Word uses some custom characters
that aren't even in Unicode, but if it does, those characters won't be
in any *other* encoding either, so they wouldn't work regardless.

Can you elaborate on just what choosing UTF-8 produces and how it fails?
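
One way to answer that question is to look at the raw bytes of the saved
file rather than the decoded string. The sketch below is in Python only
because it is quick to run. The helper names (first_utf8_error,
decode_with_each) and the list of candidate encodings are made up for
illustration, and the MacRoman sample is an assumption about what Word
might have written, not something confirmed in this thread. It reports
where strict UTF-8 decoding fails and shows how the same bytes read under
a few other encodings.

```python
# Hypothetical helpers for diagnosing an encoding mismatch.
# The candidate list is a guess at encodings Word might plausibly use.
CANDIDATES = ("utf-8", "mac_roman", "cp1252", "latin-1")


def first_utf8_error(data: bytes):
    """Return (offset, offending byte) of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # decoding is strict by default
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def decode_with_each(data: bytes):
    """Decode the same bytes under each candidate encoding for comparison."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # bytes are not valid in this encoding
    return results


# Assumption: curly quotes written as MacRoman bytes (0xD2 and 0xD3).
sample = "\u201cquoted\u201d".encode("mac_roman")

if __name__ == "__main__":
    print(first_utf8_error(sample))
    for name, text in decode_with_each(sample).items():
        print(name, repr(text))
```

If the file was really written as MacRoman, curly quotes stored as bytes
0xD2 and 0xD3 come out as Ò and Ó when read as Latin-1, which would match
the capital O's with diacritical marks described above.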

In any case, this is probably more of a Word question than a Cocoa
question, and I imagine you'd get better answers somewhere where
people are knowledgeable about Word.

Mike
_______________________________________________
Cocoa-dev mailing list (email@hidden)