Problems choosing an encoding for Word generated html
- Subject: Problems choosing an encoding for Word generated html
- From: Ken Tozier <email@hidden>
- Date: Sun, 31 May 2009 19:08:19 -0400
Hi
I wrote an app that converts Word files into a simpler format. It first
converts from .doc to HTML by scripting Word's "Save as Web Page"
command, then uses NSXMLDocument to extract the parts I need. I'm
finding that there are no good options when it comes to choosing a
character encoding for the saved HTML (this is set in Word), because
Word uses some custom tags to embed special characters like bullets,
and reading the file as UTF-8 chokes on those.
My basic process is:
- Use AppleScript to tell Word to convert from .doc to HTML and save
as UTF-8
- Read the resultant file into an NSString with NSUTF8StringEncoding
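The read step above can be sketched in a language-neutral way. The snippet below is only an illustration in Python, not the app's actual Cocoa code: the function name decode_word_html and the fallback order are assumptions made for the example. It looks for a charset declaration near the top of the saved HTML, tries that encoding first, and falls back to a short list of likely candidates if decoding fails.

```python
import re

def decode_word_html(raw, fallbacks=("utf-8", "cp1252", "mac_roman")):
    """Return (text, encoding_used) for the raw bytes of a saved HTML file.

    Hypothetical helper: the name and the fallback order are assumptions.
    """
    # A <meta> charset declaration, if present, sits near the top of the file.
    match = re.search(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)",
                      raw[:4096], re.IGNORECASE)
    candidates = []
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.extend(fallbacks)

    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue

    # Latin-1 maps every byte, so this cannot fail, but it may mis-map characters.
    return raw.decode("latin-1"), "latin-1"
```

On the Cocoa side the same idea would mean reading the file as NSData and trying several NSStringEncoding values in turn, rather than committing to NSUTF8StringEncoding up front.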
I've also tried saving the HTML from Word and reading it as Latin-1
(NSISOLatin1StringEncoding), but many important characters such as
double quotes, apostrophes and dashes are translated to capital "O"s
with various diacritical marks.
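One possible explanation for the capital "O"s, offered as an assumption and not verified against the actual files, is that the bytes on disk are really Mac OS Roman. In that encoding the curly quotes occupy bytes 0xD2 through 0xD5, which Latin-1 displays as accented capital O's. A small Python sketch of that effect and of how to reverse it:

```python
# Curly double and single quotes, as Word typically inserts them.
curly = "\u201c\u201d\u2018\u2019"

# Assumption for illustration: the file was written as Mac OS Roman.
raw = curly.encode("mac_roman")

# Reading those same bytes as Latin-1 yields accented capital O's.
garbled = raw.decode("latin-1")
print(garbled)  # ÒÓÔÕ

# Re-encoding as Latin-1 restores the original bytes, so decoding them
# with the right encoding recovers the quotes.
repaired = garbled.encode("latin-1").decode("mac_roman")
print(repaired == curly)  # True
```

If this is what is happening, reading the file with NSMacOSRomanStringEncoding instead of Latin-1 should bring the quotes and dashes back.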
I'm not really sure how to proceed, as there doesn't seem to be a single
encoding usable by NSString that will both translate the quotes and
allow me to access Word's "special" characters. Does anyone have any
ideas on how I can read the HTML and treat it as a mostly normal
character string without resorting to a custom binary character
translation class?
Thanks for any help
_______________________________________________
Cocoa-dev mailing list (email@hidden)