Re: Convert MS Word to HTML
- Subject: Re: Convert MS Word to HTML
- From: Federico <email@hidden>
- Date: Sat, 15 Nov 2003 13:26:17 +0100
On Wednesday, 12 Nov 2003, at 11:17 Europe/Rome, Mats-Olof Liljegren
wrote:
> Problem:
> I need to write an AppleScript that takes an MS Word document from one
> directory, opens it, and saves it as HTML in a different location. I
> don't know how to make this happen.
The problem is that Word's HTML output is often "a bit" messy.
However, if you can teach the teachers to format a Word document using
styles (e.g. Heading 1, Heading 2, etc.) you can get decently structured
HTML.
Then you can use Tidy [1] to convert the HTML document to XHTML and use
XML-related technologies to parse the document and filter out the mess.
I use this technique every Thursday to manage a client's e-mail
newsletter: he sends me a rather long Word document that I need to
convert to HTML.
I can't just take the document and "Save as web page" directly since
the document is built by copying and pasting from different sources so
the formatting is rather messy.
Nor can I work from a plain-text version: I'd need to manually
re-apply almost all of the formatting, and I'd also lose all the links.
So I do some cleanup first: titles that were made just by applying a
larger font to the text get the appropriate style (Heading 1, 2, or 3),
and I insert section breaks (I use them to "automagically" build a
table of contents).
After the cleanup the document is saved as HTML and converted to XHTML
with Tidy.
The XHTML document is then filtered with xsltproc [2] with a custom
stylesheet.
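The two command-line steps above could be scripted roughly as follows. This is only a sketch: the function names, the ".xhtml" intermediate file and the choice of Tidy options are assumptions for illustration, not the exact commands used for the newsletter. The -numeric option is there so that entities such as &nbsp; come out in numeric form, which an XML tool can read without the XHTML DTD.

```python
import subprocess
from pathlib import Path


def tidy_command(word_html, xhtml):
    """Build the Tidy call that rewrites Word's HTML as XHTML."""
    return [
        "tidy",
        "-q",          # quiet: no summary banner
        "-asxhtml",    # write well-formed XHTML
        "-numeric",    # numeric entities instead of named ones
        "-o", str(xhtml),
        str(word_html),
    ]


def xsltproc_command(stylesheet, xhtml, final_html):
    """Build the xsltproc call that applies the custom stylesheet."""
    return ["xsltproc", "-o", str(final_html), str(stylesheet), str(xhtml)]


def convert(word_html, stylesheet, final_html):
    """Run both steps; the intermediate XHTML sits next to the input."""
    xhtml = Path(word_html).with_suffix(".xhtml")
    # Tidy exits with 1 when it only printed warnings, which is normal
    # for Word output, so only an exit status of 2 (errors) is fatal.
    tidy = subprocess.run(tidy_command(word_html, xhtml))
    if tidy.returncode > 1:
        raise RuntimeError("tidy could not repair " + str(word_html))
    subprocess.run(xsltproc_command(stylesheet, xhtml, final_html), check=True)
    return xhtml
```

Calling convert("newsletter.html", "newsletter.xsl", "out.html") (hypothetical file names) would leave newsletter.xhtml beside the input and write the filtered page to out.html.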
The stylesheet filters out all attributes (mostly classes and in-line
CSS) from every tag, converts all B and I tags to STRONG and EM, strips
any other tag, keeping only H1, H2, H3, P, STRONG, EM and A (the only
ones I need), and applies a header and footer. Since the document is
rather long, the stylesheet also builds a linked table of contents.
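To make the filtering rules concrete, here they are restated in Python with the standard library's ElementTree. This is not the XSLT stylesheet itself, only an illustration of the same rules. The names are hypothetical, and a few details are assumptions: href is kept on A so the links survive, each heading gets an id so the table of contents can point at it, the table of contents is emitted as plain P/A pairs, and the header and footer are left out.

```python
import xml.etree.ElementTree as ET

KEEP = {"h1", "h2", "h3", "p", "strong", "em", "a"}
RENAME = {"b": "strong", "i": "em"}
HEADINGS = {"h1", "h2", "h3"}


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on XHTML tags."""
    return tag.rsplit("}", 1)[-1].lower()


def append_text(parent, text):
    """Add text at the current end of parent's content."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def copy_filtered(src, dst):
    """Copy src's content into dst, keeping only whitelisted tags."""
    if src.text:
        append_text(dst, src.text)
    for child in src:
        name = local_name(child.tag)
        name = RENAME.get(name, name)        # B -> STRONG, I -> EM
        if name in KEEP:
            kept = ET.SubElement(dst, name)  # created without attributes
            href = child.get("href")
            if name == "a" and href:
                kept.set("href", href)       # assumption: links keep href
            copy_filtered(child, kept)
        else:
            copy_filtered(child, dst)        # drop the tag, keep its content
        if child.tail:
            append_text(dst, child.tail)


def filter_xhtml(xhtml_text):
    """Return a cleaned <body> with a linked table of contents on top."""
    root = ET.fromstring(xhtml_text)
    body = next(el for el in root.iter() if local_name(el.tag) == "body")
    clean = ET.Element("body")
    copy_filtered(body, clean)
    toc = []
    headings = [el for el in clean.iter() if el.tag in HEADINGS]
    for number, heading in enumerate(headings, start=1):
        anchor = "sec-%d" % number
        heading.set("id", anchor)
        entry = ET.Element("p")
        link = ET.SubElement(entry, "a", href="#" + anchor)
        link.text = "".join(heading.itertext()).strip()
        toc.append(entry)
    for position, entry in enumerate(toc):
        clean.insert(position, entry)
    return ET.tostring(clean, encoding="unicode")
```

Only the content of BODY is copied, so Word's STYLE blocks in HEAD never reach the output, and the default parser discards comments, which takes care of Word's conditional comments as well.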
[1] Tidy: http://tidy.sourceforge.net/
[2] xsltproc is an XSLT processor, available here:
http://www.xmlsoft.org/XSLT.html and also available with Fink.
XSLT is a language for transforming XML documents into other XML
documents:
http://www.w3.org/TR/xslt
http://www.zvon.org/xxl/XSLTreference/Output/index.html
--
Federico
_______________________________________________
applescript-users mailing list | email@hidden