Re: Convert MS Word to HTML


  • Subject: Re: Convert MS Word to HTML
  • From: Federico <email@hidden>
  • Date: Sat, 15 Nov 2003 13:26:17 +0100

On Wednesday, 12 Nov 2003, at 11:17 Europe/Rome, Mats-Olof Liljegren wrote:
Problem:
I need to make an AppleScript that takes an MS Word document from one directory, opens it, and saves it as HTML in a different location. I don't know how to make this happen.

The problem is that Word's HTML output is often "a bit" messy.

However, if you can teach the teachers to format a Word document using styles (e.g. Heading 1, Heading 2, etc.), you can get decently structured HTML.

Then you can use Tidy [1] to convert the HTML document to XHTML and use XML-related technologies to parse the document and filter out the mess.

I use this technique every Thursday to manage a client's e-mail newsletter: he sends me a rather long Word document that I need to convert to HTML.

I can't just take the document and "Save as web page" directly, since the document is built by copying and pasting from different sources, so the formatting is rather messy.

But I can't work from a plain-text version either: I'd need to manually re-apply almost all of the formatting, and I'd also lose all the links.

So I do some cleanup first: I convert titles that were made just by applying a larger font to the appropriate style (Heading 1, 2, or 3) and insert section breaks (I use them to "automagically" build a table of contents).

After the cleanup the document is saved as HTML and converted to XHTML with Tidy.

The XHTML document is then filtered with xsltproc [2] with a custom stylesheet.
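The two command-line steps above could be chained roughly like this. It is only a sketch, not my actual script: the function names and file names are made up for illustration, and the Tidy options (`-numeric`, `--word-2000 yes`) and xsltproc's `--novalid` are my assumption of a sensible setup rather than something the workflow requires.

```python
import subprocess
from pathlib import Path


def tidy_command(html_in, xhtml_out):
    """Build the Tidy command line that turns Word's HTML into XHTML."""
    return [
        "tidy",
        "-quiet",              # suppress the banner and summary
        "-asxhtml",            # emit well-formed XHTML
        "-numeric",            # numeric entities, so no DTD is needed later
        "--word-2000", "yes",  # assumption: let Tidy drop some Word-specific markup
        "-o", str(xhtml_out),
        str(html_in),
    ]


def xsltproc_command(stylesheet, xhtml_in, html_out):
    """Build the xsltproc command line that applies the custom stylesheet."""
    return [
        "xsltproc",
        "--novalid",           # skip loading the XHTML DTD
        "-o", str(html_out),
        str(stylesheet),
        str(xhtml_in),
    ]


def convert(html_in, stylesheet, html_out):
    """Run Tidy, then xsltproc (hypothetical helper, file names are up to you)."""
    xhtml = Path(html_in).with_suffix(".xhtml")
    # Tidy exits with 1 when it only printed warnings and 2 on real errors,
    # so only a status above 1 is treated as a failure here.
    tidy = subprocess.run(tidy_command(html_in, xhtml))
    if tidy.returncode > 1:
        raise RuntimeError("tidy could not repair " + str(html_in))
    subprocess.run(xsltproc_command(stylesheet, xhtml, html_out), check=True)
```

Something like `convert("newsletter.html", "newsletter.xsl", "clean.html")` would then run both steps, provided tidy and xsltproc are on the PATH.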

The stylesheet filters out all attributes (mostly classes and in-line CSS) from every tag, converts all B and I tags to STRONG and EM, strips any other tag, keeping only H1, H2, H3, P, STRONG, EM and A (the only ones I need), and applies a header and footer. Since the document is rather long, the stylesheet also builds a linked table of contents.
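To make the filtering rules concrete, here is the same idea sketched in Python with the standard library's ElementTree instead of XSLT. It is not the stylesheet itself, and it rests on a few assumptions of mine: `href` is kept on A tags so the links survive, only the content of BODY is processed, stripped tags keep their text, the table of contents is built from the H1 headings (not from section breaks), and the `sec-N` ids are invented. Header and footer are left out.

```python
import xml.etree.ElementTree as ET

KEEP = {"h1", "h2", "h3", "p", "strong", "em", "a"}
RENAME = {"b": "strong", "i": "em"}


def local_name(tag):
    """Drop the XHTML namespace prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1].lower()


def append_text(parent, text):
    """Append text after the last child of parent, or to parent itself."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def copy_filtered(src, dst):
    """Copy the content of src into dst, keeping only the whitelisted tags."""
    append_text(dst, src.text)
    for child in src:
        name = local_name(child.tag)
        name = RENAME.get(name, name)
        if name in KEEP:
            new = ET.SubElement(dst, name)  # created without any attributes
            if name == "a" and "href" in child.attrib:
                new.set("href", child.attrib["href"])  # assumption: keep links
            copy_filtered(child, new)
        else:
            copy_filtered(child, dst)  # drop the tag, keep what is inside it
        append_text(dst, child.tail)


def clean_xhtml(source):
    """Return a cleaned BODY with a linked table of contents at the top."""
    root = ET.fromstring(source)
    body = next(e for e in root.iter() if local_name(e.tag) == "body")
    out = ET.Element("body")
    copy_filtered(body, out)
    # Assumption: one table-of-contents entry per H1, linked via invented ids.
    headings = list(out.iter("h1"))
    for n, heading in enumerate(headings, start=1):
        heading.set("id", "sec-%d" % n)
        entry = ET.Element("p")
        link = ET.SubElement(entry, "a", href="#sec-%d" % n)
        link.text = "".join(heading.itertext())
        out.insert(n - 1, entry)
    return ET.tostring(out, encoding="unicode")
```

Calling `clean_xhtml()` on the text of the XHTML file produced by Tidy returns the cleaned body as a string. Tidy's numeric entities matter here, because ElementTree does not know named entities such as `&nbsp;`.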



[1] http://tidy.sourceforge.net/
[2] xsltproc is an XSLT processor, available at
http://www.xmlsoft.org/XSLT.html and also through Fink.
XSLT is a language for transforming XML documents into other XML documents:
http://www.w3.org/TR/xslt
http://www.zvon.org/xxl/XSLTreference/Output/index.html

--
Federico
