Re: Removing html tags
- Subject: Re: Removing html tags
- From: Neil Faiman <email@hidden>
- Date: Mon, 28 Feb 2005 18:59:36 -0500
On Feb 28, 2005, at 9:59 AM, has wrote:
> Getting plain text from an HTML document is one of those problems that
> looks simple enough on the surface but turns out to be horrendously
> complicated in practice. By far the best and simplest solution is to
> use a scriptable web browser, HTML editing/processing tool,
> high-quality 3rd-party library or system API that already knows how to
> deal with real-world HTML and can retrieve an HTML document's content
> in plain-text format, e.g.:
... [Safari solution deleted]
> Naive approaches such as simple regexes or that crappy guidebook
> remove_markup() handler won't handle stuff like whitespace and
> character entities in a sensible fashion [1] and can easily mess up on
> <head> content, comments, poorly-formed markup, etc., making them far
> more trouble than they're worth.
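
To make those failure modes concrete, here is a minimal sketch in Python. It is not from the original thread and it is not the guidebook handler: naive_strip_tags and the sample markup are hypothetical, made up purely for illustration. It shows a one-line regex leaving entities undecoded, keeping <head> and <style> text, and tripping over a ">" inside a comment.

```python
import re


def naive_strip_tags(html: str) -> str:
    """Naive approach: delete anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", html)


# Hypothetical sample: a title and stylesheet in <head>, a comment
# containing ">", and a character entity in the body text.
sample = (
    "<html><head><title>Page title</title>"
    "<style>p { color: red; }</style></head>"
    "<body><!-- a > b --><p>Fish &amp; chips</p></body></html>"
)

result = naive_strip_tags(sample)
print(repr(result))
```

The only text a reader would see in a browser is "Fish & chips", yet the result still contains the page title, the stylesheet rule, the tail of the comment, and a raw "&amp;".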
Also, has's suggestion plays to AppleScript's strengths rather than
its weaknesses. Trying to solve almost any non-trivial problem in
AppleScript itself is a losing proposition. AS's strength is as a
scripting language, not a programming language. Generally, the right
question when attacking a problem from AppleScript is, "What tool do I
already have on my system that knows how to solve this problem for
me, and how can I tell it to do that from AppleScript?" So, rather than
trying to code up a semi-adequate HTML remover yourself in AppleScript,
you find a scriptable application that knows how to do it well. The
Safari suggestion is one good one. If you have BBEdit on your system,
you could look into its "translate html to text" command. In any case,
the idea is to take advantage of the work that someone else has already
done.
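
As a rough illustration of leaning on existing work, here is a sketch that hands the parsing to the HTML parser in Python's standard library instead of matching tags by hand. Python is only a stand-in for "a tool already on the system"; it is not something suggested in this thread, and TextExtractor and html_to_text are hypothetical names. The sketch assumes well-closed <head>, <script> and <style> elements and treats whitespace crudely, so it is a starting point rather than a complete HTML-to-text converter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <head>, <script> and <style> content."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; the parser routes them elsewhere.
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the body text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = (
        "<html><head><title>Page title</title></head>"
        "<body><!-- a > b --><p>Fish &amp; chips</p></body></html>"
    )
    print(html_to_text(page))
```

If a script like this were saved on disk, AppleScript could presumably call it through the do shell script command, which keeps the AppleScript side down to a line or two of glue.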
Regards,
Neil Faiman