Re: Removing html tags
- Subject: Re: Removing html tags
- From: has <email@hidden>
- Date: Mon, 28 Feb 2005 14:59:50 +0000
Paff wrote:
> I want to remove all html tags from a document downloaded with curl
> utility. I want to remove everything that is between <> characters
> (including <> chars) so I end up with plain text. Example: if in a
> document there's "<title>BOS Bank</title>" I'd like to get only "BOS
> Bank"; if there's "<TD class="tabelka01" rowspan="2"
> align="center">Kod</TD>" I want to get only "Kod" string etc.

Getting plain text from an HTML document is one of those problems
that looks simple enough on the surface but turns out to be
horrendously complicated in practice. By far the best and simplest
solution is to use a scriptable web browser, HTML editing/processing
tool, high-quality 3rd-party library or system API that already knows
how to deal with real-world HTML and can retrieve an HTML document's
content in plain-text format, e.g.:
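
tell application "Safari"
    -- "path:to:file.html" is a placeholder HFS path to the saved HTML file
    open alias "path:to:file.html"
    -- Assumes the page has finished loading by this point
    set pageTitle to name of window 1
    set pageText to text of document 1
    close document 1
end tell
return {pageTitle, pageText}
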
Naive approaches such as simple regexes or that crappy guidebook
remove_markup() handler won't handle stuff like whitespace and
character entities in a sensible fashion [1] and can easily mess up
on <head> content, comments, poorly-formed markup, etc., making them
far more trouble than they're worth.
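
To make the point about naive approaches concrete, here is a minimal sketch in Python rather than AppleScript, chosen only because its standard library ships an HTML parser. The names TextExtractor, html_to_text and naive_strip are made up for this illustration. It is not a substitute for a real browser's rendering, just a comparison between a simple regex and a parser-based approach on awkward input.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, ignoring comments and <script>/<style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TD> and <td> are treated alike.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Parser-based extraction: a sketch, not a full HTML renderer."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Simplification: chunks are joined with a space and whitespace runs are
    # collapsed. This can split a word that contains inline markup.
    return " ".join(" ".join(parser.chunks).split())


def naive_strip(html):
    """The 'remove everything between < and >' regex approach."""
    return re.sub(r"<[^>]*>", "", html)


if __name__ == "__main__":
    sample = "<!-- a > b --><p>Fish &amp; chips</p>"
    print(repr(naive_strip(sample)))   # comment debris and raw entity remain
    print(repr(html_to_text(sample)))  # 'Fish & chips'
```

On the two examples from the question both functions give "BOS Bank" and "Kod"; the difference only shows up on comments, entities, scripts and similar real-world input, which is exactly where the regex approach goes wrong.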
HTH
has
[1] Note that the elderly Safari 1.0.2 I'm using doesn't strip excess
whitespace 100% correctly. This may be fixed in newer versions;
you'll need to test it yourself or use another agent if it's a
problem.
--
http://freespace.virgin.net/hamish.sanderson/
Applescript-users mailing list (email@hidden)