Re: Removing html tags

Subject: Re: Removing html tags
From: has <email@hidden>
Date: Mon, 28 Feb 2005 14:59:50 +0000

Paff wrote:

I want to remove all html tags from a document downloaded with curl utility. I want to remove everything that is between <> characters (including <> chars) so I end up with plain text. Example: if in a document there's "<title>BOS Bank</title>" I'd like to get only "BOS Bank"; if there's "<TD class="tabelka01" rowspan="2" align="center">Kod</TD>" I want to get only "Kod" string etc.

Getting plain text from an HTML document is one of those problems that looks simple enough on the surface but turns out to be horrendously complicated in practice. By far the best and simplest solution is to use a scriptable web browser, HTML editing/processing tool, high-quality 3rd-party library or system API that already knows how to deal with real-world HTML and can retrieve an HTML document's content in plain-text format, e.g.:

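tell application "Safari"
	-- "path:to:file.html" is a placeholder; use the HFS path of the downloaded file
	open alias "path:to:file.html"
	-- NOTE: you may need to wait for the page to finish loading before reading it
	set pageTitle to name of window 1
	set pageText to text of document 1
	close document 1
end tell
return {pageTitle, pageText}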

Naive approaches such as simple regexes or that crappy guidebook remove_markup() handler won't handle stuff like whitespace and character entities in a sensible fashion [1] and can easily mess up on <head> content, comments, poorly-formed markup, etc., making them far more trouble than they're worth.
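
To make the failure modes above concrete, here is a rough sketch in Python rather than AppleScript, assuming only the Python 3 standard library. The names naive_strip, TextExtractor and extract_text are made up for illustration, and the BLOCK and SKIPPED tag sets are deliberately incomplete. It is not a replacement for a real browser or HTML tool; it just shows the same input run through a strip-everything-between-<> regex and through a parser that knows about comments, entities and <style> content.

```python
import re
from html.parser import HTMLParser


def naive_strip(html):
    # The naive approach: delete everything between "<" and ">".
    return re.sub(r"<[^>]*>", "", html)


class TextExtractor(HTMLParser):
    # Illustrative, incomplete tag sets (assumptions, not a full HTML model).
    SKIPPED = {"script", "style"}            # content is never page text
    BLOCK = {"p", "div", "br", "li", "table", "tr", "td", "th"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TD> and <td> look the same here.
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.body_parts.append(" ")      # keep adjacent cells apart

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.body_parts.append(" ")

    def handle_data(self, data):
        # Comments never reach this method; the parser routes them elsewhere.
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            self.body_parts.append(data)


def extract_text(html):
    """Return (title, body) with runs of whitespace collapsed to one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()

    def collapse(parts):
        return " ".join("".join(parts).split())

    return collapse(parser.title_parts), collapse(parser.body_parts)


sample = """<html><head><title>BOS Bank</title>
<style>td { color: red }</style></head>
<body><!-- hidden <b>note</b> -->
<table><tr><TD class="tabelka01" rowspan="2" align="center">Kod</TD><td>A &amp; B</td></tr></table>
</body></html>"""

title, body = extract_text(sample)
leftovers = naive_strip(sample)
```

On this sample the regex version leaves the stylesheet rule, the tail of the comment and the undecoded &amp; in its output, while the parser-based version returns the title and the table text separately. Real-world pages will still throw up cases this sketch doesn't cover, which is rather the point of using a tool that already does.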

HTH

has

[1] Note that the elderly Safari 1.0.2 I'm using doesn't strip excess whitespace 100% correctly. This may be fixed in newer versions; you'll need to test it yourself or use another agent if it's a problem.

--
http://freespace.virgin.net/hamish.sanderson/