Mailing Lists: Apple Mailing Lists


Re: Removing html tags



Paff wrote:

I want to remove all html tags from a document downloaded with curl utility. I want to remove everything that is between <> characters (including <> chars) so I end up with plain text. Example: if in a document there's "<title>BOS Bank</title>" I'd like to get only "BOS Bank"; if there's "<TD class="tabelka01" rowspan="2" align="center">Kod</TD>" I want to get only "Kod" string etc.

Getting plain text from an HTML document is one of those problems that looks simple enough on the surface but turns out to be horrendously complicated in practice. By far the best and simplest solution is to use something that already knows how to deal with real-world HTML and can retrieve an HTML document's content in plain-text format: a scriptable web browser, an HTML editing/processing tool, a high-quality third-party library or a system API, e.g.:


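tell application "Safari"
	-- Open the local file (HFS-style path) in a new document
	open alias "path:to:file.html"
	-- The window name normally shows the page's title
	set pageTitle to name of window 1
	-- Safari hands back the rendered page content as plain text
	set pageText to text of document 1
	close document 1
end tell
return {pageTitle, pageText}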

Naive approaches such as simple regexes or that crappy guidebook remove_markup() handler won't handle stuff like whitespace and character entities in a sensible fashion [1] and can easily mess up on <head> content, comments, poorly-formed markup, etc., making them far more trouble than they're worth.
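
To make those failure modes concrete, here is a minimal sketch in Python (not AppleScript, and not the method used above) that compares a naive regex with the standard library's html.parser module, standing in for any tool that actually parses the markup. The names TextExtractor, html_to_text, naive_strip and sample are made up for illustration, and the whitespace handling is a crude assumption rather than what a browser does.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the title and visible text, ignoring <script>/<style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self.in_title = False
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TD> and <td> look the same here.
        if tag == "title":
            self.in_title = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)
        elif self.skip_depth == 0:
            self.body_parts.append(data)


def squeeze(parts):
    # Collapse runs of whitespace: a rough stand-in for browser rendering.
    return " ".join("".join(parts).split())


def html_to_text(html):
    """Return (title, body_text) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return squeeze(parser.title_parts), squeeze(parser.body_parts)


def naive_strip(html):
    # The regex approach: delete everything between < and >.
    return re.sub(r"<[^>]*>", "", html)


sample = (
    "<html><head><title>BOS Bank</title>"
    "<style>td { color: red }</style></head><body>"
    "<!-- a > b -->"
    '<TD class="tabelka01" rowspan="2" align="center">Kod &amp; Nazwa</TD>'
    "</body></html>"
)

print(html_to_text(sample))
print(naive_strip(sample))
```

On this sample the parser-based version keeps the title and the cell text apart and decodes the entity, while the regex version leaves the style rule, the tail of the comment and the raw &amp; in its output. Even so, this sketch is nowhere near a browser's text rendering (no block-level line breaks, no table layout), which is the argument for handing the job to an existing tool.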

HTH

has

[1] Note that the elderly Safari 1.0.2 I'm using doesn't strip excess whitespace 100% correctly. This may be fixed in newer versions; you'll need to test it yourself or use another agent if it's a problem.
--
http://freespace.virgin.net/hamish.sanderson/
_______________________________________________
Applescript-users mailing list (email@hidden)

Copyright © 2007 Apple Inc. All rights reserved.