Re: Remove HTML Tags
- Subject: Re: Remove HTML Tags
- From: Douglas Davidson <email@hidden>
- Date: Mon, 24 Nov 2008 11:51:37 -0800
On Nov 24, 2008, at 2:02 AM, Rob Keniger wrote:
On 24/11/2008, at 6:54 PM, Jean-Daniel Dupas wrote:
Hello, what's the best way to remove HTML tags and JavaScript from
an NSString?
(I'm working on a web crawler and I need a way to get the
contents of a page that doesn't have a description on it.)
Thanks,
Mr. Gecko
Just a suggestion:
loading it in a WebView and retrieving the page text content.
Or you can use one of the various -initWithHTML methods of
NSAttributedString and then just ask for the -string value.
Here's what I wrote a couple of years ago on the general topic of
"HTML stripping", i.e., obtaining the plain-text content of a given
piece of HTML:
NSAttributedString's HTML import feature will do this, but it's too
heavyweight to really be thought of as "HTML stripping". It will give
you a full rich-text copy of the HTML--the equivalent of selecting and
copying from Safari and pasting in TextEdit, for example--from which
you are then going to discard all of the formatting.
If you just want to extract plain text from HTML, you can do it with
NSXMLDocument, using the NSXMLDocumentTidyHTML option. That will
avoid dealing with the formatting, and so should be significantly
faster.
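The general idea here, parsing the markup and keeping only its text
nodes rather than building a formatted copy, is not specific to Cocoa.
What follows is a minimal language-neutral sketch in Python using the
standard library's html.parser; it is not NSXMLDocument and does not
tidy malformed markup. The names TextExtractor and html_to_text are
made up for this illustration, and skipping the contents of script and
style elements is an assumption about what a crawler would want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring everything inside <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the concatenated text nodes of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

Note that this sketch makes no decisions about whitespace or block
boundaries; it simply concatenates whatever text the source contains.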
Bear in mind that the notion of the plain-text content of HTML is not
well-defined in general. The most troublesome issue is whitespace.
For example, in HTML the relationship between two adjacent paragraphs
in text, or between two adjacent cells in a table, is a logical one
that is represented in the rendered result by a certain spatial
offset. In a plain-text representation one might prefer to have this
represented by certain whitespace characters, but it is not
necessarily obvious which ones are to be chosen, and in practice any
two different plain-text conversion mechanisms are likely to give
different results. Whitespace within the HTML source itself, on the
other hand, is supposed to be of limited significance, since the HTML
specification calls for it to be collapsed under most circumstances;
that may or may not occur under any particular plain-text conversion
mechanism.
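To make the whitespace point concrete, here is a small Python sketch
(standard library only, not any Cocoa API) in which the same HTML
yields different plain text depending on which separator the converter
picks for block boundaries. The set of tags treated as blocks, the
sentinel character, and the name convert are all assumptions chosen
for this illustration; source whitespace inside each block is
collapsed to single spaces.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, incomplete set of elements treated as block boundaries.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2"}
BOUNDARY = "\x00"  # sentinel marking a block boundary in the collected text


class BlockSplitter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(BOUNDARY)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(BOUNDARY)

    def handle_data(self, data):
        self.parts.append(data)


def convert(html, block_separator):
    """Collapse source whitespace, then join blocks with the chosen separator."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    chunks = "".join(splitter.parts).split(BOUNDARY)
    # Collapse runs of whitespace within each block, as HTML rendering would.
    cleaned = [re.sub(r"\s+", " ", chunk).strip() for chunk in chunks]
    return block_separator.join(chunk for chunk in cleaned if chunk)
```

For example, convert(html, "\n\n") and convert(html, " ") are both
defensible renderings of two adjacent paragraphs, and convert(html, "\t")
is one plausible choice for adjacent table cells; none of them is the
single correct answer.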
Another issue is generated content, such as list markers, which does
not actually exist within the HTML itself, but instead is generated at
rendering time. A simple plain-text conversion mechanism may simply
ignore it; a more complex one may represent it in one way or another,
but there may not necessarily be a suitable plain-text representation
in all cases. Because of issues of this sort, you may need to
consider what it is that you actually want out of an HTML-to-plain-text
conversion process, and examine the various options available to
you to see how well they agree with what you want.
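As an illustration of the generated-content issue, the following Python
sketch (standard library only, not any Cocoa API) converts list items
either with or without markers. The markers "1. " and "* " are one
arbitrary choice of plain-text representation, not something the HTML
contains; the names ListText and list_text are made up here, and the
sketch assumes flat, non-nested lists.

```python
from html.parser import HTMLParser


class ListText(HTMLParser):
    """Extracts list items, optionally synthesizing markers (flat lists only)."""

    def __init__(self, emit_markers):
        super().__init__(convert_charrefs=True)
        self.emit_markers = emit_markers
        self.lists = []     # stack of [tag, item counter]
        self.lines = []
        self.marker = ""
        self.text = None    # text of the item being collected, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            self.lists[-1][1] += 1
            kind, number = self.lists[-1]
            # The marker is generated here; it never appears in the source.
            self.marker = f"{number}. " if kind == "ol" else "* "
            self.text = ""

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self.lists:
            self.lists.pop()
        elif tag == "li" and self.text is not None:
            body = " ".join(self.text.split())
            prefix = self.marker if self.emit_markers else ""
            self.lines.append(prefix + body)
            self.text = None

    def handle_data(self, data):
        if self.text is not None:
            self.text += data


def list_text(html, emit_markers):
    parser = ListText(emit_markers)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

With emit_markers set to False the ordering information of an ordered
list survives only as line order; with it set to True the output
contains characters that were never in the HTML. Which behavior is
preferable depends on what the plain text will be used for.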
Douglas Davidson
_______________________________________________
Cocoa-dev mailing list (email@hidden)