Re: Remove HTML Tags

  • Subject: Re: Remove HTML Tags
  • From: Douglas Davidson <email@hidden>
  • Date: Mon, 24 Nov 2008 11:51:37 -0800


On Nov 24, 2008, at 2:02 AM, Rob Keniger wrote:

> On 24/11/2008, at 6:54 PM, Jean-Daniel Dupas wrote:
>
>>> Hello, what's the best way to remove HTML tags and JavaScript from an NSString?
>>> (I'm working on a web crawler and I need a way to get the contents of a page that doesn't have a description on it.)
>>>
>>> Thanks,
>>> Mr. Gecko
>>
>> Just a suggestion: load it in a WebView and retrieve the page's text content.
>
> Or you can use one of the various -initWithHTML methods of NSAttributedString and then just ask for the -string value.

Here's what I wrote a couple of years ago on the general topic of "HTML stripping", i.e., obtaining the plain-text content of a given piece of HTML:


NSAttributedString's HTML import feature will do this, but it's too heavyweight to really be thought of as "HTML stripping". It will give you a full rich-text copy of the HTML--the equivalent of selecting and copying from Safari and pasting in TextEdit, for example--from which you are then going to discard all of the formatting.

If you just want to extract plain text from HTML, you can do it with NSXMLDocument, using the NSXMLDocumentTidyHTML option. That will avoid dealing with the formatting, and so should be significantly faster.

Bear in mind that the notion of the plain-text content of HTML is not well-defined in general. The most troublesome issue is whitespace. For example, in HTML the relationship between two adjacent paragraphs in text, or between two adjacent cells in a table, is a logical one that is represented in the rendered result by a certain spatial offset. In a plain-text representation one might prefer to have this represented by certain whitespace characters, but it is not necessarily obvious which ones are to be chosen, and in practice any two different plain-text conversion mechanisms are likely to give different results. Whitespace within the HTML source itself, on the other hand, is supposed to be of limited significance, since the HTML specification calls for it to be collapsed under most circumstances; that may or may not occur under any particular plain-text conversion mechanism.
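To make the whitespace point concrete, here is a minimal, language-neutral sketch in Python (standard library only, not one of the Cocoa APIs discussed above). The names `TextExtractor` and `html_to_text`, the set of tags treated as blocks, and the default separators are all assumptions chosen for illustration. It collapses whitespace found in the source and lets the caller decide which characters stand in for paragraph and cell boundaries; it also drops the contents of script and style elements.

```python
import re
from html.parser import HTMLParser

# Assumptions for this sketch: which tags end a "block" and which end a cell.
BLOCK_TAGS = {"p", "div", "tr", "li", "h1", "h2", "h3"}
CELL_TAGS = {"td", "th"}
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text, emitting caller-chosen separators at element ends."""

    def __init__(self, block_sep, cell_sep):
        super().__init__()
        self.block_sep = block_sep
        self.cell_sep = cell_sep
        self.parts = []
        self.skip_depth = 0  # greater than zero while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append(self.block_sep)
        elif tag in CELL_TAGS:
            self.parts.append(self.cell_sep)

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse runs of source whitespace to a single space.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html, block_sep="\n", cell_sep="\t"):
    parser = TextExtractor(block_sep, cell_sep)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = "<p>First   paragraph</p><p>Second\nparagraph</p>"
print(repr(html_to_text(sample)))                    # 'First paragraph\nSecond paragraph'
print(repr(html_to_text(sample, block_sep="\n\n")))  # 'First paragraph\n\nSecond paragraph'
```

Both outputs are defensible plain-text renderings of the same two paragraphs, which is the sense in which two conversion mechanisms can legitimately disagree.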

Another issue is generated content, such as list markers, which does not actually exist within the HTML itself, but instead is generated at rendering time. A simple plain-text conversion mechanism may simply ignore it; a more complex one may represent it in one way or another, but there may not necessarily be a suitable plain-text representation in all cases. Because of issues of this sort, you may need to consider what it is that you actually want out of an HTML-to-plain-text conversion process, and examine the various options available to you to see how well they agree with what you want.
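As an illustration of the generated-content point, the following Python sketch (standard library only, again not a Cocoa API) converts a flat list either with or without synthesized markers. The names `ListExtractor` and `list_to_text` are hypothetical, and the marker styles ("1. " for ordered lists, "* " for unordered ones) are an arbitrary choice made for this sketch; nested lists are not handled.

```python
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Collects list item text, optionally prefixing a synthesized marker."""

    def __init__(self, markers):
        super().__init__()
        self.markers = markers
        self.open_lists = []  # one [tag, item_count] entry per open list
        self.items = []
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.open_lists.append([tag, 0])
        elif tag == "li" and self.open_lists:
            current = self.open_lists[-1]
            current[1] += 1
            marker = ""
            if self.markers:
                # The marker is not in the HTML; it is invented here.
                marker = f"{current[1]}. " if current[0] == "ol" else "* "
            self.items.append(marker)
            self.in_item = True

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self.open_lists:
            self.open_lists.pop()
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


def list_to_text(html, markers=True):
    parser = ListExtractor(markers)
    parser.feed(html)
    parser.close()
    return "\n".join(item.strip() for item in parser.items)


sample = "<ol><li>apples</li><li>pears</li></ol>"
print(list_to_text(sample, markers=False))  # apples / pears, one per line
print(list_to_text(sample, markers=True))   # 1. apples / 2. pears, one per line
```

Whether the numbered form or the bare form is "correct" depends entirely on what the plain text is going to be used for, for example indexing by a crawler versus display to a person.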

Douglas Davidson