Re: Convert HTML to plain text

  • Subject: Re: Convert HTML to plain text
  • From: Douglas Davidson <email@hidden>
  • Date: Wed, 25 Apr 2007 14:46:37 -0700


On Apr 18, 2007, at 1:43 AM, David Brennan wrote:

I need these descriptions in plain text. Some are plain text and some
are HTML. How can I convert an NSString that contains HTML to just the
text?

Here's a copy of what I wrote on this issue earlier:

"NSAttributedString's HTML import feature will do this, but it's too heavyweight to really be thought of as 'HTML stripping'. It will give you a full rich-text copy of the HTML--the equivalent of selecting and copying from Safari and pasting into TextEdit, for example--from which you are then going to discard all of the formatting.

If you just want to extract plain text from HTML, you can do it with NSXMLDocument, using the NSXMLDocumentTidyHTML option. That will avoid dealing with the formatting, and so should be significantly faster.

Bear in mind that the notion of the plain-text content of HTML is not well-defined in general. The most troublesome issue is whitespace. For example, in HTML the relationship between two adjacent paragraphs in text, or between two adjacent cells in a table, is a logical one that is represented in the rendered result by a certain spatial offset. In a plain-text representation one might prefer to have this represented by certain whitespace characters, but it is not necessarily obvious which ones are to be chosen, and in practice any two different plain-text conversion mechanisms are likely to give different results. Whitespace within the HTML source itself, on the other hand, is supposed to be of limited significance, since the HTML specification calls for it to be collapsed under most circumstances; that may or may not occur under any particular plain-text conversion mechanism.

Another issue is generated content, such as list markers, which does not actually exist within the HTML itself, but instead is generated at rendering time. A simple plain-text conversion mechanism may simply ignore it; a more complex one may represent it in one way or another, but there may not necessarily be a suitable plain-text representation in all cases. Because of issues of this sort, you may need to consider what it is that you actually want out of an HTML to plain-text conversion process, and examine the various options available to you to see how well they agree with what you want."
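
For comparison, the NSAttributedString route described above might look roughly like the following. This is only a sketch under stated assumptions: PlainTextViaAttributedString is a hypothetical helper name (not part of AppKit), the input is assumed to be encodable as UTF-8, and the code uses manual retain/release rather than ARC.

```objc
#import <Cocoa/Cocoa.h>

// Hypothetical helper, not an AppKit API: imports the HTML as rich text,
// then throws away the attributes and keeps only the characters.
static NSString *PlainTextViaAttributedString(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil;
    }

    // Tell the importer which encoding the bytes are in, so it does not
    // have to guess.
    NSDictionary *options =
        [NSDictionary dictionaryWithObject:[NSNumber numberWithUnsignedInteger:NSUTF8StringEncoding]
                                    forKey:NSCharacterEncodingDocumentOption];

    NSAttributedString *attributed =
        [[NSAttributedString alloc] initWithHTML:data
                                         options:options
                                documentAttributes:NULL];

    // -string returns the text without any formatting attributes.
    NSString *text = [[[attributed string] copy] autorelease];
    [attributed release];
    return text;
}
```

The whole rich-text import still runs here even though only the characters are kept, which is the cost the quoted text warns about.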

For reference, here is some code I've run across that uses NSXMLDocument:

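NSError *error = nil;
// html is the NSString holding the HTML source. Tidy it into well-formed XML first.
NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html options:NSXMLDocumentTidyHTML error:&error];
// -stringValue concatenates the text of all nodes; result is nil if parsing failed.
NSString *result = [[doc stringValue] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
[doc release];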
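
To illustrate the whitespace question raised in the quoted text, here is one arbitrary way to decide how adjacent paragraphs are separated: pull out each p element and join the pieces with a blank line. This is a sketch, not a general converter. ParagraphText is a hypothetical helper name, only p elements are considered (text outside them is dropped), whitespace inside a paragraph is left as the parser delivers it, and manual retain/release is assumed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of each <p> element, separated by
// a blank line. The separator is a choice, not something HTML dictates.
static NSString *ParagraphText(NSString *html) {
    NSError *error = nil;
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html
                                                          options:NSXMLDocumentTidyHTML
                                                            error:&error];
    if (doc == nil) {
        return nil;
    }

    // local-name() matches p elements whether or not the tidied document
    // places them in a namespace.
    NSArray *paragraphs = [doc nodesForXPath:@"//*[local-name()='p']" error:&error];

    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray *pieces = [NSMutableArray array];
    for (NSXMLNode *paragraph in paragraphs) {
        NSString *text = [[paragraph stringValue] stringByTrimmingCharactersInSet:whitespace];
        if ([text length] > 0) {
            [pieces addObject:text];   // skip empty paragraphs
        }
    }

    [doc release];
    return [pieces componentsJoinedByString:@"\n\n"];
}
```

A different converter could just as reasonably pick a single newline, or a tab between table cells, which is why two mechanisms rarely agree.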


Exactly what you should do depends on what you want out of the process.

Douglas Davidson


References: 
 >Convert HTML to plain text (From: "David Brennan" <email@hidden>)
