Re: Convert HTML to plain text
- Subject: Re: Convert HTML to plain text
- From: Douglas Davidson <email@hidden>
- Date: Wed, 25 Apr 2007 14:46:37 -0700
On Apr 18, 2007, at 1:43 AM, David Brennan wrote:
I need these descriptions in plain text. Some are plain text and some
are HTML. How can I convert an NSString that contains HTML to just the
text?
Here's a copy of what I wrote on this issue earlier:
"NSAttributedString's HTML import feature will do this, but it's too
heavyweight to really be thought of as 'HTML stripping'. It will
give you a full rich-text copy of the HTML--the equivalent of
selecting and copying from Safari and pasting in TextEdit, for
example--from which you are then going to discard all of the formatting.
If you just want to extract plain text from HTML, you can do it with
NSXMLDocument, using the NSXMLDocumentTidyHTML option. That will
avoid dealing with the formatting, and so should be significantly
faster.
Bear in mind that the notion of the plain-text content of HTML is not
well-defined in general. The most troublesome issue is whitespace.
For example, in HTML the relationship between two adjacent paragraphs
in text, or between two adjacent cells in a table, is a logical one
that is represented in the rendered result by a certain spatial
offset. In a plain-text representation one might prefer to have this
represented by certain whitespace characters, but it is not
necessarily obvious which ones are to be chosen, and in practice any
two different plain-text conversion mechanisms are likely to give
different results. Whitespace within the HTML source itself, on the
other hand, is supposed to be of limited significance, since the HTML
specification calls for it to be collapsed under most circumstances;
that may or may not occur under any particular plain-text conversion
mechanism.
Another issue is generated content, such as list markers, which does
not actually exist within the HTML itself, but instead is generated
at rendering time. A simple plain-text conversion mechanism may
simply ignore it; a more complex one may represent it in one way or
another, but there may not necessarily be a suitable plain-text
representation in all cases. Because of issues of this sort, you may
need to consider what you actually want out of an HTML-to-plain-text
conversion process, and examine the various options available to you
to see how well they agree with what you want."
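To make the first option concrete, the NSAttributedString route might look roughly like the following. This is only a sketch: PlainTextViaAttributedString is a name made up for illustration, and the code assumes AppKit is linked, manual retain/release, and that it is called on the main thread. The string is handed over as UTF-16 so that the importer does not have to guess the encoding.

```objective-c
#import <Cocoa/Cocoa.h>

// Hypothetical helper: imports the HTML as rich text, then keeps only the
// characters and discards every attribute.
static NSString *PlainTextViaAttributedString(NSString *html) {
    // UTF-16 data carries a byte-order mark, so no charset guess is needed.
    NSData *data = [html dataUsingEncoding:NSUnicodeStringEncoding];
    NSAttributedString *rich =
        [[NSAttributedString alloc] initWithHTML:data documentAttributes:NULL];
    // -string returns the text without formatting; copy it before releasing.
    NSString *text = [[[rich string] copy] autorelease];
    [rich release];  // manual retain/release; omit these calls under ARC
    return text;     // nil if the import failed
}
```

Whatever paragraph breaks and list markers the importer generates end up in the returned string, so the result is closer to what a copy and paste out of Safari would give.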
For reference, here is some code I've run across that uses
NSXMLDocument:
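    // html is assumed to be an NSString holding the HTML source.
    NSError *error = nil;
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html
        options:NSXMLDocumentTidyHTML error:&error];
    // doc is nil if the HTML could not be tidied and parsed; error says why.
    NSString *result = [[doc stringValue] stringByTrimmingCharactersInSet:
        [NSCharacterSet whitespaceAndNewlineCharacterSet]];
    [doc release];  // manual retain/release; omit under ARC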
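The stringValue of the document is just the text nodes concatenated in document order, so that approach leaves the whitespace and list-marker questions discussed above unanswered. If you want to make those choices yourself, one possibility is to walk the tidied tree and decide per element. The following is a sketch under stated assumptions, not a complete converter: CollapseWhitespace and PlainTextByBlocks are hypothetical names, only p and li elements are handled, text outside them is dropped, a p nested inside an li would be emitted twice, and the "- " marker and single-newline separator are arbitrary choices. It assumes Objective-C 2.0 on Mac OS X 10.5 or later and manual retain/release.

```objective-c
#import <Foundation/Foundation.h>

// Collapses each run of whitespace to a single space and trims the ends,
// roughly as HTML calls for in ordinary text.
static NSString *CollapseWhitespace(NSString *text) {
    NSCharacterSet *space = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray *words = [NSMutableArray array];
    for (NSString *word in [text componentsSeparatedByCharactersInSet:space]) {
        if ([word length] > 0) [words addObject:word];
    }
    return [words componentsJoinedByString:@" "];
}

// Hypothetical helper: emits one line per <p> or <li> element.
static NSString *PlainTextByBlocks(NSString *html) {
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html
        options:NSXMLDocumentTidyHTML error:NULL];
    NSMutableArray *lines = [NSMutableArray array];
    NSXMLNode *node = doc;  // if parsing failed, doc is nil and the loop is skipped
    while ((node = [node nextNode]) != nil) {  // document-order traversal
        if ([node kind] != NSXMLElementKind) continue;
        NSString *name = [node localName];
        NSString *text = CollapseWhitespace([node stringValue]);
        if ([text length] == 0) continue;
        if ([name isEqualToString:@"p"]) {
            [lines addObject:text];
        } else if ([name isEqualToString:@"li"]) {
            // List markers are generated content; "- " stands in for them here.
            [lines addObject:[@"- " stringByAppendingString:text]];
        }
    }
    [doc release];  // manual retain/release; omit under ARC
    return [lines componentsJoinedByString:@"\n"];
}
```

Calling PlainTextByBlocks on a fragment with one paragraph and one list item should give two lines, the second prefixed with the marker. Choosing a different separator or marker is a one-line change, which is really the point: the choice is yours rather than the framework's.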
Exactly what you should do depends on what you want out of the
process.
Douglas Davidson