Re: Convert HTML to plain text
- Subject: Re: Convert HTML to plain text
- From: "Michael G. Ströck" <email@hidden>
- Date: Wed, 18 Apr 2007 13:42:19 +0200
Unless you are writing this for kicks or have very special needs, you
are doing a lot of work that has already been done for you. Take a
look at the source code for Vienna, for example. It's open source
under a permissive (non-copyleft) license, so you can even use the code
in a closed-source project: http://sourceforge.net/projects/vienna-rss
Best,
Michael Ströck
P.S.: Here's some very basic tag-stripping code:
/* stringByRemovingHTML
 * Returns an autoreleased copy of the specified string with all HTML tags
 * removed. Note that the result is truncated to cutOff (150) characters.
 */
+(NSString *)stringByRemovingHTML:(NSString *)theString
{
    NSMutableString * aString = [NSMutableString stringWithString:theString];
    int maxChrs = [theString length];
    int cutOff = 150;
    int indexOfChr = 0;
    int tagLength = 0;
    int tagStartIndex = 0;
    BOOL isInQuote = NO;
    BOOL isInTag = NO;

    // Rudimentary HTML tag parsing. This could be done by initWithHTML on an
    // attributed string and extracting the raw string, but initWithHTML cannot
    // be invoked within an NSURLConnection callback, which is where this is
    // probably liable to be used.
    while (indexOfChr < maxChrs)
    {
        unichar ch = [aString characterAtIndex:indexOfChr];
        if (isInTag)
            ++tagLength;
        else if (indexOfChr >= cutOff)
            break;

        // Quotes are tracked outside tags as well, so an unbalanced '"' in
        // the text hides any '<' that follows until the next quote.
        if (ch == '"')
            isInQuote = !isInQuote;
        else if (ch == '<' && !isInQuote)
        {
            isInTag = YES;
            tagStartIndex = indexOfChr;
            tagLength = 0;
        }
        else if (ch == '>' && isInTag)
        {
            if (++tagLength > 2)
            {
                NSRange tagRange = NSMakeRange(tagStartIndex, tagLength);
                NSString * tag = [[aString substringWithRange:tagRange] lowercaseString];
                int indexOfTagName = 1;

                // Extract the tag name, skipping the '/' of a closing tag
                if ([tag characterAtIndex:indexOfTagName] == '/')
                    ++indexOfTagName;

                int chIndex = indexOfTagName;
                unichar tagCh = [tag characterAtIndex:chIndex];
                while (chIndex < tagLength && [[NSCharacterSet lowercaseLetterCharacterSet] characterIsMember:tagCh])
                    tagCh = [tag characterAtIndex:++chIndex];

                NSString * tagName = [tag substringWithRange:NSMakeRange(indexOfTagName, chIndex - indexOfTagName)];
                [aString deleteCharactersInRange:tagRange];

                // Replace <br> (any form), <p> and <div> with newlines
                if ([tagName isEqualToString:@"br"] || [tag isEqualToString:@"<p>"] || [tag isEqualToString:@"<div>"])
                    [aString insertString:@"\n" atIndex:tagRange.location];

                // Reset scan to the point where the tag started minus one because
                // we bump up indexOfChr at the end of the loop.
                indexOfChr = tagStartIndex - 1;
                maxChrs = [aString length];
                isInTag = NO;
                isInQuote = NO; // Fix problem with Tribe.net feeds that have bogus quotes in HTML tags
            }
        }
        ++indexOfChr;
    }

    if (maxChrs > cutOff)
        [aString deleteCharactersInRange:NSMakeRange(cutOff, maxChrs - cutOff)];

    // stringByUnescapingExtendedCharacters is not a Foundation method; it comes
    // from Vienna's own NSString category, so copy that over as well.
    return [aString stringByUnescapingExtendedCharacters];
}
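
For comparison, the initWithHTML route mentioned in the comment above
might look roughly like the following. This is only a sketch, not code
from Vienna: PlainTextFromHTML is a made-up name, and it assumes manual
reference counting (drop the autorelease under ARC) and that it is called
on the main thread, outside any NSURLConnection callback.

```objc
#import <Cocoa/Cocoa.h>

// Hypothetical helper (not from Vienna). Assumes manual reference counting
// and a call on the main thread, outside any NSURLConnection callback.
static NSString * PlainTextFromHTML(NSString * html)
{
    NSData * data = [html dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil)
        return nil;

    // An RSS description is only a fragment with no <meta> tag declaring
    // its encoding, so tell the importer explicitly that the bytes are UTF-8.
    NSDictionary * options = [NSDictionary
        dictionaryWithObject:[NSNumber numberWithUnsignedInt:NSUTF8StringEncoding]
                      forKey:NSCharacterEncodingDocumentOption];

    NSAttributedString * attributed = [[[NSAttributedString alloc]
        initWithHTML:data options:options documentAttributes:NULL] autorelease];

    // -string drops the formatting attributes and keeps only the characters.
    return [attributed string];
}
```

You would call it as PlainTextFromHTML(description) once the download has
finished. Unlike the tag stripper above, it does not truncate the result,
and it leaves entity decoding to the HTML importer.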
On 18.04.2007, at 10:43, David Brennan wrote:
Hi,
I'm working on a feed reader. Some RSS items have a description that
contains HTML. From what I can see, the HTML that comes in these RSS
items is not a full HTML page but only the HTML between the <body>
tags.
I need these descriptions in plain text. Some are plain text and some
are HTML. How can I convert an NSString that contains HTML to just the
text?
Kind regards,
Dave.
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden