Re: Convert HTML to plain text
  • Subject: Re: Convert HTML to plain text
  • From: "Michael G. Ströck" <email@hidden>
  • Date: Wed, 18 Apr 2007 13:42:19 +0200

Unless you are writing this for kicks or have very special needs, you are doing a lot of work that has already been done for you. Take a look at the source code for Vienna, for example. It's open source under a non-copyleft license, so you can even use the code in a closed-source project: http://sourceforge.net/projects/vienna-rss

Best,
Michael Ströck

P.S.: Here's some very basic tag-stripping code:

/* stringByRemovingHTML
 * Returns an autoreleased copy of the specified string with all HTML tags removed,
 * truncated to the first cutOff (150) characters.
 */
+(NSString *)stringByRemovingHTML:(NSString *)theString
{
    NSMutableString * aString = [NSMutableString stringWithString:theString];
    int maxChrs = [theString length];
    int cutOff = 150;
    int indexOfChr = 0;
    int tagLength = 0;
    int tagStartIndex = 0;
    BOOL isInQuote = NO;
    BOOL isInTag = NO;

    // Rudimentary HTML tag parsing. This could be done by initWithHTML on an attributed string
    // and extracting the raw string but initWithHTML cannot be invoked within an NSURLConnection
    // callback which is where this is probably liable to be used.
    while (indexOfChr < maxChrs)
    {
        unichar ch = [aString characterAtIndex:indexOfChr];
        if (isInTag)
            ++tagLength;
        else if (indexOfChr >= cutOff)
            break;  // Enough plain text collected; stop scanning

        if (ch == '"')
            isInQuote = !isInQuote;
        else if (ch == '<' && !isInQuote)
        {
            isInTag = YES;
            tagStartIndex = indexOfChr;
            tagLength = 0;
        }
        else if (ch == '>' && isInTag)
        {
            // Count the closing '>' too; tags shorter than three characters are left alone
            if (++tagLength > 2)
            {
                NSRange tagRange = NSMakeRange(tagStartIndex, tagLength);
                NSString * tag = [[aString substringWithRange:tagRange] lowercaseString];
                int indexOfTagName = 1;

                // Extract the tag name, skipping the '/' of a closing tag
                if ([tag characterAtIndex:indexOfTagName] == '/')
                    ++indexOfTagName;

                int chIndex = indexOfTagName;
                unichar tagCh = [tag characterAtIndex:chIndex];
                while (chIndex < tagLength && [[NSCharacterSet lowercaseLetterCharacterSet] characterIsMember:tagCh])
                    tagCh = [tag characterAtIndex:++chIndex];

                NSString * tagName = [tag substringWithRange:NSMakeRange(indexOfTagName, chIndex - indexOfTagName)];
                [aString deleteCharactersInRange:tagRange];

                // Insert a newline where a <br> tag, or a bare <p> or <div> tag, was removed
                if ([tagName isEqualToString:@"br"] || [tag isEqualToString:@"<p>"] || [tag isEqualToString:@"<div>"])
                    [aString insertString:@"\n" atIndex:tagRange.location];

                // Reset scan to the point where the tag started minus one because
                // we bump up indexOfChr at the end of the loop.
                indexOfChr = tagStartIndex - 1;
                maxChrs = [aString length];
                isInTag = NO;
                isInQuote = NO; // Fix problem with Tribe.net feeds that have bogus quotes in HTML tags
            }
        }
        ++indexOfChr;
    }

    // Trim whatever is left beyond the cut-off
    if (maxChrs > cutOff)
        [aString deleteCharactersInRange:NSMakeRange(cutOff, maxChrs - cutOff)];

    // stringByUnescapingExtendedCharacters is an NSString category method from Vienna, not Foundation
    return [aString stringByUnescapingExtendedCharacters];
}
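
The listing ends by calling stringByUnescapingExtendedCharacters, which Foundation does not provide, so the method will not compile outside Vienna as posted. A rough stand-in might look like the sketch below. It is not Vienna's implementation: the function name UnescapeBasicEntities, the small set of named entities, and the restriction to decimal references below 0xD800 are all assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for Vienna's stringByUnescapingExtendedCharacters.
// Decodes a handful of named entities and decimal references such as &#233;.
// Anything it does not recognise is copied through unchanged.
static NSString *UnescapeBasicEntities(NSString *input)
{
    NSDictionary *named = [NSDictionary dictionaryWithObjectsAndKeys:
        @"&", @"amp", @"<", @"lt", @">", @"gt", @"\"", @"quot", @"'", @"apos", nil];
    NSCharacterSet *nonDigits =
        [[NSCharacterSet characterSetWithCharactersInString:@"0123456789"] invertedSet];
    NSMutableString *output = [NSMutableString string];
    NSUInteger length = [input length];
    NSUInteger pos = 0;

    while (pos < length)
    {
        NSRange amp = [input rangeOfString:@"&" options:0 range:NSMakeRange(pos, length - pos)];
        if (amp.location == NSNotFound)
        {
            // No more entities: copy the rest and stop
            [output appendString:[input substringFromIndex:pos]];
            break;
        }

        // Copy the literal text in front of the ampersand
        [output appendString:[input substringWithRange:NSMakeRange(pos, amp.location - pos)]];

        NSUInteger bodyStart = amp.location + 1;
        NSRange semi = [input rangeOfString:@";" options:0
                                      range:NSMakeRange(bodyStart, length - bodyStart)];
        NSString *replacement = nil;
        if (semi.location != NSNotFound)
        {
            NSString *body = [input substringWithRange:
                NSMakeRange(bodyStart, semi.location - bodyStart)];
            if ([body hasPrefix:@"#"])
            {
                // Decimal character reference: accept one to five digits only
                NSString *digits = [body substringFromIndex:1];
                BOOL allDigits = [digits length] > 0 && [digits length] <= 5 &&
                    [digits rangeOfCharacterFromSet:nonDigits].location == NSNotFound;
                int code = allDigits ? [digits intValue] : 0;
                if (code > 0 && code < 0xD800)
                {
                    unichar c = (unichar)code;
                    replacement = [NSString stringWithCharacters:&c length:1];
                }
            }
            else
                replacement = [named objectForKey:body];
        }

        if (replacement != nil)
        {
            [output appendString:replacement];
            pos = semi.location + 1;
        }
        else
        {
            // Not a recognised entity: keep the ampersand and carry on after it
            [output appendString:@"&"];
            pos = bodyStart;
        }
    }
    return output;
}
```

With something like this in place, the last line of the method could read return UnescapeBasicEntities(aString); instead. The input is decoded in a single pass, so a doubly escaped sequence such as &amp;lt; comes out as &lt; rather than as a bare angle bracket.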


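The comment in the listing mentions the other route: build an attributed string from the HTML and read back its raw string. A minimal sketch of that approach follows, assuming AppKit on Mac OS X, ARC, UTF-8 input, and a call made on the main thread (not from inside an NSURLConnection callback, as the comment warns). The function name PlainTextViaAttributedString is hypothetical.

```objc
#import <Cocoa/Cocoa.h>

// Sketch only: lets AppKit's HTML importer do the parsing, then discards
// the attributes. Returns nil if the importer rejects the data.
static NSString *PlainTextViaAttributedString(NSString *html)
{
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil)
        return nil;

    // Tell the importer the bytes are UTF-8 so non-ASCII text survives
    NSDictionary *options = [NSDictionary
        dictionaryWithObject:[NSNumber numberWithUnsignedInteger:NSUTF8StringEncoding]
                      forKey:NSCharacterEncodingDocumentOption];

    NSAttributedString *attributed =
        [[NSAttributedString alloc] initWithHTML:data
                                         options:options
                              documentAttributes:NULL];
    return [attributed string];
}
```

This handles entities and block-level tags without any hand-written parsing, at the cost of pulling in the HTML importer and its threading restriction, which is the trade-off the listing's comment describes.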

On 18.04.2007 at 10:43, David Brennan wrote:

Hi,

I'm working on a feed reader. Some RSS items have a description that
contains HTML. From what I can see, the HTML that comes in these RSS
items is not a full HTML page but only the HTML between the <body>
tags.

I need these descriptions in plain text. Some are plain text and some
are HTML. How can I convert an NSString that contains HTML to just the
text?

Kind regards,
Dave.
_______________________________________________

Cocoa-dev mailing list (email@hidden)

Do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden

