Re: extracting html from text file


  • Subject: Re: extracting html from text file
  • From: Marcel Weiher <email@hidden>
  • Date: Wed, 3 Jul 2002 20:20:29 +0200

On Wednesday, July 3, 2002, at 05:14 Uhr, Koen van der Drift wrote:

I am just learning Cocoa, and have an idea for a project. I have a plain text file, and part of the text is HTML code. The HTML part of course starts with <html> and ends with </html>, so I have two markers. What I want to do is read the text file, scan each line until <html> is found, store the text that follows in a separate object until </html>, and then save the HTML text. Sounds simple, but at this point I have no idea what would be the best way to approach this. Can I, e.g., read the text into an NSString and then scan the text with regular expressions (like in Perl)? Or does Cocoa already have classes that do such things?

The following uses Objective-XML to get the character data (text) between the <html> and </html> tags (all tags are stripped). Currently, it writes the result to stdout, but changing [MPWByteStream Stdout] to [MPWByteStream stream] will make it write to an NSData that can be retrieved by sending the stream the 'target' message.

Sample run:

marcel@tuvuk[/tmp]cat test.somehtml
Previous text
<html>
html text
</html>
After text
marcel@tuvuk[/tmp]~/programming/Build/sax-parse-example test.somehtml

html text
marcel@tuvuk[/tmp]




Enjoy,

Marcel

---------- SaxClient.h ---------------

/* SaxClient.h created by marcel on Tue 07-Dec-1999 */

#import <Foundation/Foundation.h>
#import <MPWXmlKit/MPWSaxProtocol.h>

@interface SaxClient : NSObject <MPWSaxDocumentHandler>
{
    int htmlTagCount;   // nesting depth of currently open <html> elements
    id  outStream;      // stream that receives the extracted text
}

@end


--------- SaxClient.m --------------


/* SaxClient.m created by marcel on Tue 07-Dec-1999 */

#import "SaxClient.h"
#import <MPWFoundation/MPWFoundation.h>

@implementation SaxClient

// MPWFoundation macro: generates the outStream / setOutStream: accessors
idAccessor( outStream, setOutStream )

-init
{
    self=[super init];
    if ( self ) {
        [self setOutStream:[MPWByteStream Stdout]];
        htmlTagCount=0;
    }
    return self;
}


// Count <html> start tags, ignoring case
-(void)startElement:elementName attributes:attributes
{
    if ( [[elementName lowercaseString] isEqual:@"html"] ) {
        htmlTagCount++;
    }
}

// Matching </html> end tag
-(void)endElement:elementName
{
    if ( [[elementName lowercaseString] isEqual:@"html"] ) {
        htmlTagCount--;
    }
}

// Pass character data through only while inside an <html> element
-(void)characters:characterData
{
    if ( htmlTagCount > 0 ) {
        [outStream writeObject:characterData];
    }
}

// Treat CDATA sections like ordinary character data
-(void)cdata:characterData
{
    [self characters:characterData];
}

// Remaining handler callbacks are not needed here
-(void)startDocument {}
-(void)endDocument {}
-(void)setDocumentLocator:locator {}

-(void)dealloc
{
    [outStream release];
    [super dealloc];
}

@end

-------------- main.m -------------


#import <Foundation/Foundation.h>
#import <MPWXmlKit/MPWXmlParser.h>
#import "SaxClient.h"

int main (int argc, const char *argv[])
{
    // argv[1] is the file to parse; bail out early if it is missing
    if ( argc < 2 ) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    id client =[[[SaxClient alloc] init] autorelease];
    id parser =[[[MPWXmlParser alloc] init] autorelease];

    // The client receives the SAX callbacks while the parser scans the file
    [parser setDocumentHandler:client];
    [parser scan:[NSData dataWithContentsOfMappedFile:[NSString stringWithCString:argv[1]]]];

    [pool release];
    return 0;
}
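
To save the extracted text instead of printing it, the stream swap mentioned at the top of this message could look roughly like the variant of main.m below. This is only a sketch: it assumes that the idAccessor line in SaxClient.m provides the outStream and setOutStream: accessors, that 'target' returns the collected NSData as described above, and that a second command-line argument (an addition for illustration, not part of the program above) names the output file.

```objective-c
#import <Foundation/Foundation.h>
#import <MPWFoundation/MPWFoundation.h>
#import <MPWXmlKit/MPWXmlParser.h>
#import "SaxClient.h"

int main (int argc, const char *argv[])
{
    // argv[1]: input file, argv[2]: output file (the latter is hypothetical)
    if ( argc < 3 ) {
        fprintf(stderr, "usage: %s <infile> <outfile>\n", argv[0]);
        return 1;
    }

    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    id client =[[[SaxClient alloc] init] autorelease];
    id parser =[[[MPWXmlParser alloc] init] autorelease];

    // Replace the default Stdout stream with one that collects into an NSData.
    // Assumes setOutStream: is generated by idAccessor( outStream, setOutStream ).
    [client setOutStream:[MPWByteStream stream]];

    [parser setDocumentHandler:client];
    [parser scan:[NSData dataWithContentsOfMappedFile:[NSString stringWithCString:argv[1]]]];

    // Per the description above, 'target' hands back the collected NSData.
    NSData *htmlText = [[client outStream] target];
    BOOL ok = [htmlText writeToFile:[NSString stringWithCString:argv[2]] atomically:YES];

    [pool release];
    return ok ? 0 : 1;
}
```

Whether the stream has to be flushed or closed before 'target' is sent is not covered here, so check that against the MPWFoundation headers.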



--
Marcel Weiher Metaobject Software Technologies
email@hidden www.metaobject.com
Metaprogramming for the Graphic Arts. HOM, IDEAs, MetaAd etc.