Re: extracting html from text file
- Subject: Re: extracting html from text file
- From: Marcel Weiher <email@hidden>
- Date: Wed, 3 Jul 2002 20:20:29 +0200
On Wednesday, July 3, 2002, at 05:14, Koen van der Drift wrote:
I am just learning Cocoa, and have an idea for a project. I have a
plain text file, and part of the text is HTML code. The HTML part
of course starts with <html> and ends with </html>, so I have two
markers. What I want to do is read the text file, scan each line until
<html> is found, store the text that follows in a separate object until
</html>, and then save the HTML text. Sounds simple, but at this point I
have no idea what would be the best way to approach this. Can I, e.g.,
read the text into an NSString, and then scan the text with regular
expressions (like in Perl)? Or does Cocoa already have classes that do
such things?
The following uses Objective-XML to get the character data (text) inside
html tags (all tags are stripped). Currently, it will write the result
to stdout, but changing [MPWByteStream Stdout] to [MPWByteStream
stream] will write to an NSData that can be retrieved by sending the
stream the 'target' message.
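As a rough sketch of that second variant, a main.m along the following lines might collect the text in memory and save it to a file instead of printing it. It uses the SaxClient class listed further down. Assumptions not spelled out above: that the idAccessor macro generates both an -outStream getter and a -setOutStream: setter, that no explicit flush of the stream is needed before asking for 'target', and that the output path arrives as a second command-line argument (my own invention for the example).

```objc
#import <Foundation/Foundation.h>
#import <MPWFoundation/MPWFoundation.h>
#import <MPWXmlKit/MPWXmlParser.h>
#import "SaxClient.h"

int main (int argc, const char *argv[])
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    id client, parser, result;
    int status = 1;

    /* argv[2] as the output path is an assumption for this sketch */
    if ( argc > 2 ) {
        client = [[[SaxClient alloc] init] autorelease];
        parser = [[[MPWXmlParser alloc] init] autorelease];

        /* replace the default stdout stream with an in-memory stream */
        [client setOutStream:[MPWByteStream stream]];
        [parser setDocumentHandler:client];
        [parser scan:[NSData dataWithContentsOfMappedFile:
            [NSString stringWithCString:argv[1]]]];

        /* 'target' should hand back the NSData the stream wrote into */
        result = [[client outStream] target];
        if ( [result writeToFile:[NSString stringWithCString:argv[2]]
                      atomically:YES] ) {
            status = 0;
        }
    }
    [pool release];
    return status;
}
```

The sample run that follows is for the stdout version as posted, not for this variant.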
Sample run:
marcel@tuvuk[/tmp]cat test.somehtml
Previous text
<html>
html text
</html>
After text
marcel@tuvuk[/tmp]~/programming/Build/sax-parse-example test.somehtml
html text
marcel@tuvuk[/tmp]
Enjoy,
Marcel
---------- SaxClient.h ---------------
/* SaxClient.h created by marcel on Tue 07-Dec-1999 */
#import <Foundation/Foundation.h>
#import <MPWXmlKit/MPWSaxProtocol.h>
@interface SaxClient : NSObject <MPWSaxDocumentHandler>
{
    int htmlTagCount;
    id outStream;
}
@end
--------- SaxClient.m --------------
/* SaxClient.m created by marcel on Tue 07-Dec-1999 */
#import "SaxClient.h"
#import <MPWFoundation/MPWFoundation.h>
@implementation SaxClient

idAccessor( outStream, setOutStream )

-init
{
    self=[super init];
    [self setOutStream:[MPWByteStream Stdout]];
    htmlTagCount=0;
    return self;
}

/* count nesting depth of html elements */
-(void)startElement:elementName attributes:attributes
{
    if ( [[elementName lowercaseString] isEqual:@"html"] ) {
        htmlTagCount++;
    }
}

-(void)endElement:elementName
{
    if ( [[elementName lowercaseString] isEqual:@"html"] ) {
        htmlTagCount--;
    }
}

/* only pass on character data found inside an html element */
-(void)characters:characterData
{
    if ( htmlTagCount > 0 ) {
        [outStream writeObject:characterData];
    }
}

-(void)cdata:characterData
{
    [self characters:characterData];
}

-(void)startDocument {}
-(void)endDocument {}
-(void)setDocumentLocator:locator {}

-(void)dealloc
{
    [outStream release];
    [super dealloc];
}

@end
-------------- main.m -------------
#import <Foundation/Foundation.h>
#import <MPWXmlKit/MPWXmlParser.h>
#import "SaxClient.h"
int main (int argc, const char *argv[])
{
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    id client =[[[SaxClient alloc] init] autorelease];
    id parser =[[[MPWXmlParser alloc] init] autorelease];

    if ( argc < 2 ) {
        NSLog(@"usage: %s <file>", argv[0]);
        [pool release];
        return 1;
    }
    [parser setDocumentHandler:client];
    [parser scan:[NSData dataWithContentsOfMappedFile:
        [NSString stringWithCString:argv[1]]]];
    // insert your code here
    [pool release];
    exit(0); // ensure the process exit status is 0
    return 0; // ...and make main fit the ANSI spec.
}
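
For comparison, the plain-Foundation route raised in the question (read the file into an NSString and look for the two markers) could be sketched roughly as below. This is not the Objective-XML approach above: it does no parsing, keeps the tags, and only finds a literal <html> without attributes. ExtractHTML is a made-up helper name, and Latin-1 is an assumed file encoding; it uses manual retain/release like the code above.

```objc
#import <Foundation/Foundation.h>

/* Returns the text from the first <html> through the next </html>,
   markers included, or nil if either marker is missing. */
static NSString *ExtractHTML( NSString *text )
{
    NSRange start, rest, end;

    if ( text == nil ) {
        return nil;
    }
    start = [text rangeOfString:@"<html>" options:NSCaseInsensitiveSearch];
    if ( start.location == NSNotFound ) {
        return nil;
    }
    /* search for the closing marker only after the opening one */
    rest = NSMakeRange( NSMaxRange(start), [text length] - NSMaxRange(start) );
    end = [text rangeOfString:@"</html>"
                      options:NSCaseInsensitiveSearch
                        range:rest];
    if ( end.location == NSNotFound ) {
        return nil;
    }
    return [text substringWithRange:
        NSMakeRange( start.location, NSMaxRange(end) - start.location )];
}

int main (int argc, const char *argv[])
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSData *raw = nil;
    NSString *text = nil, *html = nil;
    int status = 1;

    if ( argc > 1 ) {
        raw = [NSData dataWithContentsOfFile:
            [NSString stringWithCString:argv[1]]];
    }
    if ( raw != nil ) {
        /* assumed encoding: Latin-1 maps every byte to a character */
        text = [[[NSString alloc] initWithData:raw
            encoding:NSISOLatin1StringEncoding] autorelease];
        html = ExtractHTML( text );
    }
    if ( html != nil ) {
        [[NSFileHandle fileHandleWithStandardOutput] writeData:
            [html dataUsingEncoding:NSISOLatin1StringEncoding]];
        status = 0;
    }
    [pool release];
    return status;
}
```

The trade-off is that this breaks as soon as the opening tag carries attributes or the markers show up inside comments, which is where a real parser earns its keep.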
--
Marcel Weiher Metaobject Software Technologies
email@hidden www.metaobject.com
Metaprogramming for the Graphic Arts. HOM, IDEAs, MetaAd etc.