Re: HTML parsing
- Subject: Re: HTML parsing
- From: Frank Miedreich <email@hidden>
- Date: Tue, 3 Sep 2002 11:07:03 +0200
The first problem is going to be that few (i.e. almost no) HTML pages
are actually well-formed XML documents, so you can't use most XML
parsers on them directly.
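To make the failure concrete, here is a minimal sketch in Python (chosen only for illustration; it is not part of the original suggestion). The sample markup and the helper name `is_well_formed` are made up for this example. A strict XML parser rejects typical "tag soup" but accepts the same content written as XHTML:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Typical tag soup: unquoted attribute value, unclosed <p> and <br>.
html_soup = "<html><body><p align=center>one<br>two</body></html>"

# The same content written as XHTML.
xhtml = '<html><body><p align="center">one<br/>two</p></body></html>'

if __name__ == "__main__":
    print("tag soup well-formed:", is_well_formed(html_soup))
    print("XHTML well-formed:   ", is_well_formed(xhtml))
```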
You should begin by piping the HTML page through tidy
(http://tidy.sourceforge.net/) to generate XHTML, which you could
then parse with a SAX parser to build an AppleScript DOM tree of the
XML document. The only AppleScript XML parser I am aware of is Late
Night's XMLtools; if you choose to keep the DOM tree in an osax, you
can use Xerces (http://xerces.apache.org/) as the parser.
You would still need to implement the interface between the DOM API and Apple events.
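The SAX-to-tree step can be sketched as follows. This is an illustration in Python rather than AppleScript or an osax, and it assumes the page has already been converted to well-formed XHTML (for example by tidy). The names `Node`, `TreeBuilder` and `build_tree` are hypothetical; only the `xml.sax` calls come from the Python standard library.

```python
import xml.sax

class Node:
    """One element of the tree built from SAX events."""
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs)   # attribute name -> value
        self.children = []         # child Node objects, in document order
        self.text = ""             # all character data directly inside this element

class TreeBuilder(xml.sax.ContentHandler):
    """Keeps a stack of open elements and attaches each new one to its parent."""
    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []

    def startElement(self, name, attrs):
        node = Node(name, attrs.items())
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        self.stack.append(node)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        if self.stack:
            self.stack[-1].text += content

def build_tree(xhtml: str) -> Node:
    """Parse well-formed XHTML and return the root Node."""
    handler = TreeBuilder()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.root

if __name__ == "__main__":
    tree = build_tree('<html><body><p class="x">one<br/>two</p></body></html>')
    print(tree.tag, [child.tag for child in tree.children])
```

The stack mirrors the nesting of open elements, which is why the input has to be well-formed: a missing end tag would leave the stack out of step with the document.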
cheers, Frank
At 13:24 -0700 on 02.09.2002, Roger Howard wrote:
I've begun to build fairly function-specific handlers for extracting values
from discrete HTML tag attributes and I was wondering if anyone has or knows
of anything a bit more generic and tested. I have two main tasks:
1) Extract data in between a given start tag and an intelligently identified
end tag. For instance, feed it the position of a <P> and it will return all
the data between the <P> and the next </P>
2) Extract values from specified tags. For instance, feed it a tag such as
<meta name="FIELDNAME" content="Field data inserted here"> and return the
labels and values in the name and content fields as a hash array like:
(("name","FIELDNAME"),("content","Field data inserted here"))
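Both tasks can be sketched with a few lines of Python, which could be run as a shell tool. This is only an illustration under stated assumptions: the function names `text_between` and `tag_attributes` are hypothetical, nested tags of the same name are not handled, and a `>` inside an attribute value would confuse the first function.

```python
import re
from html.parser import HTMLParser

def text_between(html: str, start: int, tag: str) -> str:
    """Task 1: return everything between the start tag that begins at index
    `start` and the next end tag named `tag` (matched case-insensitively)."""
    open_end = html.index(">", start) + 1            # first character after the start tag
    end_tag = re.compile(rf"</{re.escape(tag)}\s*>", re.IGNORECASE)
    match = end_tag.search(html, open_end)
    # If the tag is never closed, return the rest of the document.
    return html[open_end:match.start()] if match else html[open_end:]

class _FirstTag(HTMLParser):
    """Records the attribute pairs of the first start tag it sees."""
    def __init__(self):
        super().__init__()
        self.pairs = None

    def handle_starttag(self, tag, attrs):
        if self.pairs is None:
            self.pairs = attrs

def tag_attributes(tag_source: str):
    """Task 2: return the attributes of a tag as a list of (name, value) pairs."""
    parser = _FirstTag()
    parser.feed(tag_source)
    parser.close()
    return parser.pairs or []

if __name__ == "__main__":
    page = "<html><P>Hello <b>world</b></P><p>next</p></html>"
    print(text_between(page, page.index("<P>"), "p"))
    print(tag_attributes('<meta name="FIELDNAME" content="Field data inserted here">'))
```

The standard-library `HTMLParser` lowercases attribute names and strips the quotes from values, so the pairs come back in roughly the shape asked for above.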
A bonus would be the top-down parsing of an entire HTML document into a tree
of tags, attributes, and values.
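A top-down tree of this kind might look like the following Python sketch, which uses the lenient standard-library `HTMLParser` so the input does not have to be well-formed. The dictionary layout, the `VOID` set and the names `HTMLTree` and `parse_html` are assumptions made for the example. It does not apply HTML's implied-end-tag rules, so an unclosed `<p>` followed by another `<p>` will nest rather than close.

```python
from html.parser import HTMLParser

# Elements that never have an end tag and so never contain children.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

class HTMLTree(HTMLParser):
    """Builds a tree of dicts: {"tag": ..., "attrs": [...], "children": [...]}.
    Text is stored as plain strings among the children."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": [], "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": attrs, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name;
        # stray end tags with no matching open element are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():                      # skip whitespace-only text
            self.stack[-1]["children"].append(data)

def parse_html(html: str) -> dict:
    """Parse an HTML string and return the synthetic "#document" root node."""
    parser = HTMLTree()
    parser.feed(html)
    parser.close()
    return parser.root

if __name__ == "__main__":
    doc = parse_html('<html><head><meta name="a" content="b"></head>'
                     "<body><p>Hi<br>there</p></body></html>")
    print(doc["children"][0]["tag"])
```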
Given AppleScript's ignorance of HTML/XML structures, is there a better,
more tested way of doing this? I'd hate to get into constant revisions of my
handlers to suit additional data sets, so I'm hoping maybe there's instead
either a tried-and-true Scripting Addition or a better way such as a shell
tool I can trigger from AppleScript.
Any suggestions?
Best,
Roger Howard
_______________________________________________
applescript-users mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/applescript-users
Do not post admin requests to the list. They will be ignored.
References:
>HTML parsing (From: Roger Howard <email@hidden>)