Re: HTML parsing

  • Subject: Re: HTML parsing
  • From: Frank Miedreich <email@hidden>
  • Date: Tue, 3 Sep 2002 11:07:03 +0200

The first problem is that few (i.e. almost no) HTML pages are actually well-formed XML documents, so you can't feed them to most XML parsers directly.

You should begin by piping the HTML page through tidy (http://tidy.sourceforge.net/) to generate XHTML, which you can then parse with a SAX parser to build an AppleScript DOM tree of the XML document. The only AppleScript XML parser I am aware of is Late Night's XMLtools. If you choose to keep the DOM tree in an osax instead, you can use xerces (http://xerces.apache.org/) as the parser.
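To illustrate the second half of that pipeline, here is a minimal sketch in Python rather than AppleScript. It assumes the page has already been run through tidy, so the input is well-formed XHTML, and it assumes the input carries no DOCTYPE or named entities. The names TreeBuilder, parse_xhtml and SAMPLE are made up for this example, and the nested-dict tree is only a stand-in for a real DOM.

```python
import xml.sax


class TreeBuilder(xml.sax.ContentHandler):
    """Collects SAX events into nested dicts (a minimal DOM-like tree)."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # elements that are currently open

    def startElement(self, name, attrs):
        node = {"tag": name, "attrs": dict(attrs.items()), "children": []}
        if self._stack:
            self._stack[-1]["children"].append(node)
        else:
            self.root = node
        self._stack.append(node)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        # Keep text chunks, skipping whitespace-only ones.
        if self._stack and content.strip():
            self._stack[-1]["children"].append(content)


def parse_xhtml(xhtml):
    """Parse a well-formed XHTML string and return the root node dict."""
    handler = TreeBuilder()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.root


# Hypothetical sample input, standing in for tidy's XHTML output.
SAMPLE = (
    '<html><head>'
    '<meta name="FIELDNAME" content="Field data inserted here"/>'
    '</head><body><p>First paragraph</p></body></html>'
)

if __name__ == "__main__":
    tree = parse_xhtml(SAMPLE)
    meta = tree["children"][0]["children"][0]
    print(meta["tag"], meta["attrs"])
```

The same handler shape (start element, end element, characters) is what any SAX-style parser would drive, whichever language ends up holding the tree.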

You would still need to implement an interface that exposes the DOM API through Apple events.
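As a rough idea of which DOM calls such an interface would have to expose for the two tasks in the question, here is a small Python sketch using the standard xml.dom.minidom module. It is not AppleScript and not an Apple event interface. The helper names inner_text, text_between and tag_attributes, and the XHTML sample, are assumptions made for illustration, and the input is assumed to be well-formed XHTML already.

```python
from xml.dom import minidom

# Hypothetical well-formed XHTML, e.g. what tidy might hand back.
XHTML = (
    '<html><head>'
    '<meta name="FIELDNAME" content="Field data inserted here"/>'
    '</head><body><p>Some <b>bold</b> text</p></body></html>'
)


def inner_text(node):
    """Concatenate all text nodes below a node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(inner_text(child) for child in node.childNodes)


def text_between(doc, tag, index=0):
    """Task 1: the data between a start tag and its matching end tag."""
    element = doc.getElementsByTagName(tag)[index]
    return inner_text(element)


def tag_attributes(doc, tag, index=0):
    """Task 2: attribute labels and values as a list of (name, value) pairs."""
    element = doc.getElementsByTagName(tag)[index]
    return list(element.attributes.items())


if __name__ == "__main__":
    doc = minidom.parseString(XHTML)
    print(text_between(doc, "p"))
    print(tag_attributes(doc, "meta"))
```

Only getElementsByTagName, childNodes, the node type and the attribute map are used here, so an Apple event wrapper covering those few calls would probably be enough for both tasks.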

cheers, Frank

At 13:24 -0700 on 02.09.2002, Roger Howard wrote:
I've begun to build fairly function-specific handlers for extracting values
from discrete HTML tag attributes and I was wondering if anyone has or knows
of anything a bit more generic and tested. I have two main tasks:

1) Extract data in between a given start tag and an intelligently identified
end tag. For instance, feed it the position of a <P> and it will return all
the data between the <P> and the next </P>
2) Extract values from specified tags. For instance, feed it a tag such as
<meta name="FIELDNAME" content="Field data inserted here"> and return the
labels and values in the name and content fields as a hash array like:
(("name","FIELDNAME"),("content","Field data inserted here"))

A bonus would be the top-down parsing of an entire HTML document into a tree
of tags, attributes, and values.

Given AppleScript's ignorance of HTML/XML structures, is there a better,
more tested way of doing this? I'd hate to get into constant revisions of my
handlers to suit additional data sets, so I'm hoping maybe there's instead
either a tried-and-true Scripting Addition or a better way such as a shell
tool I can trigger from AppleScript.

Any suggestions?

Best,

Roger Howard
_______________________________________________
applescript-users mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/applescript-users
Do not post admin requests to the list. They will be ignored.

  • Follow-Ups:
    • Re: HTML parsing
      • From: Philip Aker <email@hidden>
References: 
 >HTML parsing (From: Roger Howard <email@hidden>)
