Re: Grabbing info from a webpage
- Subject: Re: Grabbing info from a webpage
- From: Daniel Jalkut <email@hidden>
- Date: Tue, 26 Jul 2005 09:38:02 -0400
On Jul 26, 2005, at 12:33 AM, Patrick Zittle wrote:
Hey guys,
I would like to grab some information from a web page. How
can I get that information so I can present it? For example, say I
wanted to find out what the featured download from the Apple
downloads page is. How could I grab that information?
Thanks a lot in advance!
You've got two tasks to perform:
1. Grab the entire web page. (easy)
2. Parse for the info you want. (harder)
How you parse it is going to depend on factors like whether you are
OK with having a browser open and showing the action as it happens,
or whether you want it to all be done quietly behind the scenes.
I often use Safari's own JavaScript functionality to parse pages on
my behalf. This is easier or harder depending on how many "hooks"
the page's author has given you. For instance, if the data you're
interested in is contained in a div with a specific ID, then you can
use the JavaScript "getElementById" function to easily locate it and
grab the contents.
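As a minimal sketch of that idea, assuming Safari's front document is
already showing a page whose author wrapped the data in an element with
the ID "featured" (a hypothetical ID chosen for illustration, not one
taken from any real page), something like this could pull out the
contents:

```applescript
-- "featured" is a hypothetical element ID used only for illustration.
-- The JavaScript returns an empty string if no such element exists.
set myJS to "var e = document.getElementById('featured'); e ? e.innerHTML : '';"
tell application "Safari"
	set theContents to do JavaScript myJS in document 1
end tell
display dialog "Found: " & theContents
```

Safari hands back the value of the last JavaScript expression, so the
lookup result ends up in theContents.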
In the case you mention, there isn't much to go on, but you'll notice
if you look at the source that the name of the featured download
appears in an h2 tag, immediately after the text "Featured Download"
appears alone inside an h1 tag. Using this information, I came up
with this (somewhat fragile) script. It asks Safari to go to the
Apple downloads page, waits for it to finish loading, and then
inspects the content via JavaScript. The JavaScript code looks for
an h1 tag with the content "Featured Download" and then assumes
that the next tag will contain the application name.
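The steps described above might be sketched roughly as follows. This is
an illustration of the approach, not a definitive implementation: the
URL is the one used elsewhere in this message, while the JavaScript, the
variable names, and the polling loop (including the 30-second limit) are
assumptions.

```applescript
-- A sketch only: the JavaScript and the wait loop are assumptions
set myTargetURL to "http://www.apple.com/downloads/macosx/"
-- Find the h1 containing "Featured Download", then take the next element
set myJS to "
var result = '';
var heads = document.getElementsByTagName('h1');
for (var i = 0; i < heads.length; i++) {
	if (heads[i].innerHTML.indexOf('Featured Download') != -1) {
		var next = heads[i].nextSibling;
		while (next && next.nodeType != 1) next = next.nextSibling;
		if (next) result = next.innerHTML;
		break;
	}
}
result;"
tell application "Safari"
	make new document with properties {URL:myTargetURL}
	-- Poll until the page reports it has loaded, giving up after about 30 seconds
	repeat 30 times
		delay 1
		if (do JavaScript "document.readyState" in document 1) is "complete" then exit repeat
	end repeat
	set featuredName to do JavaScript myJS in document 1
end tell
display dialog "The Apple Featured Download is " & featuredName
```

If the heading is not found, featuredName comes back as an empty string,
which is one of the ways this kind of script is fragile.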
The idea of "web scraping" is similar whether you do it through Safari
like this, or with other tools after grabbing the entire content.
Since I bet you want this to be done quietly in the background, you
might use curl to fetch the page content, then use whatever means are
at your disposal to search for the expected text within it. Here is an
example that works today, at least:
-- Fetch the page contents (quote the URL so the shell leaves it alone)
set myTargetURL to "http://www.apple.com/downloads/macosx/"
set myHTML to do shell script "curl " & quoted form of myTargetURL
-- Reduce the size of the examined text to only the area immediately
-- after the "Featured Download" heading
set fdOffset to offset of "<h1>Featured Download</h1>" in myHTML
if fdOffset is 0 then error "Could not find the Featured Download heading."
-- Don't read past the end of the page
set lastOffset to fdOffset + 1000
if lastOffset > (length of myHTML) then set lastOffset to length of myHTML
set shortText to text fdOffset thru lastOffset of myHTML
-- Locate the text of interest by getting the start and stop offsets
-- based on the expected container tags
set startTag to "<h2>"
set startOffset to (offset of startTag in shortText) + (length of startTag)
set endTag to "</h2>"
set endOffset to (offset of endTag in shortText) - 1
if endOffset < startOffset then error "Could not find the h2 tags."
-- Now we have the text!
display dialog "The Apple Featured Download is " & (text startOffset thru endOffset of shortText)
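If you find yourself doing this kind of tag hunting more than once, the
offset arithmetic could be wrapped in a small handler. The following is
a sketch using text item delimiters instead of offsets; textBetween is a
hypothetical helper name, not something built into AppleScript, and
"Example App" is placeholder data.

```applescript
-- textBetween is a hypothetical helper, not part of AppleScript itself.
-- It returns the text between the first startTag and the endTag after it.
on textBetween(startTag, endTag, sourceText)
	set savedDelimiters to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to startTag
		if (count of text items of sourceText) < 2 then error "Start tag not found."
		-- Rejoin everything after the first start tag
		set afterStart to (text items 2 thru -1 of sourceText) as text
		set AppleScript's text item delimiters to endTag
		set foundText to text item 1 of afterStart
	on error errMsg
		-- Restore the delimiters before passing the error along
		set AppleScript's text item delimiters to savedDelimiters
		error errMsg
	end try
	set AppleScript's text item delimiters to savedDelimiters
	return foundText
end textBetween

-- Usage with placeholder HTML
set sampleHTML to "<h1>Featured Download</h1><h2>Example App</h2>"
display dialog textBetween("<h2>", "</h2>", sampleHTML)
```

Saving and restoring the delimiters matters because they are global to
the script, and leaving them changed can confuse later string handling.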