Extract URLs using TIDs (how to?)
- Subject: Extract URLs using TIDs (how to?)
- From: Charles Arthur <email@hidden>
- Date: Sat, 26 May 2001 00:11:11 +0100
Hi..
I'm trying to script the extraction of URLs from search results on a Web
page. I've written a version that works with Tex-Edit, but I'd prefer to
use ASTIDs - the speed difference is amazing.
However, I'm somewhat stumped on how precisely to extract the terms I want.
The HTML tends to come back in the form where the URL I want is embedded in
something like this:
<a href="/news/story/000000.html>Climber survives not climbing for long
period</a><font>all sort of other things here and lots more HTML with some
<tr><td>sorts of things thrown in and then another search result which pops
up as <a href="/news/story/000001.html>Somebody climbs something, according
to report in <a href="http:www.anothersite.com>Another site</a>.
Clearly, what I want to do is to extract /news/story/000000.html and
/news/story/000001.html - or even just the 000000 and 000001 parts of them -
while ignoring the irrelevant embedded URL.
My question is, how do I go about doing that sort of thing with text item
delimiters? I have managed to do it to extract the first URL:
--watch for line wraps in the long string below.. or it might not matter
set thelist to "<a href=\"/news/story/000000.html>Climber survives not
climbing for long period</a><font>all sort of other things here and lots
more HTML with some <tr><td>sorts of things thrown in and then another
search result which pops up as <a href=\"/news/story/000001.html>Somebody
climbs something, according to report in <a
href=\"http:www.anothersite.com\">Another site</a>."
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to "/news/story/"
--which should be a unique delimiter for the stories I want
(* it may, or may not, be unique;
I haven't tested in depth but the HTML results sometimes have
embedded URLs
*)
set thelist to text items 2 thru -1 of thelist
-- 2 thru -1 because the first item is only the text before the first delimiter
set AppleScript's text item delimiters to astid
set thelist to thelist as string
set AppleScript's text item delimiters to ".html>"
set thelist to text items of thelist
repeat with anitem in thelist
display dialog anitem as string
end repeat
set AppleScript's text item delimiters to astid
--
If I then cycle through the text items of thelist, the first item is the
digits pointing to the first story (which I can then pass to URL Access
Scripting). However, the digits for the second story are never cleanly
isolated: they end up at the tail of the second item, after the first
headline and all the HTML in between.
So, how? Is it possible? In Tex-Edit, you simply find the start of the
/news/story URL, and then do a "search for ... starting from cursor". But
I'd prefer to use TIDs if possible.
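One possible TID-only approach, offered as a rough sketch rather than a tested solution: split on "/news/story/" first, then split each resulting chunk on ".html" and keep only its first text item. This assumes every wanted URL starts with "/news/story/" and ends in ".html"; the names theHTML, savedTIDs, theChunks and storyIDs are made up for illustration, and the sample string is shortened from the one above.

```applescript
-- Shortened sample in the shape described above (story hrefs have no closing quote).
set theHTML to "<a href=\"/news/story/000000.html>Climber survives</a><font>other HTML <tr><td>more <a href=\"/news/story/000001.html>Somebody climbs something, according to <a href=\"http:www.anothersite.com\">Another site</a>."

set savedTIDs to AppleScript's text item delimiters
set storyIDs to {}
try
	-- Split on the story prefix: every chunk after the first begins with a story ID.
	set AppleScript's text item delimiters to "/news/story/"
	set theChunks to text items of theHTML
	-- Split each chunk on ".html" and keep only the part in front of it.
	set AppleScript's text item delimiters to ".html"
	repeat with i from 2 to (count theChunks)
		set end of storyIDs to text item 1 of (item i of theChunks)
	end repeat
on error errMsg number errNum
	-- Restore the delimiters before passing the error on.
	set AppleScript's text item delimiters to savedTIDs
	error errMsg number errNum
end try
set AppleScript's text item delimiters to savedTIDs
storyIDs
```

With this sample, storyIDs ends up as {"000000", "000001"}; the anothersite link never appears because it does not contain "/news/story/". Each full path can be rebuilt as "/news/story/" & anID & ".html" before handing it to URL Access Scripting.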
Answers to the list, please, as my email is currently broken - I can send but
not receive. (DNS problems.)
Charles
http://www.ukclimbing.com : 1,000+ British crags, 350+ British climbing walls
- searchable by distance, rock type, etc, with 5-day weather forecasts for
every one - plus maps, articles, news, and the New Routes database. There's
even a cool shop attached...