Extract URLs using TIDs (how to?)

  • Subject: Extract URLs using TIDs (how to?)
  • From: Charles Arthur <email@hidden>
  • Date: Sat, 26 May 2001 00:11:11 +0100

Hi..

I'm trying to script the extraction of URLs from search results on a Web
page. I've written a version that works with Tex-Edit, but I'd prefer to
use AppleScript's text item delimiters (ASTIDs) - the speed difference is
amazing.

However, I'm somewhat stumped on how precisely to extract the terms I want.

The HTML tends to come back in the form where the URL I want is embedded in
something like this:

<a href="/news/story/000000.html>Climber survives not climbing for long
period</a><font>all sort of other things here and lots more HTML with some
<tr><td>sorts of things thrown in and then another search result which pops
up as <a href="/news/story/000001.html>Somebody climbs something, according
to report in <a href="http:www.anothersite.com>Another site</a>.

Clearly, what I want to do is to extract /news/story/000000.html and
/news/story/000001.html - or even just the 000000 and 000001 parts of them -
while ignoring the irrelevant embedded URL.

My question is, how do I go about doing that sort of thing with text item
delimiters? So far I have managed to extract only the first URL:

--watch for line wraps in the long string below.. or it might not matter
set thelist to "<a href=\"/news/story/000000.html>Climber survives not
climbing for long period</a><font>all sort of other things here and lots
more HTML with some <tr><td>sorts of things thrown in and then another
search result which pops up as <a href=\"/news/story/000001.html>Somebody
climbs something, according to report in <a
href=\"http:www.anothersite.com\">Another site</a>."

set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to "/news/story/"
(* which may, or may not, be a unique delimiter for the stories I want;
I haven't tested in depth but the HTML results sometimes have
embedded URLs
*)

set thelist to text items 2 thru -1 of thelist
-- 2 thru -1 because the first item is only the text before the first delimiter
set AppleScript's text item delimiters to astid
set thelist to thelist as string
set AppleScript's text item delimiters to ".html>"
set thelist to text items of thelist
repeat with anitem in thelist
    display dialog anitem as string
end repeat
set AppleScript's text item delimiters to astid
--


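If I then cycle through the text items of thelist, the first item is the
digits pointing to the story (which I can then pass to URL Access
Scripting). However, the second story number in the above example is never
cleanly delineated: it arrives stuck to the end of the preceding item,
after the first headline and all the intervening HTML.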

So, how? Is it possible? In Tex-Edit, you simply find the start of the
/news/story URL, and then do a "search for ... starting from cursor". But
I'd prefer to use TIDs if possible.
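
One possible TID-only approach, as a rough sketch rather than a tested solution: split on the story prefix so that each story number lands at the start of its own item, then split each of those items on ".html" and keep only the first piece. The names sampleHTML, storyPrefix and storyIDs are made up for illustration, the sample string is a shortened version of the one above, and the sketch assumes "/news/story/" appears only in the links that are wanted.

```applescript
-- Sketch only: sampleHTML, storyPrefix and storyIDs are hypothetical names.
set sampleHTML to "<a href=\"/news/story/000000.html>Climber survives</a><font>other HTML <a href=\"/news/story/000001.html>Somebody climbs something, according to <a href=\"http:www.anothersite.com\">Another site</a>."
set storyPrefix to "/news/story/" -- assumed to occur only in wanted links
set storyIDs to {}

set savedDelimiters to AppleScript's text item delimiters
try
	-- Splitting on the prefix puts each story number at the START of items 2, 3, ...
	set AppleScript's text item delimiters to storyPrefix
	set chunks to text items of sampleHTML
	
	-- Within each chunk, everything before the first ".html" is the story number.
	set AppleScript's text item delimiters to ".html"
	repeat with i from 2 to (count of chunks)
		set end of storyIDs to text item 1 of (item i of chunks)
	end repeat
	set AppleScript's text item delimiters to savedDelimiters
on error errMsg number errNum
	-- Restore the delimiters before passing the error on.
	set AppleScript's text item delimiters to savedDelimiters
	error errMsg number errNum
end try

storyIDs
```

With the sample string above this should leave storyIDs holding the two story numbers, and the full path can be rebuilt as storyPrefix & (an item of storyIDs) & ".html". The embedded http:www.anothersite.com link is skipped because it never contains the story prefix. If no story link is present, chunks has a single item and the repeat loop simply does not run.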

Answers to the list, please, as my email is currently broken - I can send but
not receive. (DNS problems.)

Charles

http://www.ukclimbing.com : 1,000+ British crags, 350+ British climbing walls
- searchable by distance, rock type, etc., with 5-day weather forecasts for
every one - plus maps, articles, news, and the New Routes database. There's
even a cool shop attached...

