URL parsing [was Re: "+" and "-" are numbers.]
- Subject: URL parsing [was Re: "+" and "-" are numbers.]
- From: Nigel Garvey <email@hidden>
- Date: Tue, 6 Aug 2002 01:43:22 +0100
has wrote on Sun, 4 Aug 2002 23:44:34 +0100:
>Arthur J. Knapp wrote:
>
>> Very nice :)
>>
>> What can you do in the area of URL parsing? ;-)
>
>You mean extracting URLs from a larger string? Well, it ain't easy.
>Extracting email addys is pretty trivial (I wrote a fast email extractor
>library myself just for the helluvit; be happy to post it [if I can find
>it] for anyone that's curious), but URLs are a whole different level of
>complexity. I've thought about writing one, but I've no real use for such a
>beast myself and there's no way I'm going to spend my valuable time on it
Here's something that needs to be developed (and optimised) by someone
with more knowledge of URL protocols than myself. It only *extracts*
candidate URLs. It doesn't test their validity or try to standardise
their cases. One or two of the lines are quite long, but the line wraps
should be obvious:
on extractURLs from str
    set theURLs to {}
    set astid to AppleScript's text item delimiters
    considering punctuation and white space but ignoring case
        -- Give bare "www." addresses an explicit "http://" prefix
        if str contains " www." then
            set AppleScript's text item delimiters to {" www."}
            set str to str's text items
            set AppleScript's text item delimiters to {" http://www."}
            set str to str as string
        end if
        if str contains "<www." then
            set AppleScript's text item delimiters to {"<www."}
            set str to str's text items
            set AppleScript's text item delimiters to {"<http://www."}
            set str to str as string
        end if
        -- Split at each colon and check whether the word before it is a protocol
        if str contains ":" then
            set AppleScript's text item delimiters to {":"}
            set theTextItems to str's text items
            set spaceRtn to space & return
            repeat with i from 2 to (count theTextItems)
                set prtcl to the last word of item (i - 1) of theTextItems
                -- "http" passes the test below as a substring of "https"
                if (prtcl is in "ftp https gopher mailto news nntp telnet wais file prospero") and (item (i - 1) of theTextItems ends with prtcl) and (the first character of item i of theTextItems is not in spaceRtn) then
                    -- The candidate ends just before the first ">" or space
                    repeat with j from 1 to (count item i of theTextItems)
                        if character j of item i of theTextItems is in "> " then
                            set j to j - 1
                            exit repeat
                        end if
                    end repeat
                    -- For URLs at the end of sentences
                    if character j of item i of theTextItems is "." then set j to j - 1
                    set the end of theURLs to prtcl & ":" & text 1 thru j of item i of theTextItems
                end if
            end repeat
        end if
    end considering
    set AppleScript's text item delimiters to astid
    return theURLs
end extractURLs
set str to "This is a string containing the URL: <www.fred.com/>. It's nice, isn't it? Also: mailto:email@hidden."
extractURLs from str
--> {"http://www.fred.com/", "mailto:email@hidden"}
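For anyone unfamiliar with the idiom, the two "www." blocks at the top of the handler are the usual text-item-delimiter find-and-replace: split the string at the search text, then rejoin the pieces with the replacement. Here is a minimal sketch of that step on its own. The handler name replaceText and its parameter names are made up for illustration and are not part of the handler above:

```applescript
-- Hypothetical helper (illustration only): replace every occurrence of
-- findStr in str with replaceStr by splitting and rejoining text items.
on replaceText(str, findStr, replaceStr)
    set astid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to {findStr}
    set theItems to str's text items -- split at each occurrence of findStr
    set AppleScript's text item delimiters to {replaceStr}
    set str to theItems as string -- rejoin with replaceStr between the pieces
    set AppleScript's text item delimiters to astid -- restore the caller's delimiters
    return str
end replaceText

replaceText("see <www.fred.com/> for details", "<www.", "<http://www.")
--> "see <http://www.fred.com/> for details"
```

Saving and restoring the delimiters, as extractURLs does with astid, keeps the change from leaking into whatever script calls the handler.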
NG