Re: extract URL from general text
- Subject: Re: extract URL from general text
- From: Hudson Barton <email@hidden>
- Date: Tue, 18 Mar 2008 21:11:12 -0400
I have no interest in determining whether a string segment is a
valid URL. I just want something that is the approximate
equivalent of what I see in email programs, word processors,
spreadsheets and lots of other programs. These programs quickly
and indeed automatically highlight things that look like URLs.
That being said, I DON'T want to activate any kind of URL except for
"http" (no email, no skype, no ftp, etc.).
My specific puzzle is that I have HTML documents (in BBEdit) that
contain lots of URLs that need to be activated. So
"http://glimfeather.com/borderless" must be converted to
"<a href="http://glimfeather.com/borderless">http://glimfeather.com/borderless</a>"
The script should also NOT convert strings that are clearly
invalid URLs such as:
- http://glimfeather.com/borderless/>
- hello://glimfeather.com/borderless
- www.glimfeather.com/borderless
- http://glimfeather/borderless
- http://glimfeather.com/borderless/)
- rain in spainhttp://glimfeather.com/borderless
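The rules above can be sketched in code. What follows is a minimal
Python sketch, not an AppleScript or BBEdit solution. The names URL_RE
and linkify_http are hypothetical, and the pattern encodes assumptions
read off the examples: a URL must start the text or follow whitespace,
must be followed by whitespace or the end of the text, must have a dot
in its host, and may not contain parentheses, angle brackets, or quotes.

```python
import html
import re

# Assumed rules, taken from the examples in this message (not from RFC 1738):
#   - the scheme must be exactly "http"
#   - the URL starts the text or follows whitespace
#   - the URL is followed by whitespace or the end of the text
#   - the host contains at least one dot
#   - '(' ')' '<' '>' and '"' are not allowed inside the URL
URL_RE = re.compile(
    r"(?<!\S)"                                 # start of text or whitespace before
    r"http://"
    r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"       # host with at least one dot
    r"(?::\d+)?"                               # optional port
    r"(?:/[A-Za-z0-9\-._~%!$&'*+,;=:@/?#]*)?"  # optional path, query, fragment
    r"(?!\S)"                                  # whitespace or end of text after
)


def linkify_http(text):
    """Wrap every URL matched by URL_RE in an HTML anchor tag."""
    def wrap(match):
        # Escape '&' and quotes so the URL is safe inside the attribute.
        url = html.escape(match.group(0), quote=True)
        return '<a href="{0}">{0}</a>'.format(url)

    return URL_RE.sub(wrap, text)
```

Because a URL that is already inside an anchor is preceded by a quote
or by ">", running linkify_http a second time should leave the text
unchanged.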
I've muddled with AppleScript for years, and I just can't believe
that nobody has ever perfected this with vanilla AppleScript or with
BBEdit. Sure, Perl (which I don't know) is good for parsing
text, but so is AppleScript in my experience.
"Hudson Barton" wrote:
> I need to extract valid "http" URLs from general (non-html) text. I
> define what is valid as follows:
>
> 1. begins with "http:"
> 2. preceded by " "
> 3. followed by " "
> 4. containing only valid characters (or validly encoded characters)
>    as per RFC1738
Best you don't go re-defining things like this.
Your 'definition' wouldn't collect the url <http://google.com> in this
message, for instance. [It would fail your (1.) and (2.) and (3.)]
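To make the objection concrete, here is a minimal Python sketch of the
quoted four-point definition. The pattern and the name strict_urls are
illustrative assumptions, and rule 4 is only approximated by excluding
whitespace, angle brackets, and double quotes instead of implementing
RFC 1738 in full.

```python
import re

# Rules 1-3: "http:" preceded by a space and followed by a space.
# Rule 4 is approximated (assumption): any run of characters other
# than whitespace, '<', '>' and '"'.
STRICT_RE = re.compile(r'(?<= )http:[^\s<>"]+(?= )')


def strict_urls(text):
    """Return the substrings that satisfy the strict definition."""
    return STRICT_RE.findall(text)
```

With this pattern the angle-bracketed URL is skipped because it is not
surrounded by spaces, while a bare URL written between two spaces is
collected.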
If you want URLs from text, don't worry about your 'definition' at first.
Just get things that are written to LOOK like urls.
Then, if you want, you can verify/validate each one separately, weeding
out the problems.
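A collect-then-validate approach along those lines might look like the
following Python sketch. The names find_candidates and looks_valid, the
set of trailing punctuation that gets stripped, and the "host must
contain a dot" check are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Step 1: a deliberately loose pattern for anything that LOOKS like an
# http URL. It stops at whitespace, angle brackets, and double quotes.
CANDIDATE_RE = re.compile(r'http://[^\s<>"]+')

# Punctuation that usually belongs to the sentence, not to the URL
# (an assumption; adjust to taste).
TRAILING = ".,;:!?)]}'"


def find_candidates(text):
    """Collect URL-looking strings, trimming trailing punctuation."""
    return [m.rstrip(TRAILING) for m in CANDIDATE_RE.findall(text)]


def looks_valid(url):
    """Step 2: a separate, stricter check applied to each candidate."""
    try:
        parts = urlsplit(url)
        host = parts.hostname or ""
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken brackets.
        return False
    return parts.scheme == "http" and "." in host
```

Keeping the two steps apart means the validation rule can be tightened
or loosened later without touching the code that finds candidates.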
For instance, if I were to refer to http://blahblahblah.co.uk, then I
told the reader to "put your own domain in there", your 'definition'
would find that domain, by your rules, but would ignore my previous
Google example.
To me, a valid url is a url that actually points somewhere. Curl it.
http://thisIsAValidDomainStringButNotAValidDomain.com
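One way to "curl it" from a script is to send a HEAD request and see
whether anything answers. The Python sketch below uses the standard
urllib.request module. The function name points_somewhere, the default
timeout, and the opener parameter (there only so the check can be
exercised without a network) are assumptions for illustration.

```python
import urllib.error
import urllib.request


def points_somewhere(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` gets a non-error response."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Unresolvable host, refused connection, HTTP error status,
        # timeout, or a string that is not a usable URL at all.
        return False
```

A well-formed string such as the one above would pass a syntax check
but should come back False here if its domain does not resolve.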
--
Gary
_______________________________________________
Do not post admin requests to the list. They will be ignored.
AppleScript-Users mailing list
(email@hidden)
Archives: http://lists.apple.com/archives/applescript-users
This email sent to email@hidden