• Open Menu Close Menu
  • Apple
  • Shopping Bag
  • Apple
  • Mac
  • iPad
  • iPhone
  • Watch
  • TV
  • Music
  • Support
  • Search apple.com
  • Shopping Bag

Lists

Open Menu Close Menu
  • Terms and Conditions
  • Lists hosted on this site
  • Email the Postmaster
  • Tips for posting to public mailing lists
Re: extract URL from general text
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: extract URL from general text


  • Subject: Re: extract URL from general text
  • From: Skeeve <email@hidden>
  • Date: Wed, 19 Mar 2008 00:25:34 +0100

Hudson Barton wrote:
I need to extract valid "http" URL's from general (non-html) text. I define what is valid as follows:

1. begins with "http:"
okay.

2.  preceded by " "
3.  followed by " "
better: Surrounded by \b (as defined by Perl's regexp).

4. containing only valid characters (or validly encoded characters) as per RFC1738
Why not take that url definition?

Anybody have a script handy that might work or can be adapted?
No problem. I simply took the mentioned RFC and made this little scipt:

grep_URLs(choose file)
to grep_URLs(an_alias)
return do shell script "perl -ne " & (quoted form of "print qq'$1\\n' while /\\b(http:\\/\\/(?:(?:(?:(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|(?:(?:(?:[a-z]|[A-Z])|[0-9])(?:(?:(?:[a-z]|[A-Z])|[0-9])|-)*(?:(?:[a-z]|[A-Z])|[0-9]))).)*(?:(?:[a-z]|[A-Z])|(?:[a-z]|[A-Z])(?:(?:(?:[a-z]|[A-Z])|[0-9])|-)*(?:(?:[a-z]|[A-Z])|[0-9])))|(?:(?:[0-9]+)\\.(?:[0-9]+)\\.(?:[0-9]+)\\.(?:[0-9]+)))(?::(?:(?:[0-9]+)))?)(?:\\/(?:(?:(?:(?:(?:(?:[a-z]|[A-Z])|[0-9]|[$-_.+]|[!*'(),])|(?:%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])))|;|:|@|&|=)*)(?:\\/(?:(?:(?:(?:(?:[a-z]|[A-Z])|[0-9]|[$-_.+]|[!*'(),])|(?:%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])))|;|:|@|&|=)*))*)(?:\\?(?:(?:(?:(?:(?:[a-z]|[A-Z])|[0-9]|[$-_.+]|[!*'(),])|(?:%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])))|;|:|@|&|=)*))?)?)\\b/g;") & " " & (quoted form of POSIX path of an_alias)
end grep_URLs


To be honest... I made it up from this, shlightly longer script, which is a straight-forward translation from the RFC to Perl:

#!/usr/bin/perl
use strict;
use warnings;

my $safe = "[\$\-_.+]";
my $extra = "[!*'(),]";
my $digit = "[0-9]";
my $lowalpha = "[a-z]";
my $hialpha = "[A-Z]";
my $digits = '(?:'. $digit . '+' .')';
my $alpha = '(?:'. $lowalpha .'|'. $hialpha .')';
my $hex = '(?:'. $digit .'|'. "[A-Fa-f]" .')';
my $escape = '(?:'. "%" . $hex . $hex . ')';
my $unreserved = '(?:'. $alpha .'|'. $digit .'|'. $safe .'|'. $extra .')';
my $uchar = '(?:'. $unreserved .'|'. $escape .')';
my $alphadigit = '(?:'. $alpha .'|'. $digit .')';
my $toplabel = '(?:'. $alpha .'|'. $alpha .'(?:'. $alphadigit .'|'. "-" .')*'. $alphadigit .')';
my $domainlabel = '(?:'. $alphadigit .'|'. '(?:' . $alphadigit .'(?:'. $alphadigit .'|'. "-" .')*'. $alphadigit .')' .')';
my $hostnumber = '(?:'. $digits . "\\\." . $digits . "\\\." . $digits . "\\\." . $digits .')';
my $hostname = '(?:'. '(?:'. $domainlabel . "." .')*'. $toplabel .')';
my $hsegment = '(?:'. '(?:'. $uchar .'|'. ";" .'|'. ":" .'|'. "\@" .'|'. "&" .'|'. "=" .')*' .')';
my $port = '(?:'. $digits .')';
my $host = '(?:'. $hostname .'|'. $hostnumber .')';
my $search = '(?:'. '(?:'. $uchar .'|'. ";" .'|'. ":" .'|'. "\@" .'|'. "&" .'|'. "=" .')*' .')';
my $hpath = '(?:'. $hsegment .'(?:'. "/" . $hsegment .')*' .')';
my $hostport = '(?:'. $host .'(?:'. ":" . $port .')?' .')';
my $httpurl = '(?:'. "http://"; . $hostport . '(?:' . "/" . $hpath .'(?:' . "\\\?" . $search . ')?' . ')?' .')';
my $httpRE= qr/\b($httpurl)\b/;


while (<>) {
   print "$1\n" while /$httpRE/g;
}



_______________________________________________
Do not post admin requests to the list. They will be ignored.
AppleScript-Users mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
Archives: http://lists.apple.com/archives/applescript-users

This email sent to email@hidden
References: 
 >extract URL from general text (From: Hudson Barton <email@hidden>)

  • Prev by Date: Re: 10.4.9 versus 10.4.10
  • Next by Date: Re: Toxic Soup and Enough for all
  • Previous by thread: Re: extract URL from general text
  • Next by thread: Re: extract URL from general text
  • Index(es):
    • Date
    • Thread