Re: Regex pattern to find URLs
- Subject: Re: Regex pattern to find URLs
- From: Kevin Ballard <email@hidden>
- Date: Sat, 6 Nov 2004 22:01:31 -0500
On Nov 6, 2004, at 9:06 PM, b.bum wrote:
> OK -- that is just bizarre.  Send me a transcript.  I'm using the
> stock Python interpreter as installed on Panther.
Hrm, that *is* bizarre. It's working now, despite being the exact same
thing as last time.
Note that the reason I didn't think it was supposed to work is that I
didn't realize you could use alternation (|) at the root of the
expression; I thought it had to be inside a subexpression. So I
didn't read it carefully and assumed the alternation was nested in
another subexpression, which would make the \( at the beginning cause
the problem I described.
I wonder why it didn't work the first time then?
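For the record, here is a small sketch of alternation at the root of a pattern next to the same alternatives nested in a group. The patterns and the sample string are made up for illustration; they are not the expressions from earlier in the thread.

```python
import re

# Alternation at the top level: | has the lowest precedence, so no
# enclosing group is needed to separate the two alternatives.
top_level = re.compile(r'http://\S+|ftp://\S+')

# The same alternatives nested in a non-capturing group, for comparison.
nested = re.compile(r'(?:http|ftp)://\S+')

# Illustrative input only.
sample = 'see ftp://example.org/file and http://example.com/page'

top_matches = top_level.findall(sample)
nested_matches = nested.findall(sample)
```

Both patterns should find the same two URLs in this sample.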
> True, regexes can't handle recursion.  But that is more an academic
> issue and not a practical one.   You could easily compose or
> dynamically generate regular expressions that allow for nesting of
> parens or brackets to whatever depth you might find to be reasonable
> -- 2 or 3 levels should be more than enough.  (Frighteningly, Perl's
> regular expression engine *can* do recursive expressions and there is
> a proposal to do something similar in Python.)
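A minimal sketch of what "dynamically generate" could look like for parentheses, assuming a fixed maximum depth. The helper name nested_parens is hypothetical, and this only checks paren balance; it is not a URL pattern.

```python
import re

def nested_parens(depth):
    """Build a pattern for text whose parentheses balance, nested at most `depth` deep."""
    # Depth 0: text containing no parentheses at all.
    pattern = r'[^()]*'
    for _ in range(depth):
        # Each pass allows one more level: plain text interleaved with
        # parenthesised groups that match the previous level's pattern.
        pattern = r'[^()]*(?:\((?:' + pattern + r')\)[^()]*)*'
    return pattern

balanced = re.compile(nested_parens(2))

two_levels = balanced.fullmatch('a(b(c)d)e') is not None     # within the limit
three_levels = balanced.fullmatch('a(b(c(d)))') is not None  # one level too deep
unbalanced = balanced.fullmatch('a(b') is not None           # never closed
```

The generated pattern grows with each level, which is one reason to stop at 2 or 3.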
> The weird case isn't that hard:
> >>> import re
> >>> x = '[http://www.foo.com/bar(]).html]'
> >>> r = re.compile(r'^\[(?P<url>http://[^)]*\)[^]]*)')
> >>> r.match(x).group('url')
> 'http://www.foo.com/bar(]).html'
> But the weird case does illuminate a flaw with any kind of
> regular-expression-based parsing of arbitrary input.   You are going
> to spend *a lot* of time dealing with special cases and evaluation
> ordering issues.
The problem with your weird cases is that they're always handled by
individual regexes and not combined into the über-regex. I'm just
wondering whether it's possible to combine all the cases so that it
works properly for everything. Depending on how you order the cases,
you may find that different URLs parse incorrectly. But that's just
because there are so many weird ways to write URLs.
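To make the ordering worry concrete, here is a sketch with two made-up alternatives. Python's re takes the first alternative that matches at a given position, not the longest one, so swapping the order changes the result.

```python
import re

# Illustrative input: a URL whose path contains parentheses.
text = 'http://example.com/a(b)'

# Alternative 1 stops at parentheses; alternative 2 accepts any non-space.
strict_first = re.compile(r'http://[^\s()]+|http://\S+')
greedy_first = re.compile(r'http://\S+|http://[^\s()]+')

# Alternatives are tried left to right and the first success wins.
strict_result = strict_first.search(text).group()
greedy_result = greedy_first.search(text).group()
```

With the strict case first the match stops before the (, while with the greedy case first it runs to the end of the string.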
I'm personally a fan of just finding the *beginning* of the URL (like,
say, the http://user:pass@domain bit) with a regex and then figuring
out the path with non-regex code (because you can encode all the logic
you want much more easily that way).
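Roughly what I have in mind, as a sketch under assumed rules: URL_START, CLOSERS, and find_urls are hypothetical names, and the stopping rules (stop at whitespace or at an unmatched closing bracket, then drop trailing punctuation) are just one possible choice.

```python
import re

# Assumed prefix pattern: scheme, optional user:pass@, then a host name.
URL_START = re.compile(
    r'https?://(?:[^\s/@:]+(?::[^\s/@]*)?@)?[A-Za-z0-9.-]+'
)

# Closing brackets mapped to the openers that must precede them.
CLOSERS = {')': '(', ']': '[', '>': '<'}

def find_urls(text):
    """Yield URLs: the regex finds each start, a plain loop decides where it ends."""
    pos = 0
    while True:
        m = URL_START.search(text, pos)
        if m is None:
            return
        end = m.end()
        open_counts = {'(': 0, '[': 0, '<': 0}
        while end < len(text) and not text[end].isspace():
            ch = text[end]
            if ch in open_counts:
                open_counts[ch] += 1
            elif ch in CLOSERS:
                opener = CLOSERS[ch]
                if open_counts[opener] == 0:
                    break  # unmatched closer: assume it belongs to the surrounding text
                open_counts[opener] -= 1
            end += 1
        # Assumption: trailing punctuation is sentence punctuation, not URL.
        yield text[m.start():end].rstrip('.,;:!?')
        pos = end

sample = 'See (http://example.com/a(b).html) and [http://user:pw@host.org/x].'
urls = list(find_urls(sample))
```

These rules would cut the deliberately pathological bar(]).html case short at the ], so that one would still need a rule of its own, but at least the rule would be a few lines of ordinary code rather than another regex.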
--
Kevin Ballard
email@hidden
http://www.tildesoft.com
http://kevin.sb.org
Cocoa-dev mailing list      (email@hidden)