Re: Regex pattern to find URLs

Subject: Re: Regex pattern to find URLs
From: Kevin Ballard <email@hidden>
Date: Sat, 6 Nov 2004 22:01:31 -0500

On Nov 6, 2004, at 9:06 PM, b.bum wrote:

OK -- that is just bizarre. Send me a transcript. I'm using the stock Python interpreter as installed on Panther.

Hrm, that *is* bizarre. It's working now, despite being the exact same thing as last time.

Note that the reason I didn't think it was supposed to work is that I didn't realize you could use alternation (the | operator) at the root of the expression; I thought it had to be inside a subexpression. So I didn't read it carefully and assumed the alternation was nested in another subexpression, which would make the \( at the beginning cause the problem I described.
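For anyone else unsure about this, here is a minimal sketch of top-level alternation in Python's re module. The patterns are made-up illustrations, not the expression discussed earlier in the thread.

```python
import re

# Alternation at the root of the pattern needs no enclosing group.
top_level = re.compile(r'cat|dog')
grouped = re.compile(r'(?:cat|dog)')

for word in ('cat', 'dog'):
    assert top_level.fullmatch(word) and grouped.fullmatch(word)

# `|` has the lowest precedence, so anchors bind to one alternative only:
# r'^a|b$' means (^a)|(b$), not ^(a|b)$.
loose = re.search(r'^a|b$', 'axx')        # matches the leading 'a'
strict = re.search(r'^(?:a|b)$', 'axx')   # None: the whole string is not 'a' or 'b'
print(loose is not None, strict is None)  # True True
```

The precedence rule is the part that tends to surprise: anything written on either side of a root-level | belongs to that alternative alone.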

I wonder why it didn't work the first time then?

True, regexes can't handle recursion. But that is more an academic issue than a practical one. You could easily compose or dynamically generate regular expressions that allow for nesting of parens or brackets to whatever depth you find reasonable -- 2 or 3 levels should be more than enough. (Frighteningly, Perl's regular expression engine *can* do recursive expressions, and there is a proposal to do something similar in Python.)
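As a rough sketch of the "dynamically generate" idea, the following builds a pattern that accepts parentheses nested up to a fixed depth. The helper name nested_parens and the sample strings are assumptions made for illustration only.

```python
import re

def nested_parens(depth: int) -> str:
    """Build a pattern for text whose parentheses nest at most `depth` deep."""
    # Depth 0: a run of characters containing no parentheses at all.
    pattern = r'[^()]*'
    for _ in range(depth):
        # One more level: plain text interleaved with parenthesised groups
        # whose contents match the previous level's pattern.
        pattern = r'[^()]*(?:\(' + pattern + r'\)[^()]*)*'
    return pattern

two_levels = re.compile(nested_parens(2))

print(two_levels.fullmatch('a(b(c)d)e') is not None)    # True: two levels deep
print(two_levels.fullmatch('a(b(c(d)))e') is not None)  # False: three levels deep
print(two_levels.fullmatch('a(b') is not None)          # False: unbalanced
```

The pattern roughly doubles in length with each level, which is one reason to stop at 2 or 3.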
The weird case isn't that hard:

>>> import re
>>> x = '[http://www.foo.com/bar(]).html]'
>>> r = re.compile(r'^\[(?P<url>http://[^)]*\)[^]]*)')
>>> r.match(x).group('url')
'http://www.foo.com/bar(]).html'

But the weird case does illuminate a flaw with any kind of regular-expression-based parsing of arbitrary input. You are going to spend *a lot* of time dealing with special cases and evaluation ordering issues.

The problem with your weird cases is that they're always individual regexes and not combined into the über-regex. I'm just wondering if it's possible to combine all the cases such that it works properly for everything - you may run into cases where, depending on how you order the alternatives, different URLs parse incorrectly. But that's just because there are so many weird ways to write URLs.
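To make the ordering worry concrete, here is a small sketch with two invented sub-patterns (greedy and no_paren are assumptions for illustration, not anything proposed in this thread). Python's re tries alternatives left to right and keeps the first one that succeeds, so each ordering gets a different input wrong.

```python
import re

greedy = r'http://\S+'         # runs to the next whitespace
no_paren = r'http://[^\s)]+'   # stops before a closing parenthesis

greedy_first = re.compile(greedy + '|' + no_paren)
no_paren_first = re.compile(no_paren + '|' + greedy)

wrapped = '(see http://www.foo.com/bar)'          # URL inside parentheses
inner = 'see http://www.foo.com/bar(1).html now'  # parentheses inside the URL

print(greedy_first.search(wrapped).group())    # http://www.foo.com/bar)  (swallows the paren)
print(greedy_first.search(inner).group())      # http://www.foo.com/bar(1).html
print(no_paren_first.search(wrapped).group())  # http://www.foo.com/bar
print(no_paren_first.search(inner).group())    # http://www.foo.com/bar(1  (cut short)
```

Neither ordering handles both inputs, and adding a third alternative to patch one of them can shift the problem somewhere else.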

I'm personally a fan of just finding the *beginning* of the URL (like, say, the http://user:pass@domain bit) with a regex and then figuring out the path with non-regex code (because you can encode all the logic you want much more easily that way).
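A rough sketch of that split, in Python to match the examples above. Everything here is an assumption made for illustration: the URL_START pattern, the find_urls helper, and the particular rules (stop at whitespace, stop at a closing bracket that was never opened inside the URL, drop trailing sentence punctuation). It is one possible heuristic, not a complete URL parser.

```python
import re

# Regex for the *start* only: scheme, optional user[:pass]@, host, optional port.
URL_START = re.compile(
    r'https?://(?:[^\s/@:]+(?::[^\s/@]*)?@)?[A-Za-z0-9.-]+(?::\d+)?'
)

# Closing bracket -> the opener it pairs with.
CLOSERS = {')': '(', ']': '[', '>': '<'}

def find_urls(text):
    """Return URLs found in `text` (hypothetical helper, heuristic rules)."""
    urls = []
    pos = 0
    while True:
        m = URL_START.search(text, pos)
        if m is None:
            return urls
        end = m.end()
        open_count = {'(': 0, '[': 0, '<': 0}
        # Plain code walks the rest of the URL one character at a time.
        while end < len(text) and not text[end].isspace():
            ch = text[end]
            if ch in open_count:
                open_count[ch] += 1
            elif ch in CLOSERS:
                opener = CLOSERS[ch]
                if open_count[opener] == 0:
                    break  # this closer belongs to the surrounding text
                open_count[opener] -= 1
            end += 1
        # Trailing sentence punctuation is assumed not to be part of the URL.
        urls.append(text[m.start():end].rstrip('.,;!?'))
        pos = end

print(find_urls('(see http://www.foo.com/bar) and http://www.foo.com/bar(1).html.'))
# ['http://www.foo.com/bar', 'http://www.foo.com/bar(1).html']
```

This handles both the wrapped and the embedded-parenthesis inputs with one set of rules, though it would still stop early at the stray ] in the deliberately pathological bar(]).html case quoted above. The point is only that each new rule becomes an ordinary if statement rather than another alternative in the pattern.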

--
Kevin Ballard
email@hidden
http://www.tildesoft.com
http://kevin.sb.org
