Re: Regex pattern to find URLs

  • Subject: Re: Regex pattern to find URLs
  • From: "b.bum" <email@hidden>
  • Date: Sat, 6 Nov 2004 17:37:59 -0800

On Nov 6, 2004, at 1:54 PM, Kevin Ballard wrote:
> Uhh, that first match won't even work - your regex requires a ( at the beginning of the string.

It works fine. That was copy/pasted directly from a Terminal session that demonstrated it working. You could fire up the Python interpreter and do the same.


It was also a demonstration of why I do almost all regular expression work in Python. Between the named subexpressions, the ability to experiment within the command line interpreter, and the ease with which one can put together a test harness, it is the only way I can do complex regular expressions and remain sane. Since most regular expression engines share most of their syntax, I will typically build the regexes in Python, then port to the target engine as needed (generally, by eliminating the named subexpressions).
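To illustrate that workflow, here is a minimal sketch using Python's named groups. The pattern, the sample sentence, and the naive "porting" step are assumptions made up for this example, not the regex from earlier in the thread; the porting step only strips group names and would not handle named backreferences.

```python
import re

# Illustrative pattern with named subexpressions (not the thread's original).
url_re = re.compile(
    r"(?P<scheme>https?)://(?P<host>[^/\s()]+)(?P<path>/[^\s()]*)?"
)

m = url_re.search("see http://www.foo.com/bar.html for details")
print(m.group("scheme"), m.group("host"), m.group("path"))

# Naive port to an engine without named groups: turn "(?P<name>" into "(".
ported = re.sub(r"\(\?P<[^>]+>", "(", url_re.pattern)
print(ported)
```

The named groups make it easy to see which piece of the pattern captured what while experimenting in the interpreter; the ported pattern captures the same pieces by position instead.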

> I do see what you mean how if you use an alternation you could possibly get a URL surrounded by ()'s, but then what about <> and []? And then what about (http://www.foo.com/bar(blah).html)? Humans can tell the inner parens are for the URL but I can't imagine how a regex can.

It is just a matter of stringing together enough regex goop to make it work. It only breaks down when there are true ambiguities. (http://www.foo.com/bar(blah).html) can be matched. You would need to use a subexpression something like this:


	^\(http:[^(]*\([^)]*\)[^)]*

That is, skip the paren before the http:, consume all characters up to an inner (, consume the inner (, consume all characters up to inner ), consume inner ), consume all characters up to last ), done.
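The subexpression above can be checked directly in the Python interpreter. This is a small sketch; the variable names are only illustrative, and the sample string is the one from the question.

```python
import re

# Pattern from the message: match the outer "(", then "http:" up to and
# including one inner "(...)" pair, then everything before the final ")".
pattern = re.compile(r"^\(http:[^(]*\([^)]*\)[^)]*")

text = "(http://www.foo.com/bar(blah).html)"
m = pattern.match(text)

# The match includes the leading "(", so drop it to get the URL itself.
url = m.group(0)[1:] if m else None
print(url)
```

The match stops just before the closing paren, so the trailing ")" is never treated as part of the URL.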

Regular expressions are just big state maps. If a human can read a string and tell, without ambiguity, what is and is not a part of the URL, then a human can write a regular expression to do the same.

However, the ultimate challenge remains: ensure that the subexpressions are evaluated in most specific to least specific order.

Again, from a code clarity standpoint, really complex regular expressions with lots of subexpressions are a nightmare to maintain. I would highly recommend maintaining an array of pre-compiled regular expressions that are evaluated in order against the candidate match string.
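One way such an ordered array might look in Python is sketched below. The specific patterns, the `URL_PATTERNS` name, and the `find_url` helper are hypothetical, chosen only to show most-specific-first evaluation; a real set would need many more cases.

```python
import re

# Hypothetical patterns, ordered from most specific to least specific.
URL_PATTERNS = [
    # URL wrapped in parens that itself contains one inner (...) pair
    re.compile(r"\((?P<url>http:[^()\s]*\([^()\s]*\)[^()\s]*)\)"),
    # URL wrapped in parens, no inner parens
    re.compile(r"\((?P<url>http:[^()\s]+)\)"),
    # URL wrapped in angle brackets
    re.compile(r"<(?P<url>http:[^<>\s]+)>"),
    # Bare URL: fall back to everything up to the next whitespace
    re.compile(r"(?P<url>http:\S+)"),
]

def find_url(text):
    """Return the URL from the first pattern that matches, or None."""
    for pattern in URL_PATTERNS:
        m = pattern.search(text)
        if m:
            return m.group("url")
    return None

print(find_url("(http://www.foo.com/bar(blah).html)"))
print(find_url("see (http://www.foo.com/) now"))
```

The order matters here: if the bare-URL fallback were tried first, it would also swallow the closing paren in the first example.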

This also begs for one honkin' big unit test. Create a file that contains a huge selection of URLs that you want to be able to match properly, then write a test case that runs the file through your set of regular expressions to ensure that matching behaves as you like. As edge cases are identified, add them to the input file, make sure the test fails, then fix your regular expressions (or add a new one).
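A file-driven test along those lines could be sketched with Python's unittest module. Everything specific here is an assumption for illustration: the tab-separated "input, expected URL" format, the in-memory `CASES` stand-in for the real file, and the small `PATTERNS` list.

```python
import io
import re
import unittest

# Stand-in for the corpus file; in practice this would be an opened file.
# Assumed format: input text, a tab, then the URL expected to be extracted.
CASES = io.StringIO(
    "(http://www.foo.com/bar(blah).html)\thttp://www.foo.com/bar(blah).html\n"
    "<http://www.foo.com/>\thttp://www.foo.com/\n"
    "see http://www.foo.com/x for more\thttp://www.foo.com/x\n"
)

# Hypothetical patterns, most specific first.
PATTERNS = [
    re.compile(r"\((?P<url>http:[^()\s]*\([^()\s]*\)[^()\s]*)\)"),
    re.compile(r"<(?P<url>http:[^<>\s]+)>"),
    re.compile(r"(?P<url>http:\S+)"),
]

def find_url(text):
    """Return the URL from the first matching pattern, or None."""
    for pattern in PATTERNS:
        m = pattern.search(text)
        if m:
            return m.group("url")
    return None

class URLMatchingTest(unittest.TestCase):
    def test_corpus(self):
        CASES.seek(0)
        for line in CASES:
            text, expected = line.rstrip("\n").split("\t", 1)
            # subTest reports each failing corpus line separately.
            with self.subTest(text=text):
                self.assertEqual(find_url(text), expected)
```

Saved as a module, this can be run with `python -m unittest`. A new edge case is then just one more line in the corpus, which should fail until the patterns are fixed.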

b.bum
