Re: Regex pattern to find URLs
- Subject: Re: Regex pattern to find URLs
- From: "b.bum" <email@hidden>
- Date: Sat, 6 Nov 2004 17:37:59 -0800
On Nov 6, 2004, at 1:54 PM, Kevin Ballard wrote:
> Uhh, that first match won't even work - your regex requires a ( at the
> beginning of the string.
It works fine. That was copy/pasted directly from a Terminal session
that demonstrated it working. You could fire up the Python interpreter
and do the same.
It was also a demonstration of why I do almost all regular
expression work in Python. Between the named subexpressions, the
ability to experiment within the command line interpreter, and the ease
with which one can put together a test harness, it is the only way I
can do complex regular expressions and remain sane. Since most regular
expression engines share most of their syntax, I will typically build
the regexes in Python, then port to the target engine as needed
(generally, by eliminating the named subexpressions).
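As a rough illustration of that workflow, here is a minimal sketch of a pattern with named subexpressions, followed by a naive way to strip the names for an engine that lacks them. The pattern, the group names, and the helper split_url are made up for this example; they are not the regex from earlier in the thread, and the name-stripping step assumes the pattern has no backreferences to the names.

```python
import re

# Hypothetical pattern for illustration only.
URL_RE = re.compile(r"(?P<scheme>https?)://(?P<host>[^/\s]+)(?P<path>/\S*)?")

def split_url(text):
    """Return the named pieces of the first URL found in text, or None."""
    match = URL_RE.search(text)
    return match.groupdict() if match else None

parts = split_url("see http://www.foo.com/bar.html for details")
print(parts)

# Porting sketch: drop the "?P<name>" markers so only plain groups remain.
# This simple substitution would not handle (?P=name) backreferences.
PORTABLE = re.sub(r"\?P<\w+>", "", URL_RE.pattern)
print(PORTABLE)
```

The groups keep their order after the names are removed, so group 1 is still the scheme, group 2 the host, and group 3 the path.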
> I do see what you mean how if you use an alternation you could
> possibly get a URL surrounded by ()'s, but then what about <> and []?
> And then what about (http://www.foo.com/bar(blah).html)? Humans can
> tell the inner parens are for the URL but I can't imagine how a regex
> can.
It is just a matter of stringing together enough regex goop to make it
work. It only breaks down when there are true ambiguities.
(http://www.foo.com/bar(blah).html) can be matched. You would need to
use a subexpression something like this:
    ^\(http:[^(]*\([^)]*\)[^)]*
That is, skip the paren before the http:, consume all characters up to
an inner (, consume the inner (, consume all characters up to inner ),
consume inner ), consume all characters up to last ), done.
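To try that subexpression in the Python interpreter, something like the following should do. The regex is the one given above, unchanged; the variable names are just for this sketch.

```python
import re

# The subexpression from this message, unchanged.
NESTED_PAREN_URL = re.compile(r"^\(http:[^(]*\([^)]*\)[^)]*")

sample = "(http://www.foo.com/bar(blah).html)"
match = NESTED_PAREN_URL.match(sample)

# The match includes the leading "(" but stops before the final ")".
# Drop the first character to get the URL itself.
url = match.group(0)[1:] if match else None
print(url)
```

Wrapping the URL portion in a group would avoid the slicing step; it is left out here to keep the pattern identical to the one described.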
Regular expressions are just big state maps. If a human can read a
string and tell, without ambiguity, what is and is not a part of the
URL, then a human can write a regular expression to do the same.
However, the ultimate challenge remains: ensure that the
subexpressions are evaluated in most specific to least specific order.
Again, from a code clarity standpoint, really complex regular
expressions with lots of subexpressions are a nightmare to maintain.
I would highly recommend maintaining an array of pre-compiled regular
expressions that are evaluated in order against the candidate match
string.
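A minimal sketch of that arrangement follows. The patterns, the group name url, and the function find_url are all assumptions made up for this example; a real list would likely be longer and tuned to the URLs you actually see.

```python
import re

# Hypothetical patterns, ordered from most specific to least specific.
URL_PATTERNS = [
    # (http://host/path(inner)rest): wrapped in parens, with inner parens
    re.compile(r"\((?P<url>http:[^()\s]*\([^()\s]*\)[^()\s]*)\)"),
    # (http://host/path): wrapped in parens, no inner parens
    re.compile(r"\((?P<url>http:[^()\s]*)\)"),
    # <http://host/path>: wrapped in angle brackets
    re.compile(r"<(?P<url>http:[^<>\s]*)>"),
    # Bare URL: fall back to everything up to the next whitespace
    re.compile(r"(?P<url>http:\S+)"),
]

def find_url(text):
    """Return the URL from the first pattern that matches, or None."""
    for pattern in URL_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group("url")
    return None

print(find_url("(http://www.foo.com/bar(blah).html)"))
print(find_url("see (http://www.foo.com/) ok"))
```

Because the first match wins, the order of the list is the whole design: putting the bare-URL fallback first would swallow the closing paren in the wrapped cases.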
This also begs for one honkin' big unit test. Create a file that
contains a huge selection of URLs that you want to be able to match
properly, then write a test case that runs the file through your set of
regular expressions to ensure that matching behaves as you like. As
edge cases are identified, add them to the input file, make sure the
test fails, then fix your regular expressions (or add a new one).
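One possible shape for such a test, as a sketch only. The CASES string stands in for the contents of the input file, and its " => " separator, the pattern list, and the helper names are all assumptions invented for this example.

```python
import re
import unittest

# Hypothetical ordered pattern list (most specific first); substitute your own.
URL_PATTERNS = [
    re.compile(r"\((?P<url>http:[^()\s]*\([^()\s]*\)[^()\s]*)\)"),
    re.compile(r"\((?P<url>http:[^()\s]*)\)"),
    re.compile(r"(?P<url>http:\S+)"),
]

# Stand-in for the test file: one "input text => expected URL" per line.
# The " => " separator is an assumed format chosen for this sketch.
CASES = """\
(http://www.foo.com/bar(blah).html) => http://www.foo.com/bar(blah).html
see (http://www.foo.com/) => http://www.foo.com/
visit http://www.foo.com/bar today => http://www.foo.com/bar
"""

def find_url(text):
    """Return the URL from the first pattern that matches, or None."""
    for pattern in URL_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group("url")
    return None

def failing_cases(case_text):
    """Return (input, expected, actual) for every case that does not match."""
    failures = []
    for line in case_text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the case list
        text, expected = line.rsplit(" => ", 1)
        actual = find_url(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

class URLMatchingTest(unittest.TestCase):
    def test_all_cases_match(self):
        # An empty failure list means every case matched as expected.
        self.assertEqual(failing_cases(CASES), [])
```

Saved as a module, this can be run with python -m unittest. In practice the cases would be read from a file on disk, and each new edge case becomes one more line in that file.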
b.bum
Cocoa-dev mailing list (email@hidden)