Re: "+" and "-" are numbers.

Subject: Re: "+" and "-" are numbers.
From: has <email@hidden>
Date: Sun, 4 Aug 2002 23:44:34 +0100

Arthur J. Knapp wrote:

> Very nice :)
>
> What can you do in the area of URL parsing? ;-)

You mean extracting URLs from a larger string? Well, it ain't easy.
Extracting email addys is pretty trivial (I wrote a fast email extractor
library myself just for the helluvit; be happy to post it [if I can find
it] for anyone that's curious), but URLs are a whole different level of
complexity. I've thought about writing one, but I've no real use for such a
beast myself and there's no way I'm going to spend my valuable time on it
[1][2]. Here are a few pointers though:

Characters to consider:

LEGAL (all valid characters within a URL):

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789$-_.+!*'(),

RESERVED (I think you'll want to consider most/all of these):

;/?:@=&

UNSAFE (some you should consider, some you should not):

<>"#%{}|\^~[]`
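
For illustration only, the three character classes above can be written down as sets. This is a rough sketch in Python rather than AppleScript; the names (LEGAL, RESERVED, UNSAFE, URL_CHARS, is_url_char) are made up for the example, and the choice of which unsafe characters to accept is an assumption, not something settled above.

```python
import string

# Character classes as listed in the post (not a full URL grammar).
LEGAL = set(string.ascii_letters + string.digits + "$-_.+!*'(),")
RESERVED = set(";/?:@=&")
UNSAFE = set('<>"#%{}|\\^~[]`')

# Assumption: accept legal and reserved characters, plus '%', '#' and '~'
# from the unsafe set, since those turn up in real-world URLs.
URL_CHARS = LEGAL | RESERVED | set("%#~")

def is_url_char(ch):
    """Return True if ch may appear inside an extracted URL."""
    return ch in URL_CHARS
```

Which unsafe characters you let through is a judgment call; the delimiters < > and " are probably best treated as terminators.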

Protocols you might want to check for (there's probably more):

ftp
http
https
gopher
mailto
news
nntp
telnet
wais
file
prospero

Whether you write a system that searches for one specific protocol only, or
all known protocols, or a user-defined list of desired protocols, is up to
you. I'd think the last would be the preferred option.
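
A user-defined protocol list might look something like the following sketch (Python, for illustration; KNOWN_PROTOCOLS and search_prefixes are hypothetical names). It assumes you search for "scheme:" rather than "scheme://", since mailto and news addresses carry no slashes.

```python
# Default list taken from the protocols named above.
KNOWN_PROTOCOLS = ("ftp", "http", "https", "gopher", "mailto", "news",
                   "nntp", "telnet", "wais", "file", "prospero")

def search_prefixes(protocols=KNOWN_PROTOCOLS):
    """Return the lowercase 'scheme:' prefixes to search for."""
    return [p.lower() + ":" for p in protocols]
```

Calling search_prefixes(["HTTP", "ftp"]) restricts the search to just those two protocols.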

You'll probably also want to consider addresses written without a protocol
explicitly declared (and assume them to be http), i.e. "www.foo.com" will
be recognised as "http://www.foo.com".
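
A minimal sketch of that default-protocol rule, in Python for illustration (add_default_protocol is a hypothetical helper, and treating any address that starts with "www." as http is the assumption stated above):

```python
def add_default_protocol(url, default="http"):
    """Prefix a bare 'www.' address with the default protocol."""
    if url.lower().startswith("www."):
        return default + "://" + url
    return url
```

Addresses that already carry a protocol pass through unchanged.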

Your extractor should also be case-insensitive. This ought to introduce an
element of fun to the proceedings: to write an efficient vanilla extractor,
you'll really want to use a TID-based (text item delimiters) offset search,
as iterating over each character in the string will be abominably slow for
anything other than short strings.

So, for an efficient large-string extractor you'll need to maintain two
strings: the original, from which you extract, and a copy which you
normalise (lowercase) so you can perform TID-based searches. [I've used
this technique in the case-insensitive find-and-replace routine in
stringLib, if you want to see a practical example of what I'm talking
about.]
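
The two-string idea can be sketched as follows. This is Python rather than AppleScript, so str.find stands in for the TID-based offset search, and find_offsets is a made-up name. It assumes lowercasing does not change the length of the text (true for ASCII), so offsets found in the copy are valid in the original.

```python
def find_offsets(text, needle):
    """Return every offset at which needle occurs in text, ignoring case."""
    lowered = text.lower()      # normalised copy, used only for searching
    needle = needle.lower()
    offsets = []
    pos = lowered.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = lowered.find(needle, pos + 1)
    return offsets
```

The offsets are then used to slice the original string, so the extracted URL keeps its original case.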

You can then search for the protocol and/or "www" substrings and iterate
over the characters after each found offset [i.e. until you find an invalid
character] in order to get the full URL. (In the case of "www" extraction,
you'll need to iterate over the preceding characters as well, in case the
"www" is preceded by some sort of protocol.)

You'll also have some fun handling collisions under such a system
(iterating across each character in the string may be slow as molasses, but
at least it's straightforward) - it can be done though; you just have to
implement a system for remembering the start and end positions of
previously found URL substrings and consult it every time you check a new
offset.
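
Putting the search, the character scan and the collision check together might look roughly like this. It is a sketch in Python, not AppleScript; URL_CHARS, expand_match and extract_urls are hypothetical names, and the accepted character set and the default needles are assumptions for the example.

```python
import string

# Assumption: characters treated as part of a URL (legal + reserved, plus a
# few of the unsafe ones that are common in practice).
URL_CHARS = set(string.ascii_letters + string.digits + "$-_.+!*'(),;/?:@=&%#~")

def expand_match(text, offset):
    """Return (start, end) of the run of URL characters around offset."""
    start = offset
    while start > 0 and text[start - 1] in URL_CHARS:
        start -= 1
    end = offset
    while end < len(text) and text[end] in URL_CHARS:
        end += 1
    return start, end

def extract_urls(text, needles=("http:", "www.")):
    """Collect URLs around each needle, skipping spans already found."""
    lowered = text.lower()          # normalised copy, used only for searching
    found = []                      # (start, end) of every URL recorded so far
    for needle in needles:
        pos = lowered.find(needle)
        while pos != -1:
            # Collision check: is this offset inside an earlier URL?
            if not any(s <= pos < e for s, e in found):
                found.append(expand_match(text, pos))
            pos = lowered.find(needle, pos + 1)
    found.sort()
    return [text[s:e] for s, e in found]
```

Because punctuation such as "." "," and ")" is legal inside a URL, this sketch will swallow a full stop or bracket that directly follows an address; trimming those is left out here.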

All in all, this should reduce a competent AS programmer to a blubbering
wreck, crying for AS to grow some built-in regular expressions to take [at
least some of] the pain away. But it could be done. You might also want to
poke around some other scripting languages (notably those that have a
decent <ptui> module system) to see if they have anything equivalent that
you could purloin.

has

[1] Time much better spent, feet up, in the back garden soaking up the
summer sun, with a cold drink in one hand and a good book in the other.
Ahhh, heaven.

[2] Unless you want to grease my palms with some of that there gold stuff,
y'know...

--
Fare thee well, o tried and trusty dialup...
As of August 1st my email address has changed from
[email@hidden] to [email@hidden]
If you love me, please update your address books accordingly.
_______________________________________________
applescript-users mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/applescript-users
Do not post admin requests to the list. They will be ignored.
