Re: HTML parsing

Subject: Re: HTML parsing
From: has <email@hidden>
Date: Thu, 5 Sep 2002 00:55:27 +0100

Frank Miedreich wrote:

>The first problem is going to be that few (i.e. almost no) HTML pages
>are actually well-formed XML documents, thus you can't use most XML
>parsers directly.

Aye, well there's the rub. A full-blown HTML parser is a monster of a beast
which spends most of its time trying to deal with errors in HTML as
forgivingly as it can. It takes a huge amount of code, needs to know an
awful lot about the myriad versions of HTML and their formatting rules, and
has to deal with even the most idiotic of malformed markup. I took a poke
around in Python's bundled HTML parser, and noticed right away that it's
designed for HTML 2.0, which is an antiquated and obsolete standard (how
well it handles poorly formed markup I don't know, as I didn't try it).

This is one of the problems XHTML is meant to solve, of course: well-formed
XHTML is an absolute doddle to parse, even with the most generic of
parsers.
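
To make that concrete, here is a minimal sketch in present-day Python, using
the generic XML parser from its standard library (xml.etree.ElementTree). The
XHTML fragment and the function names are invented for illustration. One
caveat: a generic XML parser knows nothing about HTML's named entities such
as &nbsp; unless it is given the DTD, so this assumes a document that avoids
them.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML document, made up for this example.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>See <a href="one.html">one</a> and <a href="two.html">two</a>.</p>
  </body>
</html>"""

# XHTML elements live in this namespace, so lookups must name it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def is_well_formed(source):
    """Return True if a generic XML parser accepts the markup."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


def title_and_links(source):
    """Return the page title and every href found on an <a> element."""
    root = ET.fromstring(source)
    title = root.find("x:head/x:title", NS).text
    hrefs = [a.get("href") for a in root.iterfind(".//x:a", NS)]
    return title, hrefs


title, hrefs = title_and_links(XHTML)
print(title, hrefs)

# Typical tag soup is rejected outright, which is the point being made above.
print(is_well_formed("<p>unclosed<br></p>"))
```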

>You should begin with piping the HTML page through tidy
>(http://tidy.sourceforge.net/) to generate XHTML

Fair suggestion, though I don't know whether any Tidy-based apps are
scriptable, which would likely help a lot of scripters. It should pull
average markup into line: lowercasing tags, closing elements properly, etc.
Whether it makes a useful job of straightening out shoddy markup is another
question, however - human stupidity still beats machine intelligence hands
down most every time when it comes to creating indecipherable markup.;)

>You would still need to implement the DOM-API-to-Apple-event interface.

Mmmm, I have a hunch the OP probably isn't looking to write his own C-based
HTML parser here...:) Besides, you may be better off doing most or all of
the work in AppleScript (AS); you'll get more flexibility that way.

The lack of any decent (X)HTML parser in AS is a bit of a problem. A
scriptable app may provide a solution of sorts, and XML Tools sort of
provides one too (if you're parsing XHTML, _and_ you can stand tag
attributes being output as horribly inflexible records). But a lot really
depends on what you want to do with the HTML you're parsing: dumping it
into a DOM may not be appropriate if all you want to do is extract links
[1] or return the content stripped of links.
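
As a rough illustration of doing those two jobs without building a DOM, here
is a sketch using the event-based HTMLParser class from present-day Python's
standard library (not the HTML 2.0 era parser mentioned earlier). The class
names LinkCollector and LinkStripper are made up for this example, and the
stripper is deliberately naive: it drops comments and doctypes, and it
rewrites end tags in lowercase.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as they stream past."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


class LinkStripper(HTMLParser):
    """Re-emit the markup minus <a> start/end tags, keeping the link text."""

    def __init__(self):
        # Leave entities undecoded so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied through once.
        if tag != "a":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


SAMPLE = "<p>See <A HREF=\"one.html\">one</a> and <a href='two.html'>two</a>.</p>"

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()

stripper = LinkStripper()
stripper.feed(SAMPLE)
stripper.close()
stripped = "".join(stripper.out)

print(collector.links)
print(stripped)
```

Neither class ever holds more than the current tag in hand, which is the
appeal of skipping the DOM for jobs this small.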

I've been meaning to write a vanilla XHTML parser for a while now, and I've
already done some of it: tag extraction, parsing attributes into
associative arrays, white space normalisation. The bits I haven't done yet
are entity decoding, which is relatively trivial, and a public interface,
which is not (being primarily a design problem, with a bit of swearing at
AS's limited language features on the side). I could probably post the code
to my site sometime if folk are interested. And if anyone's got any good
ideas for an interface, feel free to drop me a note (I figure it'll need to
be an OO design, but I'm still a bit fuzzy on the details).
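
For a feel of what those pieces amount to, here is a small Python sketch of
the same four steps: tag extraction, attributes into an associative array (a
dict), white space normalisation and entity decoding. It is not the
AppleScript code described above; every name in it is invented for
illustration, and it leans on html.unescape from present-day Python's
standard library for the entity step. It assumes well-formed XHTML with
quoted attribute values, and it makes no attempt at comments, doctypes,
CDATA, or a ">" inside an attribute value.

```python
import html
import re

# <name ...>, </name> or <name ... />; assumes no ">" inside attribute values.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)([^<>]*?)(/?)>")
# name="value" or name='value'; XHTML requires the quotes.
ATTR_RE = re.compile(
    r"([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*(?:\"([^\"]*)\"|'([^']*)')"
)


def normalise_space(text):
    """Collapse each run of white space to a single space."""
    return re.sub(r"\s+", " ", text)


def parse_attributes(raw):
    """Turn the raw attribute text of a start tag into a dict."""
    attrs = {}
    for name, double_quoted, single_quoted in ATTR_RE.findall(raw):
        attrs[name] = html.unescape(double_quoted or single_quoted)
    return attrs


def clean_text(raw):
    """Normalise white space first, then decode entities."""
    return html.unescape(normalise_space(raw))


def tokenize(source):
    """Split markup into ("start", name, attrs), ("end", name), ("text", s)."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(source):
        text = clean_text(source[pos:match.start()])
        if text:
            tokens.append(("text", text))
        closing, name, raw_attrs, empty = match.groups()
        if closing:
            tokens.append(("end", name))
        else:
            tokens.append(("start", name, parse_attributes(raw_attrs)))
            if empty:
                # <br /> is reported as a start immediately followed by an end.
                tokens.append(("end", name))
        pos = match.end()
    tail = clean_text(source[pos:])
    if tail:
        tokens.append(("text", tail))
    return tokens


SAMPLE = (
    '<p class="intro">Fish   &amp;\n chips '
    '<a href="menu.html" title=\'Full menu\'>here</a><br /></p>'
)

tokens = tokenize(SAMPLE)
for token in tokens:
    print(token)
```

A flat token list like this is only the bottom layer; the public interface
on top of it is the open design question.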

Cheers,

has

[1] Link extraction is pretty simple, mind you, and can just about be
handled with a decent regular expression, rather than a parser.
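
A sketch of such an expression, written here with Python's re module; the
pattern and the function name are illustrative rather than taken from any
existing script. The "just about" still applies: it will also match
attributes like data-href, and it knows nothing about comments or script
blocks.

```python
import re

# Match href on <a> tags only, with the value double-quoted, single-quoted
# or bare. \b after "<a" keeps tags such as <area> and <abbr> out.
HREF_RE = re.compile(
    r"<a\b[^>]*?\bhref\s*=\s*(?:\"([^\"]*)\"|'([^']*)'|([^\s>]+))",
    re.IGNORECASE,
)


def extract_links(markup):
    """Return the href value of every <a> tag, in document order."""
    return [
        double_quoted or single_quoted or bare
        for double_quoted, single_quoted, bare in HREF_RE.findall(markup)
    ]


SAMPLE = (
    '<A HREF="one.html">1</A> '
    "<a class=x href='two.html'>2</a> "
    "<a href=three.html>3</a> "
    '<area href="no.html">'
)

links = extract_links(SAMPLE)
print(links)
```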

--
http://www.barple.pwp.blueyonder.co.uk -- The Little Page of AppleScripts

Follow-Ups:
- Re: HTML parsing
  - From: Ken Scott <email@hidden>
