Re: HTML parsing
- Subject: Re: HTML parsing
- From: Ken Scott <email@hidden>
- Date: Thu, 05 Sep 2002 21:19:16 -0600
on 9/4/02 5:55 PM, has at email@hidden wrote:
> Frank Miedreich wrote:
>
> > The first problem is going to be that few (i.e. almost no) HTML pages
> > are actually well-formed XML documents, thus you can't use most XML
> > parsers directly.
>
> Aye, well there's the rub. A full-blown HTML parser is a monster of a beast
> which spends most of its time trying to deal with errors in HTML as
> forgivingly as it can. It's a huge amount of code, needs to know an awful
> lot about the myriad versions of HTML and its formatting rules, and has to
> be able to deal with even the most idiotic of malformed markup. I took a
> poke at Python's bundled HTML parser, and noticed right away that it's
> designed for HTML 2.0, which is an antiquated and obsolete standard (how
> well it handles poorly formed markup I don't know as I didn't try it).
>
> This is one of the problems XHTML is meant to solve, of course: well-formed
> XHTML is an absolute doddle to parse, even with the most generic of
> parsers.
>
> > You should begin with piping the HTML page through tidy
> > (http://tidy.sourceforge.net/) to generate XHTML
>
> Fair suggestion, though I don't know whether any Tidy-based apps are
> scriptable, which would likely help a lot of scripters. It should pull
> average markup into line: lowercasing tags, closing elements properly, etc.
> Whether it makes a useful job of straightening out shoddy markup is another
> question, however - human stupidity still beats machine intelligence hands
> down most every time when it comes to creating indecipherable markup. ;)

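To make the quoted point about well-formed XHTML concrete, here is a minimal sketch in Python using the standard library's generic XML parser. The sample page and the helper names (extract_links, is_well_formed) are made up for illustration; nothing here comes from the original posters.

```python
import xml.etree.ElementTree as ET

# Namespace that XHTML elements live in; ElementTree spells it "{uri}tag".
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample page, invented for this example.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>See <a href="http://tidy.sourceforge.net/">Tidy</a> and
    <a href="http://www.w3.org/">W3C</a>.</p>
  </body>
</html>"""


def extract_links(doc):
    """Return (href, link text) pairs for every <a> element in an XHTML string."""
    root = ET.fromstring(doc)
    return [(a.get("href"), a.text) for a in root.iter(XHTML_NS + "a")]


def is_well_formed(doc):
    """Report whether a generic XML parser accepts the document at all."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False
```

Calling extract_links(SAMPLE_XHTML) walks the tree with no HTML-specific knowledge, while is_well_formed("<p>one<br><p>two") comes back False, which is the kind of everyday markup that stops a plain XML parser cold.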
Depending on how scriptable you need it to be, you could install the
command-line version of tidy and call it with a do shell script statement.
If I understand the way that do shell script works, the cleaned HTML should
be returned to you and could go into a variable.
I'm still new at some of this scripting stuff, but I think I understand this
part of it. If not, please let me know.
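
The general idea of running command-line tidy and catching its output in a variable can be sketched as follows. This is in Python rather than AppleScript, purely as an illustration; in AppleScript, do shell script is the command that hands back a shell command's output as text. The function name run_filter is hypothetical, and the sketch assumes a tidy binary is on the PATH and accepts the -asxhtml, -quiet and -utf8 options.

```python
import subprocess


def run_filter(html, cmd=("tidy", "-asxhtml", "-quiet", "-utf8")):
    """Pipe `html` through a command-line filter and return what it prints.

    The default command is an assumption about the local tidy install;
    pass a different `cmd` if your setup differs.
    """
    result = subprocess.run(
        list(cmd),
        input=html.encode("utf-8"),   # fed to the command on stdin
        stdout=subprocess.PIPE,       # cleaned markup comes back here
        stderr=subprocess.PIPE,       # warnings and errors go here
    )
    # tidy exits with 1 when it only issued warnings and 2 on real errors,
    # so 0 and 1 are both treated as usable output.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

With tidy installed, cleaned = run_filter("<p>some sloppy markup") should leave the XHTML version in the variable, ready to hand to an XML parser.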
Ken
--
<>< Ken Scott email@hidden
http://www.pcisys.net/~kscott
This is the day that the Lord has made;
Let us rejoice and be glad in it -- Psalm 118:24