Re: HTML parsing

  • Subject: Re: HTML parsing
  • From: Ken Scott <email@hidden>
  • Date: Thu, 05 Sep 2002 21:19:16 -0600

on 9/4/02 5:55 PM, has at email@hidden wrote:

> Frank Miedreich wrote:
>
>> The first problem is going to be that few (i.e. almost no) HTML pages
>> are actually well-formed XML documents, thus you can't use most XML
>> parsers directly.
>
> Aye, well there's the rub. A full-blown HTML parser is a monster of a beast
> which spends most of its time trying to deal with errors in HTML as
> forgivingly as it can. Huge amount of code, needs to know an awful lot
> about the myriad versions of HTML and its formatting rules, and be able to
> deal with even the most idiotic of malformed markup. I took a poke at
> Python's bundled HTML parser, and noticed right away that it's designed for
> HTML 2.0, which is an antiquated and obsolete standard (how well it handles
> poorly formed markup I don't know, as I didn't try it).
>
> This is one of the problems XHTML is meant to solve, of course: well formed
> XHTML is an absolute doddle to parse, even with the most generic of
> parsers.
>
>
>> You should begin with piping the HTML page through tidy
>> (http://tidy.sourceforge.net/) to generate XHTML
>
> Fair suggestion, though I don't know whether any Tidy-based apps are
> scriptable, which would likely help a lot of scripters. It should pull
> average markup into line: lowercasing tags, closing elements properly, etc.
> Whether it makes a useful job of straightening out shoddy markup is another
> question, however - human stupidity still beats machine intelligence hands
> down most every time when it comes to creating indecipherable markup. ;)
>
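
To make the quoted point about XHTML concrete, here is a minimal sketch in
Python, using only the generic XML parser from its standard library
(xml.etree.ElementTree). The sample markup and the helper names extract_links
and is_well_formed are made up for illustration; they are not from the thread.

```python
import xml.etree.ElementTree as ET

# Elements in an XHTML document live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample page: every element closed, every attribute quoted.
WELL_FORMED = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>First <a href="one.html">link</a></p>
    <p>Second <a href="two.html">link</a></p>
  </body>
</html>"""

# Hypothetical "tag soup": unclosed <p> and <br>, unquoted attribute value.
TAG_SOUP = "<html><body><p>First<br><a href=one.html>link</a></body></html>"


def extract_links(xhtml):
    """Return the href of every <a> element in a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.iter(XHTML_NS + "a")]


def is_well_formed(markup):
    """Report whether a generic XML parser accepts the markup at all."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

With these samples, extract_links(WELL_FORMED) gives ['one.html', 'two.html'],
while is_well_formed(TAG_SOUP) is False: the generic parser gives up on the
sloppy page, which is the gap a cleanup pass through tidy is meant to close.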

Depending on how scriptable you need it to be, you could install the
command-line version of tidy and call it with a do shell script statement.
If I understand the way do shell script works, whatever tidy writes to
standard output (the cleaned HTML) is returned as the result and could go
straight into a variable.
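
The idea of running command-line tidy and catching the cleaned markup in a
variable can be sketched as follows. This is in Python rather than AppleScript
so that it runs anywhere Python does; in AppleScript the corresponding call
would be do shell script, which returns the command's standard output. The
helper names run_filter and tidy_to_xhtml are made up for illustration, and
the sketch assumes tidy is installed and on the PATH.

```python
import subprocess


def run_filter(text, command, ok_codes=(0, 1)):
    """Pipe text through a command-line filter and return what it prints.

    ok_codes defaults to (0, 1) because tidy exits with 1 when it only
    issued warnings and with 2 when it hit errors it could not repair.
    """
    result = subprocess.run(
        command,
        input=text,           # sent to the command's standard input
        capture_output=True,  # collect standard output and standard error
        text=True,
    )
    if result.returncode not in ok_codes:
        raise RuntimeError(result.stderr.strip() or "filter failed")
    return result.stdout


def tidy_to_xhtml(html):
    """Assumption: a command-line tidy is installed and on the PATH."""
    return run_filter(html, ["tidy", "-asxhtml", "-quiet"])
```

Calling tidy_to_xhtml(page_source) would then leave the XHTML in an ordinary
string, ready to hand to whatever parser comes next. Accepting exit status 1
matters here, since most real pages make tidy issue at least one warning.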

I'm still new at some of this scripting stuff, but I think I understand this
part of it. If not, please let me know.

Ken

--
<>< Ken Scott email@hidden http://www.pcisys.net/~kscott

This is the day that the Lord has made;
Let us rejoice and be glad in it -- Psalm 118:24
_______________________________________________
applescript-users mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/applescript-users
Do not post admin requests to the list. They will be ignored.
