RE: HTML Parsing in Objective-C?
- Subject: RE: HTML Parsing in Objective-C?
- From: "John Stiles" <email@hidden>
- Date: Wed, 17 Nov 2004 16:03:08 -0800
- Thread-topic: HTML Parsing in Objective-C?
Valid HTML isn't even supposed to be well-formed XML--that's a common misconception. HTML is an application of SGML, which resembles XML (itself a simplified subset of SGML) but follows different rules.
For example, consider the tag <img src="foo.gif">. In XML you are required to close it, either with </img> or inline as <img src="foo.gif" />. In HTML, you aren't supposed to close the <img> tag at all. The inline closing style <img src="foo.gif" /> may or may not be accepted by your browser, depending on how it handles stray characters inside a tag, but </img> is clearly incorrect HTML and will probably cause your browser to choke unless it is extremely lenient about unbalanced tags in the hierarchy.
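As a rough illustration of that difference, here is a minimal sketch in Python using only its standard library (xml.etree.ElementTree as a strict XML parser and html.parser as a lenient HTML parser). Python is only a stand-in here, and the helper names is_well_formed_xml, TagCollector and html_start_tags are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup):
    """Feed markup to the lenient HTML parser and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# HTML style: <img> is never closed.
unclosed = '<p><img src="foo.gif"></p>'
# XML/XHTML style: <img> is closed inline.
self_closed = '<p><img src="foo.gif" /></p>'

if __name__ == "__main__":
    for markup in (unclosed, self_closed):
        print(markup, is_well_formed_xml(markup), html_start_tags(markup))
```

The XML parser should reject the unclosed form and accept the self-closed one, while the HTML parser reports the <p> and <img> tags for both.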
XHTML is XML, but made to look as close to HTML as possible. Pragmatically, there's little reason to use XHTML for general purpose web pages, since it causes glitches in many common browsers. That's unfortunate, since it makes for better markup overall.
-----Original Message-----
From: cocoa-dev-bounces+jstiles=email@hidden [mailto:cocoa-dev-bounces+jstiles=email@hidden] On Behalf Of Agent M
Sent: Wednesday, November 17, 2004 3:43 PM
To: email@hidden
Subject: Re: HTML Parsing in Objective-C?
Note that common HTML is rarely well-formed, valid XML (XHTML), which
makes parsing generic HTML with an XML parser an exercise in futility.
Of course, if the XHTML is known to be conformant, then this point is
irrelevant.
For HTML parsing, the general consensus is that Perl's HTML::Parser
takes the cake. http://search.cpan.org/~gaas/HTML-Parser-3.38/Parser.pm
The easiest way to get this module running in a Cocoa app is with the
Perl-ObjC bridge.
A second option would be to hook directly into the HTML::Parser's SGML
backend with C.
I have used HTML::Parser with great success on even really poor
non-compliant HTML.
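
To give a feel for the event-driven style of parsing that copes with poor HTML, here is a minimal sketch. It uses Python's standard-library html.parser purely as a stand-in; it is not HTML::Parser and does not show the Perl-ObjC bridge. The names LinkExtractor, extract_links and tag_soup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased, whatever the source used.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    """Return every link target found in the markup, in document order."""
    parser = LinkExtractor()
    parser.feed(markup)
    parser.close()
    return parser.links


# Deliberately sloppy input: upper-case tags, an unquoted attribute value,
# badly nested <b>/<i>, and several missing end tags.
tag_soup = """
<HTML><BODY>
<p>Unclosed paragraph
<A HREF=first.html>no quotes, upper case</a>
<b><i>badly nested</b></i>
<a href="second.html">missing end tag
</BODY>
"""

if __name__ == "__main__":
    print(extract_links(tag_soup))
```

Because the parser only reports tags as it meets them and never insists on a balanced tree, both links are still found in input that an XML parser would reject outright.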
On Nov 17, 2004, at 6:20 PM, Mont Rothstein wrote:
> http://sope.opengroupware.org
>
> Has an Objective-C wrapper around libxml2 which can be used to parse
> HTML.
>
> The framework has both DOM and SAX support.
>
> The XML processing section is:
>
> http://sope.opengroupware.org/en/sope_xml/index.html
>
> -Mont
¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬
AgentM
email@hidden
¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Cocoa-dev mailing list (email@hidden)
This email sent to email@hidden