Re: Removing html tags
- Subject: Re: Removing html tags
- From: has <email@hidden>
- Date: Tue, 1 Mar 2005 12:01:17 +0000
Marc K. Myers wrote:
> What it can't handle is text like "If x<3 and y>10, what are the solutions?"
That's invalid HTML - the < and > symbols should be escaped as &lt;
and &gt; - though not an uncommon mistake. A forgiving browser will
simply check the '3' against its list of known HTML element names and
when it doesn't find a match it'll assume the < and > symbols are
actually intended as content, not tags, and escape them itself. Your
average real-world web browser is filled with code to deal with
goofy, malformed and deeply broken HTML.
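
For illustration, here is a minimal sketch of the escaping described above, written in Python rather than AppleScript (my choice of language, not something from the original thread). It uses the standard library's html.escape and html.unescape; the variable names are just placeholders.

```python
import html

# The example sentence from the quoted message, with raw < and > in it.
raw = 'If x<3 and y>10, what are the solutions?'

# quote=False escapes only &, < and >, which is all that body text needs.
escaped = html.escape(raw, quote=False)
print(escaped)  # If x&lt;3 and y&gt;10, what are the solutions?

# Decoding the entities gives back the original text.
restored = html.unescape(escaped)
print(restored == raw)  # True
```

Text escaped this way is valid HTML content, so a parser never has to guess whether "<3" was meant as a tag.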
BTW, if anyone really is mad enough to write their own HTML parser
from scratch, this will get you started:
http://applemods.sourceforge.net/mods/Internet/HTMLParser.php
It's a simple SAX-style parser, basically a vanilla AppleScript port of
Python's HTMLParser module, and much smarter than your average naive regex
or TID-based (text item delimiters) [non-]solution. To build a tag stripper you'll need to
provide your own HTML entity decoding and whitespace handling, plus
some sort of state machine to make sense of various significant tags
(mostly block-level tags like <head>, <title>, <p>, <li>, <hr>, etc.
and the odd inline one like <br>). All quite doable - I once wrote a
very basic pretty-printed plain-text renderer just for kicks. But it
takes a fair bit of knowledge of program design and HTML, plus lots
and lots of code and lookup tables, to pull off, so it's both
mind-numbingly boring and ultimately pointless when there are already
third-party solutions that have solved this problem properly.
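
To give a rough idea of the moving parts, here is a small sketch of a tag stripper built on Python's own html.parser.HTMLParser (the module the AppleScript port is based on), not on the AppleScript library itself. The names TagStripper, strip_tags, BLOCK_TAGS and SKIP_TAGS are made up for this example, and the tag tables are deliberately tiny; a real renderer would need much fuller ones. Entity decoding is left to the parser's convert_charrefs option, and the "state machine" is reduced to a skip counter plus line breaks at block-level tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete lookup tables for illustration.
BLOCK_TAGS = {"title", "p", "div", "li", "hr", "br", "h1", "h2", "h3", "tr"}
SKIP_TAGS = {"script", "style"}  # content that should never reach the output


class TagStripper(HTMLParser):
    """Collects text content, inserting line breaks at block-level tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a SKIP_TAGS element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html_text):
    """Return the text content of html_text, one line per block."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Whitespace handling: collapse runs within a line, drop empty lines.
    lines = (re.sub(r"[ \t\r\f\v]+", " ", line).strip()
             for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


print(strip_tags("<p>Fish &amp; chips</p><p>Tea<br>time</p>"))
```

Even this toy version ignores lists, tables, preformatted text, links and unclosed tags, which is where the lookup tables and extra state start piling up.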
HTH
has
--
http://freespace.virgin.net/hamish.sanderson/