Re: Parsing HTML
- Subject: Re: Parsing HTML
- From: has <email@hidden>
- Date: Sat, 4 Jan 2003 20:55:12 +0000
Randal L. Schwartz wrote:
Sal> <http://www.apple.com/applescript/guidebook/sbrt/pgs/sbrt.04.htm>
That code is, of course, flawed for *real* HTML[1], but will probably
work in many cases.
[...]
[1] Real HTML can contain
<tag1
attribute1="fo'o>b'ar"
attribute2='lef"t>r"ight'
attribute3=unquoted
>
some text
</tag1>
so you can't just scan to ">": you need to know if you're inside a
quoted attribute value or not. And notice that each kind of quotes
can contain the other kind of quotes. Yeah, messy problem, eh?
You also need to consider "<" and ">" in embedded JavaScript, as
well as unescaped "<" and ">" in general content. Case-insensitivity
is also essential in an HTML parser, unless you're dealing with valid
XHTML (which insists on all-lowercase tag and attribute names).
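To make the quoting problem concrete, here is a minimal Python sketch of a scanner that finds the real end of a tag by tracking whether it is inside a quoted attribute value. The function name find_tag_end is my own invention, and the sketch deliberately ignores comments, script blocks and stray quotes in unquoted values, so treat it as an illustration rather than a parser.

```python
def find_tag_end(html, start):
    """Return the index of the '>' closing the tag that opens at html[start].

    A '>' inside a single- or double-quoted attribute value is skipped.
    Returns -1 if the tag is never closed.
    Assumption: any quote character seen inside the tag starts a quoted value.
    """
    quote = None  # the quote character we are currently inside, if any
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:  # only the matching quote ends the value
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            return i
    return -1


# Randal's example tag from the footnote above
sample = """<tag1
attribute1="fo'o>b'ar"
attribute2='lef"t>r"ight'
attribute3=unquoted
>
some text
</tag1>"""

naive_end = sample.find(">")       # stops inside attribute1's value
real_end = find_tag_end(sample, 0)  # stops at the '>' on its own line
```

Scanning naively to the first ">" lands in the middle of attribute1, while the quote-aware scan reaches the closing ">" just before "some text".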
Real-world HTML requires a _very_ forgiving parser unless you can be
confident of its validity. I don't think anyone's ever tried to write
one in AS - the language isn't terribly fast and its built-in
text-manipulation facilities are pretty weak: you'd have a hard time
writing something that's idiot-proof that doesn't also grind terribly.
One option might be to run the files through something like HTML Tidy
before parsing them in AS. (I've not tried it, but there's a
command-line version that might be usable via 'do shell script'.)
Alternatively, use Perl or Python (which have sturdier HTML parsing
libraries) to process such nonsense and dump the results into a more
organised format which AS, with its more limited faculties, can parse
easily.
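As a rough idea of what that preprocessing step could look like, here is a sketch using Python's standard html.parser module. The Dumper class, the dump function and the tab-separated "TAG"/"TEXT" output format are all my own assumptions, chosen only because one record per line is easy to split with text item delimiters on the AS side.

```python
from html.parser import HTMLParser


class Dumper(HTMLParser):
    """Collect tags and text as (kind, value) records."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.records = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, which takes care of case-insensitivity
        self.records.append(("TAG", tag))

    def handle_endtag(self, tag):
        self.records.append(("TAG", "/" + tag))

    def handle_data(self, data):
        # collapse whitespace (including tabs and newlines) so that each
        # record fits on one tab-separated line
        text = " ".join(data.split())
        if text:
            self.records.append(("TEXT", text))


def dump(html):
    """Return one 'KIND<tab>value' line per tag or text run."""
    parser = Dumper()
    parser.feed(html)
    parser.close()
    return "\n".join(kind + "\t" + value for kind, value in parser.records)


print(dump('<P CLASS="a>b">Hello <B>world</B></P>'))
```

The output could be written to a file and read back in AS, splitting first on line breaks and then on tabs. Attributes are dropped here for brevity; they arrive in the attrs argument if you need them.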
As far as (unforgiving) parsing in AS goes, I'd use something like this:
======================================================================
--SIMPLE TAG PARSER
-------
--PRIVATE
on _tokenise(txt, delim)
    -- split txt on delim, restoring the caller's delimiters afterwards
    set oldTID to AppleScript's text item delimiters
    set AppleScript's text item delimiters to delim
    try
        set lst to txt's text items
    on error number -2706 -- stack overflow
        -- restore the delimiters before bailing out
        set AppleScript's text item delimiters to oldTID
        error "Too many tags. AppleScript go boom."
    end try
    set AppleScript's text item delimiters to oldTID
    return lst
end _tokenise
-------
--PUBLIC
on simpleTagParse(txt, receiverScript)
    script kludge -- list access speed hack
        property lst : _tokenise(txt, "<")
    end script
    -- everything before the first "<" is leading content
    set contentTxt to kludge's lst's first item
    if contentTxt contains ">" then error "error parsing html file."
    set kludge's lst to rest of kludge's lst
    receiverScript's processContent(contentTxt)
    repeat with chunkRef in kludge's lst
        -- each remaining chunk must be exactly "tag>content"
        set txtChunks to _tokenise(chunkRef's contents, ">")
        if (count of txtChunks) is not 2 then error "error parsing html file."
        receiverScript's processTag(txtChunks's first item)
        receiverScript's processContent(txtChunks's second item)
    end repeat
    return
end simpleTagParse
-------
--TEST
set txt to read alias "Macintosh HD:Users:has:test.html"
set receiverScript to load script alias "Macintosh HD:Users:has:ReceiverScript.scpt"
simpleTagParse(txt, receiverScript)
return receiverScript's getResult()
======================================================================
It won't tolerate unescaped < and > (including a > inside a quoted
attribute value); you'd need a much more sophisticated system which
understands HTML to cope with that. It should be more flexible and
adaptable than the Guidebook example though.
It doesn't do much: it just extracts content and tags, and passes them
to a second script (ReceiverScript.scpt) which can do whatever it wants
with the data. Here's a simple demonstration that separates tags and
contents into separate lists, but you could modify it to do almost
anything with the incoming data.
======================================================================
--DEMO RECEIVER SCRIPT
(*
    Used by simpleTagParse. Must contain 'processTag(txt)' and
    'processContent(txt)' handlers, but you can put any code you
    like into them (and the rest of the script).
*)
script _kludge -- list access speed hack
    property tagsList : {}
    property contentsList : {}
end script
on processTag(txt)
    set _kludge's tagsList's end to txt
end processTag
on processContent(txt)
    set _kludge's contentsList's end to txt
end processContent
on getResult()
    return {theTags:_kludge's tagsList, theContent:_kludge's contentsList}
end getResult
======================================================================
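To show what the receiver ends up holding, here is the same split-on-"<", then split-on-">" logic as a short Python sketch. The name simple_tag_parse and the tuple it returns are my own stand-ins for the AppleScript handlers above, not a port of anything that exists elsewhere.

```python
def simple_tag_parse(txt):
    """Split txt into tags and content using the two-stage split.

    Returns (tags, contents). Raises ValueError on input containing
    a stray '<' or '>', which is why this counts as an unforgiving parser.
    """
    chunks = txt.split("<")
    # everything before the first "<" is leading content
    if ">" in chunks[0]:
        raise ValueError("error parsing html file.")
    tags, contents = [], [chunks[0]]
    for chunk in chunks[1:]:
        # each remaining chunk must be exactly "tag>content"
        parts = chunk.split(">")
        if len(parts) != 2:
            raise ValueError("error parsing html file.")
        tags.append(parts[0])
        contents.append(parts[1])
    return tags, contents


tags, contents = simple_tag_parse("<p>Hello <b>world</b></p>")
# tags     is ['p', 'b', '/b', '/p']
# contents is ['', 'Hello ', 'world', '', '']
```

Note the empty strings in the content list: every tag is followed by a content item, even when nothing sits between two tags. Feeding it something like <a title="x>y"> raises the error, because the chunk splits into three pieces rather than two.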
There's also a partly-done XHTML parser + plain-text formatter on my
website if you're really curious. Not really in a condition where any
but the bravest of ASers could modify/extend it, but perhaps one of
these days I'll get around to finishing and documenting it all. (I'm
getting quite good at writing parsers, formatters, templating
libraries, etc - just not so good at finding time to polish and
release them...:)
has
--
http://www.barple.pwp.blueyonder.co.uk -- The Little Page of AppleScripts