Re: Parsing HTML

Subject: Re: Parsing HTML
From: has <email@hidden>
Date: Sat, 4 Jan 2003 20:55:12 +0000

Randal L. Schwartz wrote:

Sal> <http://www.apple.com/applescript/guidebook/sbrt/pgs/sbrt.04.htm>

That code is, of course, flawed for *real* HTML[1], but will probably
work in many cases.
[...]
[1] Real HTML can contain

<tag1
attribute1="fo'o>b'ar"
attribute2='lef"t>r"ight'
attribute3=unquoted
>
some text
</tag1>

so you can't just scan to ">": you need to know if you're inside a
quoted attribute value or not. And notice that each kind of quotes
can contain the other kind of quotes. Yeah, messy problem, eh?

You also need to consider "<" and ">" in embedded JavaScripts, as well as unescaped "<" and ">" in general content. Case-insensitivity is also essential in an HTML parser, unless you're dealing with valid XHTML (which insists on all-lowercase tags).
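To make the quoting problem concrete, here is a minimal Python sketch (not part of the Guidebook code or the script below) of a scanner that finds the ">" closing a tag while tracking whether it is inside a quoted attribute value. The name find_tag_end is made up for illustration, and the sketch deliberately ignores comments, embedded scripts, and stray quote characters in unquoted values, so treat it as a demonstration of the idea rather than a usable parser.

```python
def find_tag_end(html, start):
    """Return the index of the '>' closing the tag that opens at html[start].

    Tracks single- and double-quoted attribute values so that a '>'
    inside quotes is not mistaken for the end of the tag.
    Returns -1 if no closing '>' is found.
    """
    quote = None  # the quote character we are currently inside, if any
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None  # matching closing quote
        elif ch in "\"'":
            quote = ch  # opening quote; the other kind is now plain text
        elif ch == ">":
            return i
    return -1


# The example tag from the quoted message above.
sample = "\n".join([
    "<tag1",
    "attribute1=\"fo'o>b'ar\"",
    "attribute2='lef\"t>r\"ight'",
    "attribute3=unquoted",
    ">",
    "some text",
    "</tag1>",
])

end = find_tag_end(sample, 0)
print(repr(sample[end + 1:]))
```

A plain scan for the first ">" stops inside attribute1; the quote-aware scan carries on to the ">" that sits on a line of its own.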

Real-world HTML requires a _very_ forgiving parser unless you can be confident of its validity. I don't think anyone's ever tried to write one in AS - the language isn't terribly fast and its built-in text-manipulation facilities are pretty weak: you'd have a hard time writing something that's idiot-proof that doesn't also grind terribly.

One option might be to run the files through something like HTML Tidy before parsing them in AS. (I've not tried it, but there's a command-line version that might be usable via 'do shell script'.) Alternatively, use Perl or Python (which have sterner HTML parsing libraries) to process such nonsense and dump the results into a more organised format which AS, with its more limited faculties, can parse easily.
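As a rough illustration of the second route, the sketch below uses html.parser from Python's standard library to flatten a page into one tab-delimited record per line, which AS could then split with text item delimiters. The class name FlatDumper, the function dump, and the "kind<TAB>value" record layout are all assumptions made for this example, not an established format, and attributes are simply dropped to keep it short.

```python
from html.parser import HTMLParser


class FlatDumper(HTMLParser):
    """Collects tags and text as (kind, value) records."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.records = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attributes are ignored here.
        self.records.append(("open", tag))

    def handle_endtag(self, tag):
        self.records.append(("close", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace so tabs and newlines in the text
        # cannot be confused with the record delimiters.
        text = " ".join(data.split())
        if text:
            self.records.append(("text", text))


def dump(html):
    """Return the page as lines of the form kind<TAB>value."""
    parser = FlatDumper()
    parser.feed(html)
    parser.close()
    return "\n".join(kind + "\t" + value for kind, value in parser.records)


print(dump("<P>Hello <B>world</B></P>"))
```

On the AS side, each line then needs only two text-item splits (linefeed, then tab), which is well within what the language handles comfortably.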

As far as (unforgiving) parsing in AS goes, I'd use something like this:

======================================================================

--SIMPLE TAG PARSER

-------
--PRIVATE

on _tokenise(txt, delim)
    set oldTID to AppleScript's text item delimiters
    set AppleScript's text item delimiters to delim
    try
        set lst to txt's text items
    on error number -2706 -- stack overflow
        -- restore the delimiters before bailing out
        set AppleScript's text item delimiters to oldTID
        error "Too many tags. AppleScript go boom."
    end try
    set AppleScript's text item delimiters to oldTID
    return lst
end _tokenise

-------
--PUBLIC

on simpleTagParse(txt, receiverScript)
    script kludge -- list access speed hack
        property lst : _tokenise(txt, "<")
    end script
    -- the first item is any text that precedes the first tag
    set contentTxt to kludge's lst's first item
    if contentTxt contains ">" then error "error parsing html file."
    set kludge's lst to rest of kludge's lst
    receiverScript's processContent(contentTxt)
    -- every remaining chunk should have the form "tag>content"
    repeat with chunkRef in kludge's lst
        set txtChunks to _tokenise(chunkRef's contents, ">")
        if (count of txtChunks) is not 2 then error "error parsing html file."
        receiverScript's processTag(txtChunks's first item)
        receiverScript's processContent(txtChunks's second item)
    end repeat
    return
end simpleTagParse

-------
--TEST

set txt to read alias "Macintosh HD:Users:has:test.html"
set receiverScript to load script alias "Macintosh HD:Users:has:ReceiverScript.scpt"

simpleTagParse(txt, receiverScript)
return receiverScript's getResult()

======================================================================

This won't tolerate unescaped < and > (including a ">" inside a quoted attribute value): you'd need a much more sophisticated system that understands HTML to cope with that. It should be more flexible and adaptable than the Guidebook example, though.

Doesn't do much; just extracts content and tags, and passes them to a second script (ReceiverScript.scpt) which can do whatever it wants with the data. Here's a simple demonstration that separates tags and contents into separate lists, but you could modify it to do almost anything with the incoming data.

======================================================================

--DEMO RECEIVER SCRIPT

(*
Used by simpleTagParse. Must contain 'processTag(txt)' and
'processContent(txt)' handlers, but you can put any code you
like into them (and the rest of the script).
*)

script _kludge -- list access speed hack
    property tagsList : {}
    property contentsList : {}
end script

on processTag(txt)
    set _kludge's tagsList's end to txt
end processTag

on processContent(txt)
    set _kludge's contentsList's end to txt
end processContent

on getResult()
    return {theTags:_kludge's tagsList, theContent:_kludge's contentsList}
end getResult

======================================================================
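For anyone reading along without AppleScript to hand, here is a rough Python rendering of the same approach: split on "<", split each chunk on ">", and hand the pieces to a receiver object. It is a sketch for comparison only. The names ListReceiver and simple_tag_parse are made up for this example, and ValueError stands in for the AppleScript error calls.

```python
class ListReceiver:
    """Counterpart of the demo receiver: keeps tags and contents in two lists."""

    def __init__(self):
        self.tags = []
        self.contents = []

    def process_tag(self, txt):
        self.tags.append(txt)

    def process_content(self, txt):
        self.contents.append(txt)

    def get_result(self):
        return {"theTags": self.tags, "theContent": self.contents}


def simple_tag_parse(txt, receiver):
    """Unforgiving tag splitter: rejects any stray '<' or '>'."""
    chunks = txt.split("<")
    # The first chunk is whatever text precedes the first tag.
    if ">" in chunks[0]:
        raise ValueError("error parsing html file.")
    receiver.process_content(chunks[0])
    # Every remaining chunk should have the form "tag>content".
    for chunk in chunks[1:]:
        parts = chunk.split(">")
        if len(parts) != 2:
            raise ValueError("error parsing html file.")
        receiver.process_tag(parts[0])
        receiver.process_content(parts[1])


receiver = ListReceiver()
simple_tag_parse("<p>Hi <b>there</b></p>", receiver)
print(receiver.get_result())
```

Feeding it something like '<a title="x>y">z</a>' raises the error, which is the limitation noted above: a ">" inside a quoted attribute value yields three pieces instead of two.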

There's also a partly-done XHTML parser + plain-text formatter on my website if you're really curious. Not really in a condition where any but the bravest of ASers could modify/extend it, but perhaps one of these days I'll get around to finishing and documenting it all. (I'm getting quite good at writing parsers, formatters, templating libraries, etc - just not so good at finding time to polish and release them...:)

has
--
http://www.barple.pwp.blueyonder.co.uk -- The Little Page of AppleScripts