Re: Parsing comments from HTML...
- Subject: Re: Parsing comments from HTML...
- From: has <email@hidden>
- Date: Fri, 1 Nov 2002 20:41:52 +0000
Peter Bunn wrote:
> I'm writing a script that tries to retrieve text which has been commented
> out in HTML [...] but as the amount of HTML grows, everything slows to a
> crawl
> (roughly 2 minutes to retrieve 100 items from an HTML page of 100K).
AppleScript is much too slow for iterating across each character in the
string to be practical, unless the strings are very short.
> I've tried other methods - involving 'read to the offset of' and tid's,
> but haven't had much luck... mostly just shots in the dark, owing to my
> inexperience.
TID shuffle is fastest. (Wish it weren't necessary, but that's AS for you...:/)
> I wonder if there's a way to speed up the process?
Here's a quick hack that should do you. It ain't beautiful, but should be pretty
fast + efficient. (In addition to doing the TID thing, it also does whacky
things to prevent large lists dragging performance down [1].)
======================================================================
on extractComments(htm)
	set oldTID to AppleScript's text item delimiters
	-- split on the comment opener; the first item is the text before any comment
	set AppleScript's text item delimiters to "<!--"
	script kludge -- list access speed hack, part 1
		property lst : rest of htm's text items
		property res : {}
	end script
	-- within each chunk, everything before the first closer is the comment text
	set AppleScript's text item delimiters to "-->"
	tell kludge -- list access speed hack, part 2
		repeat with strRef in its lst
			set its res's end to strRef's first text item
		end repeat
	end tell
	set AppleScript's text item delimiters to oldTID
	return kludge's res
end extractComments
======================================================================
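To see what the two delimiter passes are doing, here is the same idea unrolled on a tiny sample string. This is only an illustrative sketch: the sample HTML and the variable names (sampleHTML, chunks, found) are made up for the example.
======================================================================
-- made-up sample input with two comments in it
set sampleHTML to "<p>a</p><!--first--><b>b</b><!-- second --><i>c</i>"
set oldTID to AppleScript's text item delimiters
-- pass 1: split on the opener and drop the text before the first comment
set AppleScript's text item delimiters to "<!--"
set chunks to rest of sampleHTML's text items
-- chunks is now {"first--><b>b</b>", " second --><i>c</i>"}
-- pass 2: keep only what precedes the first closer in each chunk
set AppleScript's text item delimiters to "-->"
set found to {}
repeat with chunkRef in chunks
	set end of found to first text item of (contents of chunkRef)
end repeat
set AppleScript's text item delimiters to oldTID
found --> {"first", " second "}
======================================================================
Note that whitespace inside the comment is kept as-is, and a chunk with no closing "-->" comes back whole.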
One warning: the "htm's text items" bit in the above code will blow up if
there are more than ~4000 comments in the string [2]. You can protect
against that by using the everyItemLib library from my site:
======================================================================
property everyItemLib : load script (alias "path to everyItemLib")
on extractComments(htm)
	...
		property lst : rest of everyItemLib's everyTextItem(htm)
	...
end extractComments
======================================================================
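If you'd rather not load a library, the general idea is to fetch the text items in batches instead of all at once. The following is a rough sketch of that idea only, not everyItemLib's actual code: the handler name everyTextItemSketch and the batch size of 2000 are assumptions, and it further assumes that counting text items and asking for a range of them don't hit the same limit.
======================================================================
-- hypothetical helper: returns all text items of str, fetched in batches
on everyTextItemSketch(str, delim)
	set oldTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to delim
	set res to {}
	set n to count of str's text items
	repeat with i from 1 to n by 2000 -- batch size is an arbitrary guess
		set j to i + 1999
		if j > n then set j to n
		-- a range reference returns a list, so plain concatenation works
		set res to res & (text items i thru j of str)
	end repeat
	set AppleScript's text item delimiters to oldTID
	return res
end everyTextItemSketch

everyTextItemSketch("a,b,c", ",") --> {"a", "b", "c"}
======================================================================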
> As an added bonus, if there's a way to sort the final list
> alphabetically, that would be of great interest also.
Grab yourself a copy of Serge's qSort library from AppleMods (over at
macscripter.net):
======================================================================
property qSort : load script (alias "path to qSort library")
...
set the_read to "your html here"
set commentsList to extractComments(the_read)
set sortedList to qsort's qsort(commentsList)
======================================================================
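If you just want something to test with before fetching the library, a plain insertion sort will cope with a hundred or so items. To be clear, this is a stand-in sketch and not Serge's quicksort: the handler name simpleSort is made up, and it will be much slower than qSort on long lists.
======================================================================
-- hypothetical stand-in: insertion sort on a copy of the list
on simpleSort(theList)
	script kludge -- same list access speed hack as above
		property lst : theList's items -- shallow copy, original untouched
	end script
	repeat with i from 2 to count of kludge's lst
		set v to item i of kludge's lst
		set j to i - 1
		-- shift larger items one place right until v's slot is found
		repeat while j > 0 and (item j of kludge's lst) > v
			set item (j + 1) of kludge's lst to item j of kludge's lst
			set j to j - 1
		end repeat
		set item (j + 1) of kludge's lst to v
	end repeat
	return kludge's lst
end simpleSort

simpleSort({"pear", "apple", "fig"}) --> {"apple", "fig", "pear"}
======================================================================
String comparison here follows AppleScript's defaults, so case is ignored unless you wrap the call in a "considering case" block.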
> (I've left the HTML comment symbols out in case the list server wouldn't
> handle them properly...)
List server will handle < and > fine. It's just anything over ASCII 127 that
it chokes on.
HTH
has
[1] AS's list type has poor performance characteristics: the time taken to
look up a list item increases as the list grows longer. The hacky stuff
with the script object (another trick brought to us by the mighty Serge)
makes the access time constant.
[2] Another known problem: AS reports a stack overflow error when
tokenising a string into more than approx. 4000 items.
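To illustrate footnote [1] on its own, here is a minimal sketch of the script object trick applied to a plain loop. The handler name sumList is made up for the example; the point is only that the list is reached through a script object's property rather than through a local variable.
======================================================================
-- hypothetical example: sum a list of numbers via a script-object property
on sumList(nums)
	script kludge
		property lst : nums
	end script
	set total to 0
	repeat with i from 1 to count of kludge's lst
		-- "item i of kludge's lst" is the form that gets the fast lookup
		set total to total + (item i of kludge's lst)
	end repeat
	return total
end sumList

sumList({1, 2, 3, 4}) --> 10
======================================================================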
--
http://www.barple.pwp.blueyonder.co.uk -- The Little Page of AppleScripts