Regex -- TextCommands vs. Satimage
- Subject: Regex -- TextCommands vs. Satimage
- From: "John R." <email@hidden>
- Date: Mon, 21 Nov 2005 14:16:17 -0500
Michael Ghilissen wrote:
> I can't resolve this. I need to extract the text between ">Cover<"
> and the first "<br><br>" in a text string that contains several
> "<br><br>".
I understand that has's TextCommands can do "lazy" regex searches,
whereas Satimage cannot. As a (very appreciative!) user of both sets
of routines (with no axe to grind...), I wanted to compare the two.
First, it seems that Michael's original problem can NOT be solved
with Satimage if the trailing string, "<br><br>", MUST be more than a
single character such as "<". Does anyone disagree?
Using Satimage has forced me to think of ways to convert lazy regex,
which is fast and easy for humans to dream up, into greedy regex,
which is ugly but does the job. Michael's simple example can indeed
be solved with a harder-to-visualize greedy regex: ">Cover<([^<]*)".
However, that only works if no other "<" appears before the first
"<br><br>", so I suspect it does NOT solve Michael's actual
application need.
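
To make the difference concrete, here is a minimal sketch in Python (not AppleScript, and not either of the tools discussed here). The sample string is made up for illustration. It compares the lazy pattern, the negated-class pattern, and a third variant that avoids lazy quantifiers by using a negative lookahead. Whether any particular engine, Satimage's included, supports lookahead is an open assumption; the sketch only shows the idea in Python's re module.

```python
import re

# Hypothetical sample: the wanted text contains another tag before "<br><br>".
html = 'x>Cover<Front <b>matter</b> here<br><br>More text<br><br>end'

# Lazy quantifier: stops at the first "<br><br>".
lazy = re.search(r'>Cover<(.*?)<br><br>', html).group(1)

# Negated character class: stops at the first "<" of any kind.
negated = re.search(r'>Cover<([^<]*)', html).group(1)

# No lazy quantifier: consume any character that is not "<", or a "<"
# that does not begin "<br><br>" (needs negative lookahead support).
lookahead = re.search(r'>Cover<((?:[^<]|<(?!br><br>))*)<br><br>', html).group(1)

print(repr(lazy))       # 'Front <b>matter</b> here'
print(repr(negated))    # 'Front '
print(repr(lookahead))  # 'Front <b>matter</b> here'
```

The negated-class version cuts the result short as soon as an inner tag appears, which is why it may not fit the real application.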
I suspect that Satimage made a design decision NOT to support lazy
searches in favor of speed. Here is a website reference where I
learned (what little I know) about why "lazy" is much slower than
"greedy". My intuition was the opposite, at first. (see slide #27):
http://perl.plover.com/yak/regex/
So, I tested speed with a "lazy" regex search that ends on a single
trailing character, "<a (.*?)>", and its greedy equivalent,
"<a ([^>]*)>". Either one will find matches on most websites, which
have a lot of extraneous text for a good speed comparison.
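
As a quick check that the two patterns really are interchangeable for this test, here is a small Python sketch (again not AppleScript; the sample HTML is invented). It also shows what a plain greedy ".*" would do, which is the behavior both patterns are designed to avoid.

```python
import re

# Hypothetical one-line sample with two anchor tags.
html = '<p><a href="/one">One</a> and <a class="x" href="/two">Two</a></p>'

lazy = re.findall(r'<a (.*?)>', html)       # lazy: stop at the nearest ">"
negated = re.findall(r'<a ([^>]*)>', html)  # greedy, but cannot cross a ">"
greedy_dot = re.findall(r'<a (.*)>', html)  # greedy: runs to the last ">"

print(lazy)        # ['href="/one"', 'class="x" href="/two"']
print(negated)     # ['href="/one"', 'class="x" href="/two"']
print(greedy_dot)  # one long match that swallows both tags
```

One difference to keep in mind: in Python, "." does not match a newline by default while "[^>]" does, so the two patterns can differ on tags that span several lines. Other engines may behave differently.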
The testing code is at the end of this message, with results from
running it against the source of a Google search results page.
There may be problems with my test:
(1) Satimage is a scripting addition, while TextCommands is a
background application. I suspect scripting additions to be faster,
but also more intrusive, with reserved words, etc. I actually think
this is NOT really a test problem because users must choose between
fast and convenient, including ALL of the trade-offs involved,
assuming scripting additions actually do have a speed advantage.
(2) Not doing these sorts of tests much, I don't know how to get
ticks: only whole seconds via the current date.
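
For comparison, here is how the same kind of loop-and-time harness might look in Python, where a sub-second clock is readily available. This is only a sketch under stated assumptions: the function names (time_this, first_lazy, first_negated) and the sample text are hypothetical, and it says nothing about how TextCommands or Satimage perform.

```python
import re
import time

def time_this(func, arg, repeats=100):
    """Call func(arg) `repeats` times; return elapsed seconds as a float."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    return time.perf_counter() - start

LAZY = re.compile(r'<a (.*?)>')
NEGATED = re.compile(r'<a ([^>]*)>')

def first_lazy(html):
    m = LAZY.search(html)
    return m.group(1) if m else None

def first_negated(html):
    m = NEGATED.search(html)
    return m.group(1) if m else None

# Hypothetical input: lots of filler before the first anchor tag.
sample = ('<div>filler text</div>' * 200) + '<a href="/x">link</a>'

lazy_seconds = time_this(first_lazy, sample)
negated_seconds = time_this(first_negated, sample)
print("lazy: %.6f s, negated: %.6f s" % (lazy_seconds, negated_seconds))
```

With fractional seconds, small differences between two patterns are no longer hidden by rounding to whole seconds.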
Results, with variations:
(a) Satimage 2.5x faster than TextCommands when tested exactly as
below: 10 vs. 4 seconds (TextCommands vs. Satimage)
=> Satimage beats TextCommands in the basic test.
(b) Satimage roughly 1.4x faster than TextCommands for the same regex
"<a ([^>]*)>": 10 vs. 7
=> scripting additions have a slight speed advantage, but not
much. This is a surprise to me.
(c) TextCommands "<a ([^>]*)>" nearly the same speed as TextCommands
"<a (.*?)>": 10 vs. 9
=> greedy is not much faster than lazy. Also a surprise to
me, but maybe an implementation issue...
------------------------------
-- Handler for Looping and Timing
------------------------------
on TimeThis(myScript, myHTML)
    -- Run myScript's DoThis handler 100 times; return elapsed whole seconds
    set x to current date
    repeat 100 times
        myScript's DoThis(myHTML)
    end repeat
    set x2 to current date
    return (x2 - x)
end TimeThis
------------------------------
-- Script for Single "lazy" search using TextCommands
------------------------------
script UsingTextCommands
    on DoThis(myHTML)
        tell application "TextCommands"
            -- Return the captured group of the first match
            return (first item of (search myHTML for "<a (.*?)>" expanding to "\\1" with regex))
        end tell
    end DoThis
end script
------------------------------
-- Script for the equivalent greedy search using Satimage
------------------------------
script UsingSatimage
    on DoThis(myHTML)
        -- Negated character class stands in for the lazy quantifier
        return (find text "<a ([^>]*)>" in myHTML using "\\1" with regexp and string result)
    end DoThis
end script
------------------------------
-- "Functional-Programming" style Main Routine
------------------------------
tell application "Safari"
    -- Use the HTML source of the frontmost document as test input
    set myHTML to source of document 1
end tell
set TextCommandsResult to my TimeThis(UsingTextCommands, myHTML)
set SatimageResult to my TimeThis(UsingSatimage, myHTML)
return "TextCommandsResult: " & TextCommandsResult & ", SatimageResult: " & SatimageResult
--> Result: "TextCommandsResult: 10, SatimageResult: 4"
_______________________________________________
Applescript-users mailing list (email@hidden)