I've been working with sed and awk, and I was using this little project to learn a few things, but I got really ticked that Apple's versions of the two programs are old and out-of-date, lacking some of the features of the modern GNU versions, such as case-insensitive search. (I have gsed and gawk compiled with MacPorts, but that doesn't make for portable code. I need to hurry up and get to Perl, it seems.)
I was just going to use awk's tolower() function to filter the input, but I discovered that the website is case-sensitive, so I snuck in a little Perl for convenience.
In about 2.5 minutes the script will create a report of all the archived desktop images on the site.
There are a few. :)
There are 2602 listed, but one of them is missing.
The script relies on TextWrangler as a display/progress mechanism (although I use BBEdit myself), which made this project not only possible but relatively easy.
I do realize of course that it could all be done from the shell, but I'm not quite there yet (and that's for another list as well :).
In any case this was fun.
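The case-insensitivity workaround mentioned above can be sketched in a couple of lines. BSD sed has no GNU-style `I` match modifier, so Perl's `/i` flag (or awk's tolower(), as originally planned) fills in; the sample strings here are mine:

```shell
# Case-insensitive matching without GNU sed:
# perl's /i modifier, or awk's tolower() to normalize before matching.
echo 'Desktop Pictures' | perl -wlne 'print if m/desktop/i'
echo 'Desktop Pictures' | awk 'tolower($0) ~ /desktop/'
```

Both commands print the line unchanged, since each matches regardless of case.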
################################################################################################
# ATPM DESKTOP PICTURE ARCHIVE REPORTER
# 2012-09-09 : 05:15
################################################################################################
################################################################################################
------------------------------------------------------------------------------------------------
--» HANDLERS
------------------------------------------------------------------------------------------------
on QF(str)
return quoted form of str
end QF
------------------------------------------------------------------------------------------------
on urlPath(_url)
-- Return _url minus its last path component, keeping the
-- trailing slash (e.g. ".../a/b/x.html" becomes ".../a/b/").
set {oldTIDS, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "/"}
set _url to ((text items 1 thru -2 of _url) & "") as text
set AppleScript's text item delimiters to oldTIDS
return _url
end urlPath
------------------------------------------------------------------------------------------------
--» MAIN
------------------------------------------------------------------------------------------------
set rtURL to "http://www.atpm.com" -- site root (value assumed)
set arcURL to rtURL & "/Back/desktop-pictures.shtml" -- archive page (path assumed)
set arcUrlList to do shell script "curl -sL --user-agent 'Opera/9.70 (Linux ppc64 ; U; en) Presto/2.2.1' --url " & QF(arcURL) & " \\
| sed -En '/<H1.+Desktop.Pictures.*<\\/H1>/,/<\\/UL>/p' | sed -En '/^ *<LI>.+·/p' \\
| perl -wlne 's|^[^\\.]+([^\"]+)\">([^<]+)</A> *· *([^<]+)</LI>.*$|\\1 • \\2 · \\3|g; print' \\
| sed -E 's|^\\.{2}|" & rtURL & "|' \\
| sed -E 's|&#233;|é|;s| *&#8212; *| — |;s| *&#8217; *|’|'
" without altering line endings
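-- Pipeline above, stage by stage (my reading): curl fetches the archive
-- index; the first sed keeps the block from the "Desktop Pictures" <H1>
-- through the closing </UL>, and the second keeps only the <LI> lines;
-- perl reshapes each entry to "URL • Name · Issue"; the final seds root
-- a leading ".." with rtURL and tidy a few special characters.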
set pgNameList to do shell script "perl -wlne 'if (m/ • (.+)/i) {print \"$1\"}' <<< " & QF(arcUrlList)
set arcUrlList to do shell script "perl -wlne 'if (m/^(.+)(?= • )/i) {print \"$1\"}' <<< " & QF(arcUrlList)
set pgNameList to paragraphs of pgNameList
set arcUrlList to paragraphs of arcUrlList
------------------------------------------------------------------------------------------------
tell application "TextWrangler"
activate
set _doc to make new document
save _doc to ("" & (path to desktop) & "ATPM Desktop Image Archive Report.txt")
set bounds of front window to {0, 44, 1615, 1196}
end tell
------------------------------------------------------------------------------------------------
repeat with i from 1 to length of arcUrlList
set _url to item i of arcUrlList
set _uPfx to urlPath(_url)
set imgList to do shell script "curl -sL --user-agent 'Opera/9.70 (Linux ppc64 ; U; en) Presto/2.2.1' --url " & QF(_url) & " \\
| perl -wlne 'if (m/a.href=\"([^\"]+)\"/i) {print \"$1\"}' \\
| sed -E 's|^| " & _uPfx & "|'
" -- href regex is a minimal reconstruction: capture each quoted link target, then prefix the page's path
tell application "TextWrangler"
tell _doc
set after its text to (item i of pgNameList) & tab & (item i of arcUrlList) & return & return
set after its text to imgList & return & return
select insertion point after last character
end tell
end tell
end repeat
------------------------------------------------------------------------------------------------