If you want to do any serious conversion from PDF to text, I would advise trying PDFBox (http://pdfbox.apache.org/). It is much better than pdftotext, Skim, or Apple's built-in tools at recognising non-ASCII characters and spaces. I use a script that contains something like this:
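-- theFormat is assumed to be set earlier in the script ("Html" for HTML output, anything else for plain text)
set classpath to "…"
set theFile to "…"
-- Build the PDFBox ExtractText command line
set theCall to "java -Xmx1G -classpath " & classpath & " org.apache.pdfbox.ExtractText -encoding UTF-8 -sort -nonSeq "
if theFormat is "Html" then
	set theCall to theCall & "-html "
	set theSuffix to ".html"
else
	set theSuffix to ".txt"
end if
-- Strip the last extension and append "-1" plus the new suffix
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to {"."}
set newPath to (text items 1 thru -2 of theFile as text) & "-1" & theSuffix
set AppleScript's text item delimiters to savedDelimiters
do shell script theCall & quoted form of theFile & space & quoted form of newPath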
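
In case the output-path step in the script is not obvious, here is a minimal sketch of just that part, pulled out into a handler. The handler name outputPathFor and the example path are made up for illustration and are not part of my actual script. It splits the path on ".", drops the last piece (the extension), rejoins the rest, and appends "-1" and the new suffix. Unlike the one-liner in the script, it also copes with a path that contains no dot at all.

```applescript
-- Hypothetical helper: derive the output path from the input path.
on outputPathFor(thePath, theSuffix)
	-- Save the current delimiters so the rest of the script is unaffected
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {"."}
	set theItems to text items of thePath
	if (count of theItems) > 1 then
		-- Rejoin everything except the last item (the extension) with "."
		set theBase to (items 1 thru -2 of theItems) as text
	else
		-- No dot in the path: keep it as it is
		set theBase to thePath
	end if
	set AppleScript's text item delimiters to savedDelimiters
	return theBase & "-1" & theSuffix
end outputPathFor

outputPathFor("/tmp/report.v2.pdf", ".txt")
--> "/tmp/report.v2-1.txt"
```

Note that only the last extension is removed, so dots elsewhere in the file name survive. A dot in a folder name combined with a file that has no extension would still be cut in the wrong place, so this is only a sketch.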