Re: pdftotext
- Subject: Re: pdftotext
- From: "koenig.yvan" <email@hidden>
- Date: Mon, 23 Dec 2013 11:59:33 +0100
On 22/12/2013 at 14:17, Thomas Fischer <email@hidden> wrote:
Hello,
if you want to do any serious conversion from PDF to text, I would advise trying PDFBox (http://pdfbox.apache.org/). It is much better than pdftotext, Skim or the Apple built-in tools at recognising non-ASCII characters and spaces. I use a script that contains something like
set classpath to "…"
set theFile to "…"
set theCall to "java -Xmx1G -classpath " & classpath & " org.apache.pdfbox.ExtractText -encoding UTF-8 -sort -nonSeq "
if theFormat is "Html" then
    set theCall to theCall & "-html "
    set theSuffix to ".html"
else
    set theSuffix to ".txt"
end if
set the text item delimiters to {"."}
set newPath to (text items 1 thru -2 of theFile as text) & "-1" & theSuffix
do shell script theCall & quoted form of theFile & space & quoted form of newPath
Best Thomas
Hello Thomas
Could you explain to a fool like me what the actual value of classpath is supposed to be?
I just downloaded pdfbox-app-1.8.3.jar
I assume that it's the quoted form of the POSIX path of the jar file, but I wish to check before running it.
No problem for theFile.
Yvan KOENIG (VALLAURIS, France) Sunday 22 December 2013 15:48:02
Don't worry, I got it.
set theFormat to "txt"
-- Classpath: quoted POSIX path of the PDFBox jar, here stored in the user's Applications folder
set classpath to quoted form of POSIX path of ((path to applications folder from user domain as text) & "pdfbox-app-1.8.3.jar")
-- Source PDF on the desktop
set theFile to POSIX path of ((path to desktop folder as text) & "prix Gerbino.numbers - copie.pdf")

set theCall to "java -Xmx1G -classpath " & classpath & " org.apache.pdfbox.ExtractText -encoding UTF-8 -sort -nonSeq "
if theFormat is "Html" then
    set theCall to theCall & "-html "
    set theSuffix to ".html"
else
    set theSuffix to ".txt"
end if
-- Drop the last extension of the source path, then append "-1" and the new suffix
set the text item delimiters to {"."}
set newPath to (text items 1 thru -2 of theFile as text) & "-1" & theSuffix
do shell script theCall & quoted form of theFile & space & quoted form of newPath
Yvan KOENIG (VALLAURIS, France) Sunday 22 December 2013 16:08:36
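For readers who want to check what output path the script above produces, the text-item-delimiters step can be sketched in plain shell. This is only an illustration: the helper name derive_output_path and the /Users/me/Desktop folder are hypothetical and not part of the posted script.

```shell
#!/bin/sh
# Hypothetical helper mirroring the AppleScript step above:
# drop the last ".extension" of the input path, then append "-1" and the suffix.
derive_output_path() {
    input_path=$1
    suffix=$2    # ".txt" or ".html"
    # ${input_path%.*} removes the shortest trailing match of ".*",
    # i.e. only the last extension, even if the name contains other dots.
    printf '%s-1%s\n' "${input_path%.*}" "$suffix"
}

# Example with a hypothetical desktop folder:
derive_output_path "/Users/me/Desktop/prix Gerbino.numbers - copie.pdf" ".txt"
# prints: /Users/me/Desktop/prix Gerbino.numbers - copie-1.txt
```

As in the AppleScript version, a path whose file name has no extension but whose folder name contains a dot would be cut at the wrong place, so the sketch assumes the input always ends in an extension such as ".pdf".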
May I get an explanation of the -Xmx1G parameter used in the code posted by Thomas Fischer?
(1) It isn't described on the Command Line Tools web page dedicated to PDFBox.
(2) As I am curious, I removed it, and the resulting text file is exactly the same with and without it.
Yvan KOENIG (VALLAURIS, France) Monday 23 December 2013 11:58:24
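A note on -Xmx1G: it is an option of the java launcher itself rather than of PDFBox's ExtractText, which would explain why the PDFBox command-line page does not list it. It sets the maximum heap size of the Java virtual machine, here 1 gigabyte, so it only matters for documents that need more memory than the default heap, and identical output with and without it is expected for a small PDF. The sketch below only illustrates how the size suffix is read; the function name xmx_to_bytes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert the value given to -Xmx (e.g. "1G", "512m")
# into bytes. The JVM reads k/K, m/M and g/G as powers of 1024.
xmx_to_bytes() {
    value=$1
    number=${value%[kKmMgG]}    # numeric part with the unit suffix removed
    case $value in
        *[kK]) echo $((number * 1024)) ;;
        *[mM]) echo $((number * 1024 * 1024)) ;;
        *[gG]) echo $((number * 1024 * 1024 * 1024)) ;;
        *)     echo "$number" ;;    # no suffix: already in bytes
    esac
}

xmx_to_bytes 1G
# prints: 1073741824
```

Because -Xmx is a JVM option, it has to stay before the class name org.apache.pdfbox.ExtractText on the command line; arguments placed after the class name are handed to the program instead.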