Re: pdftotext

  • Subject: Re: pdftotext
  • From: Thomas Fischer <email@hidden>
  • Date: Thu, 26 Dec 2013 17:48:44 +0100

Hello Yvan,

Sorry if I've been unclear. I was in a bit of a hurry, and this was just a snippet from a larger script I used to compare different options for converting PDF to text, in which PDFBox was the clear winner.

-Xmx1G
is a parameter for the java call that sets the maximum heap size of the process to 1 GB.
Leaving it out will usually not make a difference, unless the conversion runs out of memory, that is.
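
A quick way to see what the flag changes is to ask the JVM which maximum heap size it settles on, with and without it. This is only a sketch: it assumes a HotSpot JVM on the PATH (the -XX:+PrintFlagsFinal option is HotSpot-specific), and `show_max_heap` is a helper name made up for this example.

```shell
#!/bin/sh
# show_max_heap is a hypothetical helper for this example.
# It starts the JVM with any extra options passed in, asks it to dump
# its final flag values, and keeps only the MaxHeapSize line.
# Assumption: a HotSpot JVM; other JVMs may not support PrintFlagsFinal.
show_max_heap() {
    java "$@" -XX:+PrintFlagsFinal -version 2>/dev/null | grep -w MaxHeapSize
}

if command -v java >/dev/null 2>&1; then
    echo "default:"
    show_max_heap            # value chosen by the JVM itself
    echo "with -Xmx1G:"
    show_max_heap -Xmx1G     # should report 1073741824 bytes (1 GB)
else
    echo "java not found on PATH"
fi
```

The default differs from machine to machine, which is why the flag often changes nothing visible for a single small PDF.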

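I used this to convert hundreds of (mathematical) PDF files, so for the general user this value will be too high, I suppose. The java man page explains the size syntax for the related -Xms option (initial heap size); -Xmx, the maximum heap size, takes the same suffixes: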

-Xmsn
Specifies the initial size of the memory allocation pool. This value must be a multiple of 1024 greater than 1 MB. Append the letter k or K to indicate kilobytes, the letter m or M to indicate megabytes, the letter g or G to indicate gigabytes, or the letter t or T to indicate terabytes. The default value is 2MB. Examples:
-Xms6291456
-Xms6144k
-Xms6m
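
The three man-page examples all denote the same size, which a little shell arithmetic can confirm. The function `jvm_size_to_bytes` below is invented for illustration and is not part of Java; it merely mirrors the suffix rules quoted above.

```shell
#!/bin/sh
# jvm_size_to_bytes is a hypothetical helper, not part of Java.
# It converts a size such as 6144k, 6m or 1G into bytes, following the
# suffix rules from the man page (k, m, g, t in either case).
jvm_size_to_bytes() {
    num=${1%[kKmMgGtT]}          # numeric part with the suffix stripped
    case $1 in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
        *[tT]) echo $((num * 1024 * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;    # no suffix: the value is already in bytes
    esac
}

jvm_size_to_bytes 6291456   # 6291456
jvm_size_to_bytes 6144k     # 6291456
jvm_size_to_bytes 6m        # 6291456
jvm_size_to_bytes 1G        # 1073741824
```

So the 1G in -Xmx1G stands for 1073741824 bytes.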

And no, I don't think that you can use PDFBox without the Java engine.
But the source code is freely available, if you want to try…

Best
Thomas

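-- Output format: "txt" for plain text, "Html" for HTML
set theFormat to "txt"
-- pdfbox-app-1.8.3.jar is expected in the user's Applications folder
set classpath to quoted form of POSIX path of ((path to applications folder from user domain as text) & "pdfbox-app-1.8.3.jar")
set theFile to POSIX path of ((path to desktop folder as text) & "prix Gerbino.numbers - copie.pdf")

-- -Xmx1G sets the maximum Java heap size to 1 GB
set theCall to "java -Xmx1G -classpath " & classpath & " org.apache.pdfbox.ExtractText -encoding UTF-8 -sort -nonSeq "
if theFormat is "Html" then
    set theCall to theCall & "-html "
    set theSuffix to ".html"
else
    set theSuffix to ".txt"
end if

-- Strip the extension: split on ".", drop the last item, rejoin with "."
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to {"."}
set newPath to (text items 1 thru -2 of theFile as text) & "-1" & theSuffix
-- Restore the delimiters so that later code is not affected
set AppleScript's text item delimiters to savedDelimiters

do shell script theCall & quoted form of theFile & space & quoted form of newPath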

Yvan KOENIG (VALLAURIS, France) Sunday, 22 December 2013 16:08:36



May I get an explanation of the parameter -Xmx1G used in the code posted by Thomas Fischer?
(1) It isn't described on the Command Line Tools web page dedicated to PDFBox.
(2) As I am curious, I removed it, and the resulting text file is exactly the same with and without it.

Yvan KOENIG (VALLAURIS, France) Monday, 23 December 2013 11:58:24
