Re: Reading a pdf text file


Subject: Re: Reading a pdf text file
From: Roger Howard <email@hidden>
Date: Mon, 10 Jan 2005 15:18:28 -0800


On Jan 10, 2005, at 1:41 PM, Gil Dawson wrote:

Several techniques were tried for converting a mostly-text pdf file to text using AppleScript. This is a report on the results of these techniques.

1. Sending the pdf file to Adobe Acrobat Elements Server <email@hidden>. Attached to the return email is a .txt file that has an extraneous space character at the end of every line. I did not test the translation of special characters.

2. Executing an AppleScript that uses UI Scripting with Script Editor 1.8.3 (Classic) to control Adobe Reader 6.0 to Open, Select All, and Next Page to copy the text page by page to another file. This script uses System Events and so works only in Panther, even though Script Editor 1.8.3 runs in Classic mode. The result is the same as #1, above, without the extraneous line-end space. A difficulty with this script is we have not yet come up with an acceptable test for the end of the document. I did not test the translation of special characters.

3. Executing the same script as #2, above, but using Script Editor 2.0 (X), instead of Script Editor 1.8.3 (Classic). An unusual effect was that the System Events commands (e.g., "keystroke") reverted to their "<<class xxxx>>" form after compiling. However, the result is identical to #2, above.

4. Executing a variation of #2, above, on a machine booted with 9.2.2, but using Sändi's Additions instead of System Events and Acrobat Reader 4.0 instead of Adobe Reader 6.0. Sändi's Additions, and thus this script, work only in 9.2.2 and earlier. The result is identical to #2, above.

5. Executing pdftotext, an Open Source utility for converting pdf files to text, run in a shell script called from within an AppleScript to produce a .txt file in the same folder. The resultant file is usable, but contains numerous, seemingly sporadic, space characters which make parsing a bit more difficult. I did not test the translation of special characters.


Gil,

Sorry I skipped out on all the intervening fun; I haven't checked my AS-Users mailbox since last week. That said, glad the others filled in what to do with my suggestion...

The problem with the extra spaces, etc., is due to the structure of PDFs: they aren't particularly well suited to extraction of even a simple paragraph, let alone an entire document's worth of text. It's amazing the tool outputs as well as it does... when Panther came out, Steve Jobs made a big deal of all the work they did to make selecting/copying text from PDFs in Preview.app actually usable - a lot of black magic involved. That said, if you know the nature of the PDF (single or multiple columns, sidebars, etc.) you can tweak the pdftotext settings a little to get better results (or worse).

For instance, try adding one of these options before the filename in the do shell script command:

-layout
or
-raw
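
As a rough sketch of where the option goes, the shell side might look like the following. The function names, the mode names and the sample path are placeholders of mine, not anything from Gil's script; the only real pieces are pdftotext and its -layout and -raw options.

```shell
#!/bin/sh
# Sketch only: mode_flag, convert_pdf and the sample path are hypothetical names.

# Print the pdftotext option for a given mode name.
mode_flag() {
    case "$1" in
        layout) echo "-layout" ;;  # keep the physical page layout as far as possible
        raw)    echo "-raw" ;;     # keep the text in content-stream order
        *)      echo "" ;;         # anything else: pdftotext's default mode
    esac
}

# Convert a PDF. With no output name given, pdftotext writes file.txt
# next to file.pdf.
convert_pdf() {
    flag=$(mode_flag "$1")
    # $flag is left unquoted on purpose so an empty value adds no argument.
    pdftotext $flag "$2"
}

# Example (uncomment to run against a real file):
# convert_pdf layout "$HOME/Documents/report.pdf"

# From AppleScript the equivalent call would be along the lines of
# (pdfFile being a hypothetical alias variable):
#   do shell script "pdftotext -layout " & quoted form of POSIX path of pdfFile
```

Which mode helps depends on the document: -layout tends to suit tables and columns you want to keep aligned, while -raw often undoes column formatting.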

To clean up the extra whitespace, you could either find/replace in AppleScript... or you could output to HTML using:

-htmlmeta

And then pipe that through a tool to generate plain text from the HTML (though that's a bit complicated, so I'd just strip redundant whitespace from within AppleScript).
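
If you would rather do the whitespace cleanup in the shell step than in AppleScript, a minimal sketch is below. This is an alternative to the AppleScript find/replace, not the same thing; the function name squeeze_spaces is made up for illustration, and it assumes you don't care about column alignment (so it would undo the effect of -layout).

```shell
#!/bin/sh
# Sketch only: squeeze_spaces is a hypothetical helper name.

# Read text on stdin; on each line, collapse every run of spaces/tabs
# to a single space and drop a leading or trailing space.
squeeze_spaces() {
    sed -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/^ //' \
        -e 's/ $//'
}

# Example (uncomment to run; "-" as the output name sends text to stdout):
# pdftotext "$HOME/Documents/report.pdf" - | squeeze_spaces > report.txt
```

Because it works line by line, line breaks and blank lines are left alone; only the spacing within each line changes.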

- Roger