Re: Reading a pdf text file
- Subject: Re: Reading a pdf text file
- From: Roger Howard <email@hidden>
- Date: Mon, 10 Jan 2005 15:18:28 -0800
On Jan 10, 2005, at 1:41 PM, Gil Dawson wrote:
Several techniques were tried for converting a mostly-text pdf file to
text using AppleScript. This is a report on the results of these
techniques.
1. Sending the pdf file to Adobe Acrobat Elements Server
<email@hidden>. Attached to the return email is a .txt file that
has an extraneous space character at the end of every line. I did not
test the translation of special characters.
2. Executing an AppleScript that uses UI Scripting with Script Editor
1.8.3 (Classic) to control Adobe Reader 6.0 to Open, Select All, and
Next Page to copy the text page by page to another file. This script
uses System Events and so works only in Panther, even though Script
Editor 1.8.3 runs in Classic mode. The result is the same as #1,
above, without the extraneous line-end space. A difficulty with this
script is that we have not yet come up with an acceptable test for the end
of the document. I did not test the translation of special
characters.
3. Executing the same script as #2, above, but using Script Editor
2.0 (X), instead of Script Editor 1.8.3 (Classic). An unusual effect
was that the System Events commands (e.g., "keystroke") reverted to
their "<<class xxxx>>" form after compiling. However, the result is
identical to #2, above.
4. Executing a variation of #2, above, in a machine booted with
9.2.2, but using Sändi's Additions instead of System Events and
Acrobat Reader 4.0 instead of Adobe Reader 6.0. Sändi's Additions,
and thus this script, work only in 9.2.2 and earlier. The result is
identical to #2, above.
5. Executing pdftotext, an open-source PDF-to-text converter, from a
shell script called from within an AppleScript to produce a .txt
file in the same folder. The resulting file is usable, but contains
numerous, seemingly sporadic, space characters which make parsing a
bit more difficult. I did not test the translation of special
characters.
Gil,
Sorry I skipped out on all the intervening fun; I haven't checked my
AS-Users mailbox since last week. That said, glad the others filled in
what to do with my suggestion...
The problem with the extra spaces, etc., is due to the structure of PDFs
- a PDF stores positioned runs of glyphs rather than words and
paragraphs, so the extractor has to guess where spaces and line breaks
belong. PDFs aren't particularly well suited to extraction of even a
simple paragraph, let alone an entire document's worth of text. It's
amazing the tool outputs as well as it does... when Panther came out,
Steve Jobs made a big deal of all the work they did to make
selecting/copying text from PDFs in Preview.app actually usable - a lot
of black magic involved. That said, if you know the nature of the PDF
(single or multiple columns, sidebars, etc.) you can tweak the pdftotext
settings a little to get better results (or worse).
For instance, try adding one of these options before the filename in
the do shell script command:
-layout
or
-raw
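To make the two options easier to compare, here is a rough sketch of a
shell wrapper you could call from do shell script. The function name
pdf_to_text and the PDFTOTEXT override variable are made up for
illustration; only the -layout and -raw options come from pdftotext
itself, and I'm assuming pdftotext is on the PATH.

```shell
#!/bin/sh
# Sketch: run pdftotext on a PDF with a chosen text-ordering mode.
# Usage: pdf_to_text MODE FILE.pdf   (MODE is layout, raw, or default)
# pdf_to_text and PDFTOTEXT are hypothetical names, not part of pdftotext.
pdf_to_text() {
    mode=$1
    pdf=$2
    case "$mode" in
        layout)  flag="-layout" ;;  # try to keep the physical page layout
        raw)     flag="-raw" ;;     # keep text in content-stream order
        default) flag="" ;;         # pdftotext's normal reading-order guess
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    # With no output name given, pdftotext writes FILE.txt next to FILE.pdf.
    # $flag is left unquoted on purpose so an empty flag adds no argument.
    "${PDFTOTEXT:-pdftotext}" $flag "$pdf"
}
```

Running it once per mode on the same file and comparing the resulting
.txt files is probably the quickest way to see which one suits a given
document.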
To clean up the extra whitespace, you could either find/replace in
AppleScript... or you could output to HTML using this option:
-htmlmeta
And then pipe that through a tool to generate plain text from the HTML
(though that's a bit complicated, so I'd just strip redundant
whitespace from within AppleScript).
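If you'd rather do the cleanup in the shell than in AppleScript, a
filter along these lines should work as a starting point. This is a
different route from the AppleScript find/replace I suggested, and
the function name squeeze_spaces is just something I picked; it uses
only standard tr and sed.

```shell
#!/bin/sh
# Sketch: collapse runs of spaces and tabs into a single space and
# drop whitespace at the end of each line. Reads stdin, writes stdout.
# squeeze_spaces is a hypothetical helper name.
squeeze_spaces() {
    tr '\t' ' ' |      # turn tabs into spaces first
    tr -s ' ' |        # squeeze each run of spaces down to one
    sed 's/ *$//'      # strip any space left at the end of a line
}
```

You would pipe the text file through it, e.g. squeeze_spaces <
file.txt > clean.txt. It's best not to combine this with -layout,
since that mode relies on runs of spaces to keep columns aligned.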
- Roger