I love regular expressions and grep. They are not going to be useful for reading the raw PDF, though, even if it weren’t completely binary.
1. Most PDFs have lots of puddles of compression on individual objects throughout.
2. It would be difficult to guarantee all textual content in PDF remains uncompressed in all cases.
3. With very short patterns, like the “FCC” example David gave, the risk of getting hits in PDF code or binary graphics is real.
4. Lines, words, and even letters in PDFs are often surrounded by the page description language that is PDF.
Regarding the last issue: searching for a longer pattern like “FCC regulations” in the PDF source would fail whenever the layout put a line break between the two words. You cannot even guarantee that individual letters are not surrounded by layout code.
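To make points 1 and 4 concrete, here is a rough Python sketch. The content streams in it are made up for illustration (they are not taken from any real PDF), and raw_hit is just a hypothetical stand-in for a fixed-string grep over the file's bytes.

```python
import zlib

# Hypothetical page content stream (made up for illustration). The layout puts
# a line break after "FCC", so the phrase is split across two Tj operators.
content = (
    b"BT /F1 12 Tf 72 720 Td "
    b"(Devices must follow FCC) Tj "
    b"0 -14 Td "
    b"(regulations.) Tj ET"
)

# Hypothetical kerned text: a TJ array interleaves single letters with
# spacing adjustments, so even "FCC" is not contiguous in the source.
kerned = b"BT /F1 12 Tf [(F) -20 (C) -15 (C)] TJ ET"

# Roughly what a Flate-compressed stream would hold on disk for that content.
compressed = zlib.compress(content)


def raw_hit(data: bytes, needle: bytes) -> bool:
    """Mimic a fixed-string grep over raw bytes."""
    return needle in data


if __name__ == "__main__":
    print("'FCC' in uncompressed stream:", raw_hit(content, b"FCC"))
    print("'FCC regulations' in uncompressed stream:",
          raw_hit(content, b"FCC regulations"))
    print("'FCC' in kerned stream:", raw_hit(kerned, b"FCC"))
    print("'regulations' in compressed stream:",
          raw_hit(compressed, b"regulations"))
    print("'regulations' after decompressing:",
          raw_hit(zlib.decompress(compressed), b"regulations"))
```

How often a real file looks like this depends entirely on the software that produced it, so treat it as an illustration of the failure modes rather than a description of any particular PDF.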
Your best bet is one of the suggested search tools that can handle PDF compression and that sees the readable text of the PDF as cohesive sentences and paragraphs.
Another approach would be to extract the text from the PDF (there are lots of tools for that) and then grep. With this approach you can confirm that the readable text of the PDF is being identified correctly, and it turns the search into an easier problem with a broader range of solutions.
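As a minimal sketch of the second half of that approach, assuming the text has already been extracted by some external tool and is sitting in a Python string: the find_phrase helper and the sample text below are hypothetical, made up for illustration. The one trick is letting any run of whitespace, including a line break, stand between the words of the phrase.

```python
import re


def find_phrase(text: str, phrase: str, ignore_case: bool = False) -> list:
    """Return the start offset of every occurrence of phrase in text.

    Any run of whitespace in the text (spaces, tabs, line breaks) is accepted
    between the words of the phrase.
    """
    words = phrase.split()
    if not words:
        return []
    # Escape each word so it is matched literally, then join with \s+.
    pattern = r"\s+".join(re.escape(word) for word in words)
    flags = re.IGNORECASE if ignore_case else 0
    return [match.start() for match in re.finditer(pattern, text, flags)]


# Hypothetical extracted text; a real extraction tool would supply this.
extracted = (
    "Devices must comply with FCC\n"
    "regulations. The FCC publishes these rules.\n"
    "See the FCC  regulations for details.\n"
)

if __name__ == "__main__":
    print("FCC:", find_phrase(extracted, "FCC"))
    print("FCC regulations:", find_phrase(extracted, "FCC regulations"))
```

Because it returns every offset rather than a count of matching lines, it also sidesteps the difference between "lines that match" and "occurrences". It does not check word boundaries, so a search for "FCC" would also hit a longer token that contains it.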
Dave
Why wouldn’t grep or one of its relatives do this for you? If you look at the man page for grep, it looks like it could (as long as it’s not a binary PDF):
The grep utility searches any given input files, selecting lines that
match one or more patterns. By default, a pattern matches an input line
if the regular expression (RE) in the pattern matches the input line
without its trailing newline. An empty expression matches every line.
Each input line that matches at least one of the patterns is written to
the standard output.
-c, --count
Only a count of selected lines is written to standard output.
HTH, Phil
On Oct 28, 2015, at 11:49 AM, David Crowe <email@hidden> wrote:
Emanuel;
Using Smile may be the best solution … but the “all occurrences” flag didn’t seem to work for me. I searched for a simple 3-letter string (“FCC”) in a PDF file that had about 20 occurrences and it only produced one of them.
It also seemed unable to find more than the first string. If I set “starting at” to the position of the first find, plus one, it generated an error, indicating that there were no further hits. But there clearly are. And most of the occurrences of the string are in the same case as the search string I provided.
- David
------------------------------
Message: 3
Date: Wed, 28 Oct 2015 13:03:54 +0100
From: Emmanuel LEVY <email@hidden>
To: AppleScript Listserv <email@hidden>
Subject: Re: PDF Search
Message-ID: <email@hidden>
Content-Type: text/plain; charset=us-ascii
In addition to the excellent replies you got, you can use Smile - that's a free AppleScript environment - like in:
find text [ some string here ] in [ some path to a pdf file here ]
Best,
Emmanuel
The latest version of Smile is available at
http://satimage.fr/software/en/downloads/downloads_nextsmile.html
_______________________________________________
Do not post admin requests to the list. They will be ignored.
AppleScript-Users mailing list ( email@hidden)
Help/Unsubscribe/Update your Subscription:
Archives: http://lists.apple.com/archives/applescript-users
This email sent to email@hidden