Re: PDF Search


  • Subject: Re: PDF Search
  • From: "Jacopille, David" <email@hidden>
  • Date: Wed, 28 Oct 2015 18:45:46 +0000
  • Thread-topic: PDF Search

Love regular expressions and grep. They are not going to be useful for reading the raw PDF, though, even if it wasn’t completely binary.

1.  Most PDFs have lots of puddles of compression on individual objects throughout the file.
2.  It would be difficult to guarantee that all textual content in a PDF remains uncompressed in all cases.
3.  With very short patterns, like the “FCC” example David gave, there is a real risk of getting hits in PDF code or binary graphics.
4.  Lines, words, and even letters in PDFs are often surrounded by the page description language that is PDF.
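
To illustrate the first two points, here is a minimal Python sketch (not from the original thread). It assumes the page text sits in a stream compressed the way PDF's FlateDecode filter does it, which is zlib/deflate. The page content is made up for the example.

```python
import zlib

# Hypothetical page content. In a real PDF this would sit inside a
# stream object marked with /Filter /FlateDecode.
page_text = b"BT /F1 12 Tf 72 700 Td (FCC regulations apply here) Tj ET\n" * 20

# FlateDecode data is zlib/deflate, so zlib stands in for the PDF writer.
compressed = zlib.compress(page_text)


def count_hits(data: bytes, needle: bytes) -> int:
    """Count non-overlapping occurrences of needle in data."""
    return data.count(needle)


raw_hits = count_hits(compressed, b"FCC")
decoded_hits = count_hits(zlib.decompress(compressed), b"FCC")

print("hits in the compressed bytes:", raw_hits)
print("hits after decompressing:", decoded_hits)
```

A byte-level search such as grep only ever sees the compressed form, so it will most likely miss every one of the 20 occurrences that show up after decompression.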

Regarding the last issue: searching the PDF source for a larger pattern like “FCC regulations” would always fail if there were a line break between the two words in the layout of the PDF. You cannot even guarantee that individual letters are not surrounded by layout code.
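
A small Python sketch of that failure, assuming an uncompressed content stream where the two words land on different lines. The stream below is invented for illustration, and the extraction function is a toy: it ignores escapes, hex strings, and font encodings, so it is not a real PDF parser.

```python
import re

# Hypothetical uncompressed content stream: "T*" moves to the next line,
# so the phrase is split across two text-showing (Tj) operations.
CONTENT_STREAM = b"BT /F1 12 Tf 72 700 Td (the FCC) Tj T* (regulations apply) Tj ET"


def naive_visible_text(stream: bytes) -> str:
    """Join the literal strings found in parentheses (toy extraction only)."""
    parts = re.findall(rb"\(([^()\\]*)\)", stream)
    return " ".join(part.decode("latin-1") for part in parts)


phrase = "FCC regulations"
print("found in raw source:", phrase.encode("ascii") in CONTENT_STREAM)
print("found in extracted text:", phrase in naive_visible_text(CONTENT_STREAM))
```

The phrase is plainly there for a human reader, but in the source the two words are separated by `Tj T*` and parentheses, so a literal search on the source cannot match it.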

Your best bet is one of the suggested search tools that can handle PDF compression and that sees the readable text of the PDF as cohesive sentences and paragraphs.

Another approach would be to extract the text from the PDF (there are lots of tools for that) and then grep the result. With this approach you can confirm that the readable text of the PDF is being identified correctly. It also turns the search into an easier problem with a broader range of solutions.
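
As a rough sketch of the extract-then-search approach in Python: the extraction step assumes the `pdftotext` command-line tool (from Poppler/Xpdf) is installed, which is only one of many possible extractors, and the function names and sample text are made up for the example. The search step collapses whitespace first so a phrase still matches when the layout breaks it across lines.

```python
import re
import shutil
import subprocess


def extract_text(pdf_path: str) -> str:
    """Extract readable text using pdftotext (assumed to be installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; use another extractor")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" writes the text to stdout
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def count_phrase(text: str, phrase: str) -> int:
    """Count occurrences of phrase, treating any run of whitespace as one space."""
    flat = re.sub(r"\s+", " ", text)
    return len(re.findall(re.escape(phrase), flat))


# Stand-in for extracted text; note the line break inside the first phrase.
sample = "The FCC\nregulations apply.\nSee FCC regulations, part 15."
print(count_phrase(sample, "FCC regulations"))  # prints 2
```

With a real file, something like `count_phrase(extract_text("/path/to/file.pdf"), "FCC")` would do the whole job, where the path is a placeholder. Printing the extracted text first is a quick way to confirm the tool is reading the PDF the way you expect.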

Dave

From: Philip Ershler <email@hidden>
Date: Wednesday, October 28, 2015 at 1:58 PM
To: AppleScript Users <email@hidden>
Cc: "email@hidden" <email@hidden>, David Crowe <email@hidden>
Subject: Re: PDF Search

Why wouldn’t grep or one of its relatives do this for you? If you look at the man page for grep, it looks like it could (as long as it’s not a binary PDF).

     The grep utility searches any given input files, selecting lines that
     match one or more patterns.  By default, a pattern matches an input line
     if the regular expression (RE) in the pattern matches the input line
     without its trailing newline.  An empty expression matches every line.
     Each input line that matches at least one of the patterns is written to
     the standard output.

  -c, --count
             Only a count of selected lines is written to standard output.
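
One thing to keep in mind with `-c`: per the excerpt above it counts selected lines, not individual matches, so a line containing the pattern twice counts once. A small Python sketch of the difference, with made-up sample text and hypothetical function names:

```python
import re


def count_matching_lines(text: str, pattern: str) -> int:
    """Count lines containing at least one match (what grep -c reports)."""
    rx = re.compile(pattern)
    return sum(1 for line in text.splitlines() if rx.search(line))


def count_occurrences(text: str, pattern: str) -> int:
    """Count every non-overlapping match in the whole text."""
    return len(re.findall(pattern, text))


sample = "FCC rules and FCC forms\nno match here\nthe FCC again\n"
print(count_matching_lines(sample, "FCC"))  # prints 2
print(count_occurrences(sample, "FCC"))     # prints 3
```

On the command line, a common way to get the per-occurrence count is to print each match on its own line with `grep -o` and count those lines.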


Also see https://pdfgrep.org

HTH, Phil


On Oct 28, 2015, at 11:49 AM, David Crowe <email@hidden> wrote:

Emmanuel,

Using Smile may be the best solution … but the “all occurrences” flag didn’t seem to work for me. I searched for a simple 3-letter string (“FCC”) in a PDF file that had about 20 occurrences, and it only produced one of them.

It also seemed unable to find more than the first string. If I set “starting at” to the position of the first find plus one, it generated an error indicating that there were no further hits. But there clearly are. And most of the occurrences of the string are in the same case as the search string I provided.

- David


------------------------------

Message: 3
Date: Wed, 28 Oct 2015 13:03:54 +0100
From: Emmanuel LEVY <email@hidden>
To: AppleScript Listserv <email@hidden>
Subject: Re: PDF Search
Message-ID: <email@hidden>
Content-Type: text/plain; charset=us-ascii

In addition to the excellent replies you got, you can use Smile, a free AppleScript environment, as in:

find text [ some string here ] in [ some path to a pdf file here ]

Best,
Emmanuel

The latest version of Smile is available at http://satimage.fr/software/en/downloads/downloads_nextsmile.html



_______________________________________________
Do not post admin requests to the list. They will be ignored.
AppleScript-Users mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
Archives: http://lists.apple.com/archives/applescript-users

This email sent to email@hidden


