Hi Thomas and all,

It is very interesting to see how this thread is going :) and I have learned from it... for later use.

As Shane mentioned, page number extraction is also one of my targets, but as I said, I did not want to go into too much detail, to avoid drifting away from the main subject.

Target:

--- find a regular expression (alphanumeric...) ---> push it to a flat file

--- mark each hit clearly in the PDF.

It seems this is not possible; is that the conclusion?

Many thanks all, Julien

Sent: Tuesday, March 07, 2017 at 8:15 AM
From: "Thomas Fischer" <email@hidden>
To: "Applescript Users List" <email@hidden>
Subject: Re: collectdata

Hello,

some years ago I spent some time evaluating the different options for extracting the text from PDF files. The results varied, depending on how the PDF file was created, be it Acrobat (Distiller), MS Word, pdflatex,…
None of the tested methods really preserved the formatting, but the main problem was the handling of non-ASCII characters, UTF-8 etc. The best method I found in this respect was pdfbox, an open source Apache project. If that doesn’t matter (as in this case, where Julien is looking for numbers), any method will do, e.g. getting it from Preview or Skim, pdftotext or ASObjC.
It would be fairly easy to find all occurrences in the extracted text (e.g. by using BBEdit or TextWrangler) and collect the found values, so the search can be done without using grep.
The problem is to find and *color* the respective hits in the PDF file itself.
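
The text-side half of this (find every hit, push it to a flat file) could be sketched roughly as below. This is only an illustration in Python, not something anyone in the thread posted: the pattern (two uppercase letters followed by six digits), the function names and the output file name are all assumptions, and the input is taken to be text already extracted by one of the tools mentioned above.

```python
import re
from pathlib import Path

# Hypothetical pattern: two uppercase letters followed by six digits,
# e.g. "AB123456". Replace it with the real alphanumeric expression.
PATTERN = re.compile(r"\b[A-Z]{2}\d{6}\b")


def find_hits(text, pattern=PATTERN):
    """Return every non-overlapping match, in the order it appears."""
    return [m.group(0) for m in pattern.finditer(text)]


def write_flat_file(hits, path):
    """Write one hit per line as UTF-8 and return the number of lines."""
    Path(path).write_text("".join(hit + "\n" for hit in hits), encoding="utf-8")
    return len(hits)


# Stand-in for text extracted from a PDF.
sample = "Invoice AB123456 replaces CD000042; see also ab123456 and XY9999."
hits = find_hits(sample)

if __name__ == "__main__":
    # "hits.txt" is a hypothetical output name.
    write_flat_file(hits, "hits.txt")
```

This covers only the flat-file target; it says nothing about where a hit sits on the page, which is the part that remains open.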

> Am 07.03.2017 um 00:05 schrieb Shane Stanley <email@hidden>:
>
> That's true, but could be a good thing in this case. If you extract it this way page-by-page, you can find the point on the page where the text is by getting its index in the returned string, and then calling -characterBoundsAtIndex:. With that information, it should in theory be possible to use ASObjC to add a highlight at that point on the page.
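
The "index in the returned string" idea can be illustrated on plain text. The Python sketch below assumes the pages arrive as one string separated by form feeds (which is how pdftotext marks page ends by default) and uses a hypothetical pattern and function name. It only computes the page number and the offset of each hit within that page's string; it does not touch the PDF.

```python
import re

# Hypothetical pattern, for illustration only.
PATTERN = re.compile(r"\b[A-Z]{2}\d{6}\b")


def locate_hits(text, pattern=PATTERN):
    """Return (page_number, char_index, match) tuples.

    page_number is 1-based; char_index is the 0-based offset of the
    match inside that page's own string.
    """
    located = []
    for page_number, page_text in enumerate(text.split("\f"), start=1):
        for m in pattern.finditer(page_text):
            located.append((page_number, m.start(), m.group(0)))
    return located


# Three "pages" separated by form feeds.
sample = "Order AB123456 shipped.\fNothing here.\fRef: CD000042 and AB123456"
located = locate_hits(sample)
# located == [(1, 6, "AB123456"), (3, 5, "CD000042"), (3, 18, "AB123456")]
```

Two caveats: an offset is only meaningful for -characterBoundsAtIndex: if it was computed on the string that the same PDF page object returned, not on the output of a different extractor. Also, Python counts code points while NSString counts UTF-16 units, so the numbers can differ once characters outside the Basic Multilingual Plane appear.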

This would be interesting, but I can’t see how you get correct character bounds with mangled text. And is there an ASObjC method to edit or annotate PDF files? As I mentioned, Skim might work, but I didn’t get around to testing that.

All the best
Thomas

> Am 06.03.2017 um 21:45 schrieb Christopher Stone <email@hidden>:
>
> On Mar 06, 2017, at 06:41, Yvan KOENIG <email@hidden> wrote:
>> No need for a third-party tool to extract text from a PDF. This script provided by Shane STANLEY does the job.
>
> That is indeed useful – IF you only need the RAW output.
>
> The ASObjC code returns RAW output and the layout of the pdf is significantly mangled.
>
> It's the same as pdftotext's -raw output.
>
> pdftotext -raw "/path/to/your/File.pdf" -
>
> The primary reason to use pdftotext instead of other tools is its ability to preserve the fidelity of the PDF file's layout. (It's the only tool I personally know of that does this.)
>
> It's not perfect, but it can make parsing a PDF document's text relatively easy instead of difficult to impossible.
>
> pdftotext -layout "/path/to/your/File.pdf" -
>
> --
> Best Regards,
> Chris
>
> _______________________________________________
> Do not post admin requests to the list. They will be ignored.
> AppleScript-Users mailing list (email@hidden)
> Help/Unsubscribe/Update your Subscription:
> Archives: http://lists.apple.com/archives/applescript-users
>
> This email sent to email@hidden
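
For anyone who wants to drive the two pdftotext invocations quoted above from a script, a thin wrapper might look like the sketch below. It assumes pdftotext is installed and on the PATH; the function names are made up for this example, and the -enc UTF-8 option is added on the assumption that non-ASCII characters matter.

```python
import subprocess


def pdftotext_command(pdf_path, mode="layout"):
    """Build the argument list for pdftotext in -layout or -raw mode.

    The trailing "-" sends the extracted text to standard output.
    """
    if mode not in ("layout", "raw"):
        raise ValueError("mode must be 'layout' or 'raw'")
    return ["pdftotext", f"-{mode}", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path, mode="layout"):
    """Run pdftotext and return its output as a string.

    Raises subprocess.CalledProcessError if pdftotext exits with an
    error, and FileNotFoundError if pdftotext is not installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path, mode),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("/path/to/your/File.pdf") would then return the -layout text, with pages separated by form feeds unless pdftotext is told otherwise.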
